Qt AI app

This example demonstrates how to use the Qt AI Inference API in your QML app, chaining different models together. Note that the API has not been released yet; it is still in an early phase of development and should be considered a proof of concept.

Installing dependencies

For building and running the app, you will need:

Qt 6 and Qt Creator (the app uses Qt Quick and Qt Multimedia)
ollama, for running the text-to-text and image-to-text models
Python, for the Whisper speech-to-text and Piper text-to-speech servers

Building the app

Clone the source code and initialize the submodule containing the API:

git clone git@git.qt.io:ai-ui/qt-ai-app.git
cd qt-ai-app
git submodule init
git submodule update

After this, open qt-ai-app/CMakeLists.txt in Qt Creator and build the project.
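If you prefer building from the command line instead of Qt Creator, a plain CMake build along these lines should also work, assuming your Qt installation's qt-cmake wrapper is on the PATH (paths and build directory names here are only illustrative):

qt-cmake -S . -B build
cmake --build build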

Running the app

Before running the app, we need to install the proper models and set up simple Python servers for our requests to Whisper and Piper.

First, let's install the models from ollama. By default, this example uses deepseek-r1 as the text-to-text model and llava-phi3 for image-to-text. You can also change these if you want to test out different models; just remember to install them as well (see the example below).

    ollama pull deepseek-r1
    ollama pull llava-phi3
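For example, to try out a different text-to-text model (llama3.2 here is only an illustration, not something the app uses by default), pull it the same way and later point the corresponding MultiModal's model property at it in AiComponents.qml:

    ollama pull llama3.2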

After that, run the Python script under qt-ai-inference-api/aimodel/asr_server to set up the simple server for Whisper:

python asr_server.py

Then run the script under qt-ai-inference-api/aimodel/tts_server to set up the text-to-speech (TTS) server for Piper:

python piper_server.py

After this, build and run the app with Qt Creator!

App walkthrough

In the app, most of the interesting things happen in AIPrompt.ui.qml. First, we set up the UI elements: a WidgetHeader that acts as the header for the app window, a ChatView to display the messages to and from the AI model, and an AIPromptContainer that contains the controls for sending input to the model. Lastly, we also add a component called AiComponents, which contains the AI functionality - more on that soon!

WidgetHeader {
    id: widgetHeader
    ...
}

ChatView {
    id: conversationContainer
    ...
}

AIPromptContainer {
    id: aIPromptContainer
    ...
}
...

AiComponents {
    anchors.fill: parent
    promptContainer: aIPromptContainer
    chatView: conversationContainer
    widgetHeader: widgetHeader
}

The ChatView is just a ListView with a custom ChatMessageModel, implemented in ChatMessageModel.h and ChatMessageModel.cpp. The model has three roles: the sender (user/AI), the text content, and a URL to an image. Similarly, the AIPromptContainer and WidgetHeader are pretty straightforward, implementing things like a text field and a few buttons for sending different types of input to the model, and, in the case of the header, a menu button.
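As a rough sketch of how a delegate can consume those three roles - the role names sender, text, and imageUrl below are assumptions for illustration, not necessarily the names used in ChatMessageModel.cpp - the ChatView's ListView could look roughly like this:

ListView {
    id: chatList
    model: chatMessageModel // the custom ChatMessageModel exposed from C++
    delegate: Column {
        // "sender", "text" and "imageUrl" are assumed role names
        Text { text: model.sender + ": " + model.text }
        Image {
            source: model.imageUrl
            // only show the image row when a URL was actually set
            visible: model.imageUrl.toString() !== ""
        }
    }
}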

Now, let's have a look inside AiComponents.qml, where we get to the core of this example: the MultiModals from the new Qt AI Inference API. We set up four of them, with different input and output types. First, we have the speech-to-text model:

MultiModal {
    id: speechToText
    type: (MultiModal.InputAudio | MultiModal.OutputText)
    optional: true
    model: "turbo"
    onGotResult: (result) => {
                    conversationContainer.addMsg("user", result, "")
                 }
}

We set the type flags to MultiModal.InputAudio and MultiModal.OutputText to indicate the expected input and output types, respectively. We also set the "optional" property to true to indicate that if this model is set as an input for another model, it should be treated as an optional input, meaning the other model shouldn't wait for its output before processing other inputs. We also define the model we want the Whisper backend to use, and add a signal handler for the gotResult() signal that adds the result to the ChatView with the "user" role.

Next, we set up the image-to-text model. Here, we also use the "buffered" property to indicate that the model should buffer its latest result for later use. Additionally, we use the "prompt" property to set a prompt that is sent along with each picture provided to the model, telling it to describe the contents of the received image.

MultiModal {
    id: imageToText
    type: (MultiModal.InputImage | MultiModal.OutputText)
    optional: true
    buffered: true
    model: "llava-phi3"
    prompt: "What is in the picture?"
}

Then it's time to set up the text-to-text LLM, which is the one providing the output for the ChatView! We again set the corresponding input and output types, and the model we want to use. Both the text-to-text and image-to-text models use ollama as their backend, and the model property defines which model ollama should run. Next, we define the inputs for this model - the speech-to-text and image-to-text models we added earlier! This chains the models together so that when either of the earlier models produces a result, it's passed on to this model, letting you effortlessly combine different types of inputs and models into a single pipeline. Finally, when this model produces a result, it's posted into the ChatView as a response from the AI.

MultiModal {
    id: llamaModel
    type: (MultiModal.InputText | MultiModal.OutputText)
    model: "deepseek-r1"
    inputs: [ imageToText, speechToText ]

    onGotResult: (result) => {
                    aIprompt.addMsg("AI", result, "")
                 }
}

The last piece of the pipeline is a text-to-speech model, which takes the text-to-text model as its input unless disabled in the settings, so the replies from the AI are also read aloud. To make this optional, we add llamaModel as its input only if text-to-speech has been enabled in the settings.

MultiModal {
    id: text2speech
    type: (MultiModal.InputText | MultiModal.OutputAudio)
    inputs: settingsView.text2SpeechOn ? [ llamaModel ] : []
}
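A minimal sketch of what the settings side of that binding could look like, assuming a hypothetical settings view that exposes text2SpeechOn through a Switch (the actual settings UI in the app may be implemented differently):

import QtQuick
import QtQuick.Controls

Item {
    id: settingsView
    // AiComponents.qml reads this property to decide whether llamaModel's
    // output should be fed into the text-to-speech model
    property alias text2SpeechOn: ttsSwitch.checked

    Switch {
        id: ttsSwitch
        text: "Read replies aloud"
        checked: true
    }
}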

The last thing to do is to set up a few elements that help pass the inputs to the models. We use Qt Multimedia to set up a recorder for the audio, connect it to the AIPromptContainer's voice button, and push the recorded audio file as input to the speech-to-text model.

CaptureSession {
    audioInput: AudioInput {}
    recorder: MediaRecorder {
        id: recorder
        mediaFormat {
            fileFormat: MediaFormat.Wave
        }
    }
}

Connections {
    target: promptContainer
    ...

    onVoiceButtonClicked: {
        recorder.record()
    }

    onVoiceButtonReleased: {
        recorder.stop()
        if (recorder.actualLocation !== "") {
            speechToText.pushDataFromFile(recorder.actualLocation)
        }
    }
}

Similarly, we add a FileDialog for browsing images, which pushes the chosen image to the image-to-text model as well as to the ChatView model.

FileDialog {
    id: fileDialog
    folder: StandardPaths.standardLocations(StandardPaths.PicturesLocation)[0]
    nameFilters: ["*.*"]
    onAccepted: {
        imageToText.pushDataFromFile(fileDialog.file)
        addMsg("user", "" , fileDialog.file)
    }
}

Connections {
    target: promptContainer

    ...

    onAttachButtonClicked: {
        fileDialog.open()
    }
}

And that's it! Lastly, if you want to change the models used by the MultiModals at runtime, click the menu icon in the top right corner to open a simple settings view where you can edit them.
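As a rough sketch, such a runtime change can be wired up with a plain property binding; here textModelName is a hypothetical string property exposed by the settings view, used only for illustration:

MultiModal {
    id: llamaModel
    type: (MultiModal.InputText | MultiModal.OutputText)
    // hypothetical: bind the model name to a value edited in the settings view,
    // so changing it there switches the model ollama runs
    model: settingsView.textModelName
    inputs: [ imageToText, speechToText ]
}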