Bunny L1 Mini App Overview (AI Vision Example)
This Mini-App is a simple example of an AI Vision application that can be used to ask questions about what's in front of the camera.
Features
- Describe what's in front of you: By tapping the shutter button, the app will take a picture and send it to the backend to describe what's in the image.
- Describe what's in the picture: Select a picture from your photo library to do the same as above.
- Ability to flip the camera: Tap the flip camera button to switch between the front and back camera.
- Ask questions about what's in front of you: By holding the shutter button, you can ask specific questions about what's in front of you.
- Responds out loud: The mini-app will speak out loud whenever it has a response to your question.
- AnalyticsKit Integration: If AnalyticsKit is selected during project generation, events and errors related to users interacting with the mini-app will automatically be tracked with `aikit` as the source (both client-side and server-side).
Demo
Here's a demo of the Bunny L1 Mini-App in action:
Implementation
Overview
Objects and Their Responsibilities
- AIVisionExampleViewModel: This view model handles interaction with the user interface and holds the functions that are called on state changes (e.g. `detectedCapturedCameraImageUpdate`).
- CameraViewModel: This view model is responsible for handling user interactions related to the camera and passes them down to the `CameraManager` (e.g. `flipCamera`).
- CameraManager & CameraDelegate: These objects are responsible for setting up the camera, taking pictures, and calling the system-level camera APIs.
- CameraView: This view is responsible for showing a preview feed of the camera.
- VoiceRecordingViewModel: This view model is responsible for handling user interactions related to voice recording (e.g. `shouldStartRecording`).
- VoiceRecordingManager: This manager is directly responsible for recording the voice input and calling the `VoiceTranscriptionModel` to transcribe the audio.
- VoiceTranscriptionModel: This model is responsible for transcribing the audio using the bundled Whisper model.
- GeneralAudioModel: This model is responsible for playing audio files.
- DB: This object is responsible for handling communication with the backend and authentication (`FirebaseBackend.swift`), as well as interacting with BackendKit functions (`BackendFunctions.swift`).
Why so many? Because object-oriented programming and ✨abstractions✨.
Showing a Camera Feed & Getting Camera Permissions
We show a camera feed using a custom `AVCaptureVideoPreviewLayer` wrapper called `CameraView`. We attach a `.requireCapabilityPermission()` modifier of type `.cameraAccess` to it to show a permission dialog if the user hasn't granted us permission to use the camera. The permission state is stored in the `gotCameraAccess` boolean of `AIVisionExampleViewModel`.
Read more about requesting camera permission in Request Permission to access the Camera, Microphone, Location, etc.
```swift
// ...
struct AIVisionExampleView: View {
    // ...
    @StateObject private var vm = AIVisionExampleViewModel()
    @StateObject private var cameraViewModel = CameraViewModel()
    @State private var orientation = UIDevice.current.orientation
    // ...
    var body: some View {
        VStack {
            CameraView(session: cameraViewModel.session, orientation: $orientation)
                // ...
                .requireCapabilityPermission(
                    of: .cameraAccess,
                    onSuccess: {
                        vm.gotCameraPermissions(cameraVM: cameraViewModel)
                    },
                    onCancel: { /* close mini app */ }
                )
            // ...
            HStack {
                // ... photos picker
                Button {
                    // ... shutter button let go
                } label: {
                    // ...
                }
                .buttonStyle(ShutterButton())
                // Disable the button if we are currently processing or if we don't have camera access
                .disabled(vm.processing || !vm.gotCameraAccess)
                // ... flip camera button
            }
        }
    }
}
```
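The internals of `CameraView` aren't shown in this section. Conceptually, such a wrapper is usually a `UIViewRepresentable` whose backing `UIView` hosts an `AVCaptureVideoPreviewLayer` attached to the capture session. Below is a minimal sketch under that assumption; the name `CameraPreviewView` and the omission of the `orientation` binding are illustrative, not the actual implementation:

```swift
import SwiftUI
import UIKit
import AVFoundation

// Hypothetical sketch of an AVCaptureVideoPreviewLayer wrapper.
// The real CameraView also takes an orientation binding and may differ in other ways.
struct CameraPreviewView: UIViewRepresentable {
    let session: AVCaptureSession

    // A UIView whose backing layer is an AVCaptureVideoPreviewLayer
    final class PreviewUIView: UIView {
        override class var layerClass: AnyClass { AVCaptureVideoPreviewLayer.self }
        var previewLayer: AVCaptureVideoPreviewLayer { layer as! AVCaptureVideoPreviewLayer }
    }

    func makeUIView(context: Context) -> PreviewUIView {
        let view = PreviewUIView()
        view.previewLayer.session = session              // attach the camera session
        view.previewLayer.videoGravity = .resizeAspectFill
        return view
    }

    func updateUIView(_ uiView: PreviewUIView, context: Context) {
        uiView.previewLayer.session = session
    }
}
```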
Getting a Picture and a Prompt
There are multiple ways to use this mini-app:
- By taking a picture with the camera and using a hard-coded "What's in the picture?" prompt.
- By selecting a picture from the photo library and using a hard-coded "What's in the picture?" prompt.
- By holding down the shutter button and asking a question about what's in front of you.
Taking a Picture and Describing it
When the user presses the shutter button, we tell that to the view model (`AIVisionExampleViewModel`), which calls the `captureImage()` method of `CameraViewModel`, which in turn tells the `CameraManager` to take the picture.
As soon as the picture is taken (the `CameraDelegate` sets the `capturedImage` observed value to the image), the `detectedCapturedCameraImageUpdate()` method of the `AIVisionExampleViewModel` is called, which then sends the image to the backend with a prompt to describe it.
```swift
// ...
struct AIVisionExampleView: View {
    // ...
    @StateObject private var vm = AIVisionExampleViewModel()
    @StateObject private var cameraViewModel = CameraViewModel()
    @StateObject private var voiceRecordingViewModel = VoiceRecordingViewModel()
    // ...
    var body: some View {
        VStack {
            // ... camera feed
            HStack {
                // ... photos picker
                Button {
                    vm.releasedShutterButton(cameraVM: cameraViewModel, voiceRecordingVM: voiceRecordingViewModel)
                } label: {
                    // ...
                }
                .buttonStyle(ShutterButton())
                // ...
                // ... flip camera button
            }
        }
        .onChange(of: cameraViewModel.capturedImage) {
            vm.detectedCapturedCameraImageUpdate(
                db: db,
                cameraVM: cameraViewModel,
                voiceRecordingVM: voiceRecordingViewModel
            )
        }
    }
}
```
Once we receive a response from the backend, we play the audio that describes what's in the picture. See the backend code below.
```swift
// ...
class AIVisionExampleViewModel: ObservableObject {
    // ...
    func releasedShutterButton(cameraVM: CameraViewModel, voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
        cameraVM.captureImage()
        // ...
    }

    @MainActor func detectedCapturedCameraImageUpdate(db: DB, cameraVM: CameraViewModel, voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
        guard let capturedImageBase64 = capturedImg.base64 else { /* ... */ return }
        // ...
        Task { // async backend call wrapped in a Task so this can be called synchronously from .onChange
            if let result = await db.processImageWithAI(
                jpegImageBase64: capturedImageBase64,
                processingCommand: voiceRecordingVM.currentAudioTranscription ?? "Describe exactly what you see.",
                readResultOutLoud: true // hard-coded to always read out loud
            ) {
                // If we got audio from the server (we should, since `readResultOutLoud` is true),
                // read it out loud. Otherwise show the result as text in an in-app notification.
                if let audio = result.audio {
                    voiceRecordingVM.generalAudioModel.playAudio(base64Source: audio)
                } else {
                    showInAppNotification(/* show in-app notification if not reading out loud */)
                }
            } else {
                // ... error handling
            }
        }
        // ...
    }
    // ...
}
```
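The `CameraManager`/`CameraDelegate` side of this flow isn't shown here. The following is a rough illustration of how the delegate ends up setting `capturedImage`; it is simplified and folds the manager and delegate roles into one object, so names and details are assumptions, not the actual code:

```swift
import Combine
import UIKit
import AVFoundation

// Simplified illustration of the capture flow (not the actual CameraManager/CameraDelegate code).
class CameraViewModel: ObservableObject {
    @Published var capturedImage: UIImage?   // observed by the view's .onChange
    let session = AVCaptureSession()
    private lazy var manager = CameraManager(session: session)

    func captureImage() {
        manager.capturePhoto { [weak self] image in
            DispatchQueue.main.async { self?.capturedImage = image }
        }
    }
}

class CameraManager: NSObject, AVCapturePhotoCaptureDelegate {
    private let output = AVCapturePhotoOutput()
    private var completion: ((UIImage?) -> Void)?

    init(session: AVCaptureSession) {
        super.init()
        // ... configure the session's inputs and add `output` as a photo output
    }

    func capturePhoto(completion: @escaping (UIImage?) -> Void) {
        self.completion = completion
        output.capturePhoto(with: AVCapturePhotoSettings(), delegate: self)
    }

    // Delegate callback: turn the captured photo into a UIImage
    func photoOutput(_ output: AVCapturePhotoOutput, didFinishProcessingPhoto photo: AVCapturePhoto, error: Error?) {
        guard error == nil, let data = photo.fileDataRepresentation() else {
            completion?(nil)
            return
        }
        completion?(UIImage(data: data))
    }
}
```

The key point is that the delegate callback publishes the image back on the main thread, which is exactly what the view's `.onChange(of: cameraViewModel.capturedImage)` listens for.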
Picking an Image from the Photo Library and Describing it
We also support the ability to pick an individual picture from user's photo library and describe it.
For this, we use PhotoUI's PhotosPicker
(opens in a new tab). After the user
selectes an image, we save it in the selectedImagePickerItem
variable of the AIVisionExampleViewModel
. When that value changes,
photoSelectedFromLibrary()
function is called, which converts the image into a UIImage and sets it to the capturedImage
variable (which in
the previous example was set after pressing the shutter button and image capture). The rest of the process is the same as above.
```swift
// ...
struct AIVisionExampleView: View {
    // ...
    @StateObject private var vm = AIVisionExampleViewModel()
    @StateObject private var cameraViewModel = CameraViewModel()
    @StateObject private var voiceRecordingViewModel = VoiceRecordingViewModel()
    // ...
    var body: some View {
        VStack {
            // ... camera feed
            HStack {
                PhotosPicker(
                    selection: $vm.selectedImagePickerItem,
                    matching: .images,
                    preferredItemEncoding: .compatible
                ) {
                    Image(systemName: "photo.stack")
                    // ...
                }
                .onChange(of: vm.selectedImagePickerItem) {
                    vm.photoSelectedFromLibrary(cameraVM: cameraViewModel)
                }
                // ... shutter button
                // ... flip camera button
            }
        }
        .onChange(of: cameraViewModel.capturedImage) {
            vm.detectedCapturedCameraImageUpdate(
                db: db,
                cameraVM: cameraViewModel,
                voiceRecordingVM: voiceRecordingViewModel
            )
        }
    }
}
```
```swift
// ...
class AIVisionExampleViewModel: ObservableObject {
    // ...
    @Published var selectedImagePickerItem: PhotosPickerItem? = nil
    // ...
    func photoSelectedFromLibrary(cameraVM: CameraViewModel) {
        // ...
        Task { // loading the picked item is async, so wrap it in a Task
            if let data = try? await selectedImagePickerItem?.loadTransferable(type: Data.self) {
                // ...
                await MainActor.run {
                    cameraVM.capturedImage = UIImage(data: data)
                }
                // ...
            }
        }
    }
    // ...
}
```
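Both paths end with the captured `UIImage` being converted to a base64-encoded JPEG before it is sent to the backend (the `base64` property used in `detectedCapturedCameraImageUpdate()` above). That helper isn't shown in this section; a minimal sketch, assuming it's a simple `UIImage` extension (the compression quality is an arbitrary choice here):

```swift
import UIKit

// Hypothetical helper behind the `base64` property used above;
// the actual project may implement this differently.
extension UIImage {
    /// JPEG representation of the image, base64-encoded for transport to the backend.
    var base64: String? {
        jpegData(compressionQuality: 0.8)?.base64EncodedString()
    }
}
```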
With a Custom Image Prompt
The Bunny L1 example also includes the ability to ask custom questions about what is in front of you.
This is done by holding down the shutter button, which starts the voice recording. When the user releases the button, the voice recording stops and the audio is transcribed by the `VoiceTranscriptionModel` via a local Whisper model.
We use the `isCurrentlyRecording` variable of the `VoiceRecordingViewModel` to detect whether releasing the shutter button should just capture an image or start processing.
Holding down the shutter button sets the `isCurrentlyRecording` variable to `true` and starts the recording. So when the button is let go, we check: is the user currently recording? If yes, we stop the recording and take a picture. If there is no recording going on (meaning the user didn't hold the button down), we just take a picture.
The recording and transcription are done analogously to the AI Translator Example. After the recording is finished and its transcription is saved to the `currentAudioTranscription` variable of the `VoiceRecordingViewModel`, the `detectedAudioTranscriptionUpdate()` function is called, which captures the image, which in turn triggers `detectedCapturedCameraImageUpdate()`.
The server request is then made the same way as in the previous examples, except that the prompt is set to `currentAudioTranscription` instead of the default "Describe exactly what you see."
```swift
// ...
struct AIVisionExampleView: View {
    // ...
    @StateObject private var vm = AIVisionExampleViewModel()
    @StateObject private var cameraViewModel = CameraViewModel()
    @StateObject private var voiceRecordingViewModel = VoiceRecordingViewModel()
    // ...
    var body: some View {
        VStack {
            // ... camera feed
            HStack {
                // ... photos picker
                Button {
                    // long press over
                    vm.releasedShutterButton(cameraVM: cameraViewModel, voiceRecordingVM: voiceRecordingViewModel)
                } label: {
                    // ...
                }
                .buttonStyle(ShutterButton())
                .simultaneousGesture( // gesture to detect a long press
                    LongPressGesture(minimumDuration: 0.2).onEnded { _ in
                        askUserFor(.microphoneAccess) {
                            voiceRecordingViewModel.shouldStartRecording()
                        } onDismiss: {
                            showInAppNotification( /* ... no mic access warning */ )
                        }
                    }
                )
                // ...
                // ... flip camera button
            }
        }
        .onChange(of: cameraViewModel.capturedImage) {
            vm.detectedCapturedCameraImageUpdate(
                db: db,
                cameraVM: cameraViewModel,
                voiceRecordingVM: voiceRecordingViewModel
            )
        }
        .onChange(of: voiceRecordingViewModel.currentAudioTranscription) {
            vm.detectedAudioTranscriptionUpdate(
                cameraVM: cameraViewModel,
                voiceRecordingVM: voiceRecordingViewModel
            )
        }
    }
}
```
```swift
class VoiceRecordingViewModel: ObservableObject {
    // ...
    @Published public private(set) var isCurrentlyRecording: Bool = false
    @Published var currentAudioTranscription: String? = nil
    @ObservedObject var voiceRecordingManager = VoiceRecordingManager()
    // ...
    func shouldStartRecording() {
        // ...
        // ... calls to the voice recording manager to start recording
        isCurrentlyRecording = true
        // ...
    }

    func shouldStopRecording() {
        // ...
        // ... calls to the voice recording manager to stop recording and transcribe the audio
        isCurrentlyRecording = false
        // ...
    }
    // ...
}
```
```swift
class AIVisionExampleViewModel: ObservableObject {
    // called when the shutter button is released in the AIVisionExampleView
    func releasedShutterButton(cameraVM: CameraViewModel, voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
        // are we currently recording? (holding down)
        if voiceRecordingVM.isCurrentlyRecording { // yes, we are recording
            // ...
            voiceRecordingVM.shouldStopRecording()
            // ...
        } else { // no, it was just a shutter button tap
            cameraVM.captureImage()
        }
    }

    // called by the .onChange in the AIVisionExampleView
    @MainActor func detectedCapturedCameraImageUpdate(db: DB, cameraVM: CameraViewModel, voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
        Task { // async backend call wrapped in a Task
            if let result = await db.processImageWithAI(
                jpegImageBase64: capturedImageBase64,
                processingCommand: voiceRecordingVM.currentAudioTranscription ?? "Describe exactly what you see.",
                readResultOutLoud: true
            ) {
                // ...
                voiceRecordingVM.generalAudioModel.playAudio(base64Source: audio)
                // ...
            } else {
                // ... error handling
            }
        }
        // ...
    }

    // called by the .onChange in the AIVisionExampleView
    @MainActor func detectedAudioTranscriptionUpdate(
        cameraVM: CameraViewModel,
        voiceRecordingVM: VoiceRecordingViewModel
    ) {
        // ...
        cameraVM.captureImage()
    }
}
```
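The audio response is handed to `GeneralAudioModel.playAudio(base64Source:)`, whose implementation isn't shown in this section. A minimal sketch of what such base64 MP3 playback could look like, assuming a plain `AVAudioPlayer`-based implementation (the real `GeneralAudioModel` may differ):

```swift
import AVFoundation

// Hypothetical sketch of base64 MP3 playback; the real GeneralAudioModel may differ.
class GeneralAudioModel {
    private var player: AVAudioPlayer?

    func playAudio(base64Source: String) {
        // Decode the base64 string returned by the backend into raw MP3 data
        guard let data = Data(base64Encoded: base64Source) else { return }
        do {
            player = try AVAudioPlayer(data: data)
            player?.play()
        } catch {
            print("Failed to play audio: \(error)")
        }
    }
}
```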
Interaction with the Backend
We access the backend via a dedicated function, `processImageWithAI()`, that extends the `DB` object in `BackendFunctions.swift`. It sends the image and the prompt to the backend and returns the vision result.
```swift
// front-end
extension DB {
    public struct ImageAnalysisResultType: Codable {
        public let message: String // ai response to the image
        public let audio: String?  // base64 mp3 audio of the message above (if read out loud)

        // init using raw data from the endpoint
        public init(from rawData: [String: Any]) throws { /* ... */ }
    }

    // jpegImageBase64: base64 of the image that should be processed
    // processingCommand: the ai prompt to process the image with
    // readResultOutLoud: should the result be read out loud? if yes, an mp3 audio will be returned in addition to the textual response
    public func processImageWithAI(
        jpegImageBase64: String,
        processingCommand: String,
        readResultOutLoud: Bool
    ) async -> ImageAnalysisResultType? { /* ... */ }
}
```
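How `processImageWithAI()` reaches the backend isn't shown here. Since the `DB` object wraps the Firebase backend, a plausible sketch is a call to the `analyzeImageContents` callable function shown below via the Firebase Functions SDK; the payload keys match what the endpoint reads, but treat the rest of the wiring as an assumption rather than the actual `BackendFunctions.swift` code:

```swift
import FirebaseFunctions

// Hypothetical sketch of the body of processImageWithAI() from the block above;
// not the actual BackendFunctions.swift implementation.
extension DB {
    public func processImageWithAI(
        jpegImageBase64: String,
        processingCommand: String,
        readResultOutLoud: Bool
    ) async -> ImageAnalysisResultType? {
        // Field names match what the analyzeImageContents endpoint reads from request.data
        let payload: [String: Any] = [
            "imageBase64": jpegImageBase64,
            "processingCommand": processingCommand,
            "readOutLoud": readResultOutLoud,
        ]
        do {
            let response = try await Functions.functions()
                .httpsCallable("analyzeImageContents")
                .call(payload)
            guard let rawData = response.data as? [String: Any] else { return nil }
            return try ImageAnalysisResultType(from: rawData)
        } catch {
            return nil // ... error handling
        }
    }
}
```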
The backend endpoint returns a result of type `ImageAnalysisResultType`:
```typescript
type ImageAnalysisResultType = {
  message: string;
  audio: string | null;
};

// back-end
import * as AI from "./AIKit/AI";
// ...
export const analyzeImageContents = onCall(async (request) => {
  let fromUid = request.auth?.uid;
  // ...
  const imageBase64 = request.data?.imageBase64 as string | null;
  const processingCommand = request.data?.processingCommand as string | null;
  const readOutLoud = (request.data?.readOutLoud as boolean) ?? false;
  // ...
  const imageAnalysisResult = await AI.accessGPTVision(
    imageBase64,
    `${
      readOutLoud // prompt prefix for the AI to be better at talking out loud
        ? "You have to answer user's command as if you were speaking to a human. The following is a question/prompt and you must answer it as naturally as possible, but dont yap for too long. Be only verbose when asked directly or necessary in this context: "
        : ""
    }${processingCommand}`
  );
  // ...
  if (!readOutLoud) {
    const result: ImageAnalysisResultType = {
      message: imageAnalysisResult,
      audio: null,
    };
    // ...
    return result;
  }
  const audioBufferResult = await AI.convertTextToMp3Base64Audio(
    imageAnalysisResult
  );
  // ...
  const result: ImageAnalysisResultType = {
    message: imageAnalysisResult,
    audio: audioBufferResult.toString("base64"),
  };
  // ...
  return result;
  // ...
});
```
The endpoint above uses two pre-made AIKit functions, `accessGPTVision()` and `convertTextToMp3Base64Audio()`, to analyze an image with the provided prompt and convert the result to audio.
`accessGPTVision()` sends an image and a prompt to the GPT-4o vision model and returns its response. `convertTextToMp3Base64Audio()` takes text as input, converts it to audio using OpenAI's TTS (Text-to-Speech) model, and returns the MP3 audio as a Buffer, which the endpoint then encodes as a base64 string.
```typescript
// back-end
import OpenAI from "openai";
import { openAIApiKey } from "../config";

const openai = new OpenAI({
  apiKey: openAIApiKey,
});

export async function accessGPTVision(
  imageBase64: string,
  imageProcessingCommand: string
): Promise<string | null> {
  const response = await openai.chat.completions.create({
    // ... chat completions with vision
    // https://platform.openai.com/docs/guides/vision
  });
  return response.choices[0].message.content;
  // ...
}

export async function convertTextToMp3Base64Audio(
  text: string
): Promise<Buffer | null> {
  // ...
  const mp3 = await openai.audio.speech.create({
    /* ... */
  });
  // ...
  const buffer = Buffer.from(await mp3.arrayBuffer());
  return buffer;
  // ...
}
```