AI Vision Mini-App

Bunny L1 Mini App Overview (AI Vision Example)

AIKit (SwiftyLaunch Module) - AI Vision Example

This Mini-App is a simple example of an AI Vision application that can be used to ask questions about what's in front of the camera.

Features

  • Describe what's in front of you: By tapping the shutter button, the app will take a picture and send it to the backend to describe what's in the image.
  • Describe what's in the picture: Select a picture from your photo library to do the same as above.
  • Ability to flip the camera: Press the flip camera button to switch between the front and back camera.
  • Ask questions about what's in front of you: By holding the shutter button, you can ask specific questions about what's in front of you.
  • Responds out loud: The mini-app will speak out loud whenever it has a response to your question.
  • AnalyticsKit Integration: If AnalyticsKit is selected during project generation, events and errors related to the user's interaction with the mini-app will automatically be tracked with aikit as the source (both client-side and server-side).

Demo

Here's a demo of the Bunny L1 Mini-App in action:

Implementation

Overview

AI Vision Example Architecture Overview

Objects and Their Responsibilities

  • AIVisionExampleViewModel: This view model handles interaction with the user interface and holds the functions that are called on state changes (e.g. detectedCapturedCameraImageUpdate).
  • CameraViewModel: This view model is responsible for handling user interactions related to the camera and passing them down to the CameraManager (e.g. flipCamera), as sketched below.
  • CameraManager & CameraDelegate: These objects are responsible for setting up the camera, taking pictures, and calling the system-level camera APIs.
  • CameraView: This view is responsible for showing a live preview feed of the camera.
  • VoiceRecordingViewModel: This view model is responsible for handling user interactions related to voice recording (e.g. shouldStartRecording).
  • VoiceRecordingManager: This manager is directly responsible for handling the recording of the voice input and calling the VoiceTranscriptionModel to transcribe the audio.
  • VoiceTranscriptionModel: This model is responsible for transcribing the audio using the bundled Whisper model.
  • GeneralAudioModel: This model is responsible for playing audio files.
  • DB: This object is responsible for handling the communication with the backend and authentication (FirebaseBackend.swift), as well as interacting with BackendKit functions (BackendFunctions.swift).
💡 Why so many? Because object-oriented programming and ✨abstractions✨.
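
To make the wiring more concrete, here is a minimal sketch of how a view model can sit between the SwiftUI view and a camera manager. The SketchCameraManager and SketchCameraViewModel types below are simplified, hypothetical stand-ins, not the template's actual CameraManager and CameraViewModel, which handle far more (permissions, orientation, session configuration).

import SwiftUI
import UIKit
import AVFoundation

// Hypothetical, simplified stand-in for the template's CameraManager.
final class SketchCameraManager {
    let session = AVCaptureSession()          // configured elsewhere
    var onImageCaptured: ((UIImage) -> Void)? // set by the view model

    func capturePhoto() { /* AVCapturePhotoOutput capture goes here */ }
    func switchCameraPosition() { /* swap the front/back camera input */ }
}

// Hypothetical, simplified stand-in for the template's CameraViewModel:
// it forwards user intents to the manager and republishes the captured
// image so the SwiftUI view can react to it via .onChange.
final class SketchCameraViewModel: ObservableObject {
    @Published var capturedImage: UIImage? = nil
    private let manager = SketchCameraManager()

    var session: AVCaptureSession { manager.session }

    init() {
        // Bubble the captured photo up to the published property.
        manager.onImageCaptured = { [weak self] image in
            self?.capturedImage = image
        }
    }

    func captureImage() { manager.capturePhoto() }        // shutter tap
    func flipCamera() { manager.switchCameraPosition() }  // flip button
}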

Showing a Camera Feed & Getting Camera Permissions

We show the camera feed using a custom AVCaptureVideoPreviewLayer wrapper called CameraView. We attach a .requireCapabilityPermission() modifier of type .cameraAccess to it to show a permission dialog if the user hasn't granted us permission to use the camera. The permission state is stored in the gotCameraAccess boolean value of the AIVisionExampleViewModel.

Read more about requesting camera permission in Request Permission to access the Camera, Microphone, Location, etc.
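
The CameraView implementation itself isn't shown in this excerpt. As a rough illustration of what an AVCaptureVideoPreviewLayer wrapper can look like, here is a minimal, hypothetical UIViewRepresentable; the template's CameraView additionally takes an orientation binding and handles rotation.

import SwiftUI
import AVFoundation

// Simplified sketch of an AVCaptureVideoPreviewLayer wrapper (hypothetical,
// not the template's CameraView). It only renders the session's video feed.
struct CameraPreviewSketch: UIViewRepresentable {
    let session: AVCaptureSession

    // A plain UIView whose backing layer is an AVCaptureVideoPreviewLayer.
    final class PreviewUIView: UIView {
        override class var layerClass: AnyClass { AVCaptureVideoPreviewLayer.self }
        var previewLayer: AVCaptureVideoPreviewLayer { layer as! AVCaptureVideoPreviewLayer }
    }

    func makeUIView(context: Context) -> PreviewUIView {
        let view = PreviewUIView()
        view.previewLayer.session = session
        view.previewLayer.videoGravity = .resizeAspectFill // fill the available frame
        return view
    }

    func updateUIView(_ uiView: PreviewUIView, context: Context) {
        // Nothing to update in this minimal sketch.
    }
}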

Requesting Camera Permission

AIVisionExampleView.swift
// ...
struct AIVisionExampleView: View {
 
    // ...
    @StateObject private var vm                 = AIVisionExampleViewModel()
    @StateObject private var cameraViewModel    = CameraViewModel()
    @State private var orientation              = UIDevice.current.orientation
    // ...
 
    var body: some View {
        VStack {
            CameraView(session: cameraViewModel.session, orientation: $orientation)
			// ...
			.requireCapabilityPermission(
				of: .cameraAccess,
				onSuccess: {
					vm.gotCameraPermissions(cameraVM: cameraViewModel)
				},
				onCancel: /* close mini app */
			)
            // ...
            HStack {
                // ... photos picker
 
                Button {
                    // ... shutter button let go
				} label: {
                    // ...
                }
				.buttonStyle(ShutterButton())
				// Disable the button if we are currently processing or if we don't have camera access
				.disabled(vm.processing || !vm.gotCameraAccess)
 
                // ... flip camera button
            }
        }
    }
}

Getting a Picture and a Prompt

There are multiple ways to use this mini-app:

  • By taking a picture with the camera and using a hard-coded "What's in the picture?" prompt.
  • By selecting a picture from the photo library and using a hard-coded "What's in the picture?" prompt.
  • By holding down the shutter button and asking a question about what's in front of you.

Taking a Picture and Describing it

When the user presses the shutter button, we tell that to the view model (AIVisionExampleViewModel), which calls the captureImage() method of the CameraViewModel, which then tells the CameraManager to take the picture.

As soon as the picture is taken (the CameraDelegate sets the capturedImage observed value to the image), the detectedCapturedCameraImageUpdate() method of the AIVisionExampleViewModel is called, which then sends the image to the backend with a prompt to describe it.

AIVisionExampleView.swift
// ...
struct AIVisionExampleView: View {
 
    // ...
    @StateObject private var vm                      = AIVisionExampleViewModel()
    @StateObject private var cameraViewModel         = CameraViewModel()
    @StateObject private var voiceRecordingViewModel = VoiceRecordingViewModel()
    // ...
 
    var body: some View {
        VStack {
            // ... camera feed
 
            HStack {
                // ... photos picker
 
                Button {
                    vm.releasedShutterButton(cameraVM: cameraViewModel, voiceRecordingVM: voiceRecordingViewModel)
				} label: {
                    // ...
                }
				.buttonStyle(ShutterButton())
                // ...
 
                // ... flip camera button
            }
        }
        .onChange(of: cameraViewModel.capturedImage) {
			vm.detectedCapturedCameraImageUpdate(
				db: db,
                cameraVM: cameraViewModel,
                voiceRecordingVM: voiceRecordingViewModel
            )
		}
 
    }
}

Once we receive a response from the backend, we play the audio that describes what's in the picture. See the backend code below.

AIVisionExampleViewModel.swift
// ...
class AIVisionExampleViewModel: ObservableObject {
    // ...
	func releasedShutterButton(cameraVM: CameraViewModel, voiceRecordingVM: VoiceRecordingViewModel) {
		// ...
        cameraVM.captureImage()
        // ...
	}
 
	@MainActor func detectedCapturedCameraImageUpdate(db: DB, cameraVM: CameraViewModel, voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
		guard let capturedImageBase64 = capturedImg.base64 else { /* ... */ }
		// ...
        if let result = await db.processImageWithAI(
                jpegImageBase64: capturedImageBase64,
                processingCommand: voiceRecordingVM.currentAudioTranscription ?? "Describe exactly what you see.",
                readResultOutLoud: true // hard-coded to always read out loud
            )
        {
 
            /// If we got audio from the server (we should get it if we set `readResultOutLoud` to true), read the audio out loud. Otherwise show the result as text in a notification
            if let audio = result.audio {
                voiceRecordingVM.generalAudioModel.playAudio(base64Source: audio)
            } else {
                showInAppNotification(/* show in-app notification if not reading out loud */)
            }
 
        } else {
            // ... error handling
        }
        // ...
	}
    // ...
}
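
The CameraManager and CameraDelegate internals are also omitted in these excerpts. On iOS, photo capture typically goes through AVCapturePhotoOutput with an AVCapturePhotoCaptureDelegate; a hypothetical, simplified sketch of that path (not the template's actual code) looks roughly like this:

import AVFoundation
import UIKit

// Hypothetical delegate: converts the captured photo into a UIImage and
// hands it back. In the template, this is roughly how the image ends up
// in CameraViewModel.capturedImage.
final class SketchCaptureDelegate: NSObject, AVCapturePhotoCaptureDelegate {
    private let onImage: (UIImage) -> Void
    init(onImage: @escaping (UIImage) -> Void) { self.onImage = onImage }

    func photoOutput(_ output: AVCapturePhotoOutput,
                     didFinishProcessingPhoto photo: AVCapturePhoto,
                     error: Error?) {
        guard error == nil,
              let data = photo.fileDataRepresentation(),
              let image = UIImage(data: data) else { return }
        onImage(image)
    }
}

// Hypothetical capture trigger: keeps the delegate alive until the photo
// has been processed.
final class SketchPhotoCapturer {
    private let photoOutput = AVCapturePhotoOutput() // attached to the session elsewhere
    private var delegate: SketchCaptureDelegate?

    func capturePhoto(completion: @escaping (UIImage) -> Void) {
        delegate = SketchCaptureDelegate(onImage: completion)
        photoOutput.capturePhoto(with: AVCapturePhotoSettings(), delegate: delegate!)
    }
}

// The view model later sends the image as a base64 JPEG string; an extension
// along these lines (hypothetical) is what a property like `capturedImg.base64`
// presumably boils down to.
extension UIImage {
    var jpegBase64: String? {
        jpegData(compressionQuality: 0.7)?.base64EncodedString()
    }
}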

Picking an Image from the Photo Library and Describing it

We also support the ability to pick an individual picture from the user's photo library and describe it. For this, we use PhotosUI's PhotosPicker. After the user selects an image, we save it in the selectedImagePickerItem variable of the AIVisionExampleViewModel. When that value changes, the photoSelectedFromLibrary() function is called, which converts the image into a UIImage and assigns it to the capturedImage variable (which, in the previous example, was set after the shutter button was pressed and the image captured). The rest of the process is the same as above.

AIVisionExampleView.swift
// ...
struct AIVisionExampleView: View {
 
    // ...
    @StateObject private var vm                      = AIVisionExampleViewModel()
    @StateObject private var cameraViewModel         = CameraViewModel()
    @StateObject private var voiceRecordingViewModel = VoiceRecordingViewModel()
    // ...
 
    var body: some View {
        VStack {
            // ... camera feed
 
            HStack {
                PhotosPicker(
					selection: $vm.selectedImagePickerItem,
					matching: .images,
					preferredItemEncoding: .compatible
				) {
					Image(systemName: "photo.stack")
                        // ...
				}
				.onChange(of: vm.selectedImagePickerItem) {
					vm.photoSelectedFromLibrary(cameraVM: cameraViewModel)
				}
 
 
                // ... shutter button
 
                // ... flip camera button
            }
        }
        .onChange(of: cameraViewModel.capturedImage) {
			vm.detectedCapturedCameraImageUpdate(
				db: db,
                cameraVM: cameraViewModel,
                voiceRecordingVM: voiceRecordingViewModel
            )
		}
 
    }
}
AIVisionExampleViewModel.swift
// ...
class AIVisionExampleViewModel: ObservableObject {
    // ...
    @Published var selectedImagePickerItem: PhotosPickerItem? = nil
    // ...
    func photoSelectedFromLibrary(cameraVM: CameraViewModel) {
        // ...
        if let data = try? await selectedImagePickerItem?.loadTransferable(type: Data.self) {
            // ...
            await MainActor.run {
                cameraVM.capturedImage = UIImage(data: data)
            }
            // ...
        }
	}
    // ...
}

With a Custom Image Prompt

The Bunny L1 example also includes the ability to ask custom questions about what is in front of you. This is done by holding down the shutter button, which starts the voice recording. When the user releases the button, the voice recording stops and the audio is transcribed by the VoiceTranscriptionModel using the locally bundled Whisper model.

We use the isCurrentlyRecording variable of the VoiceRecordingViewModel to decide whether releasing the shutter button should immediately capture an image or first stop the recording and wait for the transcription.

Holding down the shutter button sets the isCurrentlyRecording variable to true and starts the recording. So when the button is let go, we check: is the user currently recording? If yes, we stop the recording (the picture is captured once the transcription is ready). If there is no recording going on (meaning the user didn't hold the button down), we just take a picture.

The recording and transcription are done analogously to the AI Translator Example. After the recording is finished and its transcription is saved to the currentAudioTranscription variable of the VoiceRecordingViewModel, the detectedAudioTranscriptionUpdate() function is called, which captures the image, which in turn triggers detectedCapturedCameraImageUpdate().

The server request is then made the same way as in the previous examples, except that the prompt is set to currentAudioTranscription instead of defaulting to "Describe exactly what you see."

AIVisionExampleView.swift
// ...
struct AIVisionExampleView: View {
 
    // ...
    @StateObject private var vm                      = AIVisionExampleViewModel()
    @StateObject private var cameraViewModel         = CameraViewModel()
    @StateObject private var voiceRecordingViewModel = VoiceRecordingViewModel()
    // ...
 
    var body: some View {
        VStack {
            // ... camera feed
 
            HStack {
                // ... photos picker
 
                Button {
                    // long press over
                    vm.releasedShutterButton(cameraVM: cameraViewModel, voiceRecordingVM: voiceRecordingViewModel)
				} label: {
                    // ...
                }
				.buttonStyle(ShutterButton())
                .simultaneousGesture( // gesture to detect long press
					LongPressGesture(minimumDuration: 0.2).onEnded { _ in
						askUserFor(.microphoneAccess) {
							voiceRecordingViewModel.shouldStartRecording()
						} onDismiss: {
							showInAppNotification( /* ... no mic access warning */ )
						}
 
				})
                // ...
 
                // ... flip camera button
            }
        }
 
        .onChange(of: cameraViewModel.capturedImage) {
			vm.detectedCapturedCameraImageUpdate(
				db: db,
                cameraVM: cameraViewModel,
                voiceRecordingVM: voiceRecordingViewModel
            )
		}
        .onChange(of: voiceRecordingViewModel.currentAudioTranscription) {
			vm.detectedAudioTranscriptionUpdate(
                cameraVM: cameraViewModel,
                voiceRecordingVM: voiceRecordingViewModel
            )
		}
    }
}
VoiceRecordingViewModel.swift
class VoiceRecordingViewModel: ObservableObject {
    // ...
    @Published public private(set) var isCurrentlyRecording: Bool = false
    @Published var currentAudioTranscription: String? = nil
    @ObservedObject var voiceRecordingManager = VoiceRecordingManager()
    // ...
    func shouldStartRecording() {
        // ...
        // ... calls to the voice recording manager to start recording
        isCurrentlyRecording = true
        // ...
    }
    func shouldStopRecording() {
        // ...
        // ... calls to the voice recording manager to stop recording and transcribe the audio
        isCurrentlyRecording = false
        // ...
    }
    // ...
}
AIVisionExampleViewModel.swift
class AIVisionExampleViewModel: ObservableObject {
 
    // called when the shutter button is released in the AIVisionExampleView
	func releasedShutterButton(cameraVM: CameraViewModel, voiceRecordingVM: VoiceRecordingViewModel) {
		// ...
 
        // are we currently recording? (holding down)
		if voiceRecordingVM.isCurrentlyRecording { // yes, we are recording
            // ...
			voiceRecordingVM.shouldStopRecording()
            // ...
		} else { // no, it was just a shutter button tap
			cameraVM.captureImage()
		}
	}
 
    // called by the .onChange in the AIVisionExampleView
    @MainActor func detectedCapturedCameraImageUpdate(db: DB, cameraVM: CameraViewModel, voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
        if let result = await db.processImageWithAI(
            jpegImageBase64: capturedImageBase64,
            processingCommand: voiceRecordingVM.currentAudioTranscription ?? "Describe exactly what you see.",
            readResultOutLoud: true)
        {
            // ...
            voiceRecordingVM.generalAudioModel.playAudio(base64Source: audio)
            // ...
        } else {
            // ... error handling
        }
        // ...
	}
 
    // called by the .onChange in the AIVisionExampleView
	@MainActor func detectedAudioTranscriptionUpdate(
        cameraVM: CameraViewModel,
        voiceRecordingVM: VoiceRecordingViewModel
    ) {
        // ...
		cameraVM.captureImage()
	}
}
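
When a response comes back, the audio arrives as a base64-encoded MP3 string and is handed to GeneralAudioModel for playback. GeneralAudioModel itself isn't shown in this excerpt; as a rough, hypothetical sketch, decoding and playing such a string with AVAudioPlayer looks like this:

import AVFoundation
import Foundation

// Hypothetical sketch of base64 MP3 playback, similar in spirit to what
// GeneralAudioModel.playAudio(base64Source:) is assumed to do.
final class SketchBase64AudioPlayer {
    private var player: AVAudioPlayer?

    func playAudio(base64Source: String) {
        guard let data = Data(base64Encoded: base64Source) else { return }
        do {
            // Keep a reference to the player so playback isn't deallocated mid-way.
            player = try AVAudioPlayer(data: data)
            player?.play()
        } catch {
            print("Failed to play audio: \(error)")
        }
    }
}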

Interaction with the Backend

We access the backend via a dedicated function, processImageWithAI(), which extends the DB object in BackendFunctions.swift. It sends the image and the prompt to the backend and returns the vision result.

BackendFunctions.swift
// front-end
extension DB {
    public struct ImageAnalysisResultType: Codable {
 
		public let message: String // ai response to the image
		public let audio: String?  // base64 mp3 audio of the message above (if read out loud)
 
        // init using raw data from the endpoint
		public init(from rawData: [String: Any]) throws { /* ... */ }
	}
 
    // jpegImageBase64: base64 of the image that should be processed
    // processingCommand: the ai prompt to process the image with
    // readResultOutLoud: should the result be read out loud? if yes, an mp3 audio will be returned in addition to the textual response
	public func processImageWithAI(
        jpegImageBase64: String,
        processingCommand: String,
        readResultOutLoud: Bool
    ) async
		-> ImageAnalysisResultType?
	{ /* ... */ }
}
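
The body of processImageWithAI() is omitted above. Since the backend exposes the logic as a callable function named analyzeImageContents (shown below), the client-side call plausibly boils down to a Firebase callable invocation along these lines. This is a hedged sketch, not the template's exact implementation; the field names imageBase64, processingCommand, and readOutLoud are taken from the endpoint code below.

import FirebaseFunctions

// Hypothetical sketch of what the omitted processImageWithAI() body could
// look like, assuming the DB extension invokes the Firebase callable directly.
extension DB {
    public func processImageWithAI(
        jpegImageBase64: String,
        processingCommand: String,
        readResultOutLoud: Bool
    ) async -> ImageAnalysisResultType? {
        do {
            let result = try await Functions.functions()
                .httpsCallable("analyzeImageContents")
                .call([
                    "imageBase64": jpegImageBase64,
                    "processingCommand": processingCommand,
                    "readOutLoud": readResultOutLoud,
                ])
            guard let rawData = result.data as? [String: Any] else { return nil }
            return try ImageAnalysisResultType(from: rawData)
        } catch {
            // The template would surface this through its own error handling instead.
            return nil
        }
    }
}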

The backend endpoint that returns a result of type ImageAnalysisResultType:

index.ts
type ImageAnalysisResultType = {
  message: string;
  audio: string | null;
};
index.ts
// back-end
import * as AI from "./AIKit/AI";
// ...
 
export const analyzeImageContents = onCall(async (request) => {
  let fromUid = request.auth?.uid;
 
  // ...
 
  const imageBase64 = request.data?.imageBase64 as string | null;
  const processingCommand = request.data?.processingCommand as string | null;
  const readOutLoud = request.data?.readOutLoud as boolean | false;
 
  // ...
 
  const imageAnalysisResult = await AI.accessGPTVision(
    imageBase64,
    `${
      readOutLoud // prompt prefix for the AI to be better at talking out loud
        ? "You have to answer user's command as if you were speaking to a human. The following is a question/prompt and you must answer it as naturally as possible, but dont yap for too long. Be only verbose when asked directly or necessary in this context: "
        : ""
    }${processingCommand}`
  );
 
  // ...
 
  if (!readOutLoud) {
    const result: ImageAnalysisResultType = {
      message: imageAnalysisResult,
      audio: null,
    };
 
    // ...
    return result;
  }
 
  const audioBufferResult = await AI.convertTextToMp3Base64Audio(
    imageAnalysisResult
  );
 
  // ...
 
  const result: ImageAnalysisResultType = {
    message: imageAnalysisResult,
    audio: audioBufferResult.toString("base64"),
  };
 
  // ...
  return result;
  // ...
});

The endpoint above uses two pre-made AIKit functions, accessGPTVision() and convertTextToMp3Base64Audio(), to analyze an image with the provided prompt and convert the result to audio.

  • accessGPTVision() sends an image and a prompt to the GPT-4o vision model and returns a response.
  • convertTextToMp3Base64Audio() takes text as input, converts it to audio using OpenAI's TTS (Text-To-Speech) model, and returns the audio as a base64 string.
AI.ts
// back-end
import OpenAI from "openai";
import { openAIApiKey } from "../config";
 
const openai = new OpenAI({
  apiKey: openAIApiKey,
});
 
export async function accessGPTVision(
  imageBase64: string,
  imageProcessingCommand: string
): Promise<string | null> {
  const response = await openai.chat.completions.create({
    // ... chat completions with vision
    // https://platform.openai.com/docs/guides/vision
  });
  return response.choices[0].message.content;
  // ...
}
 
export async function convertTextToMp3Base64Audio(
  text: string
): Promise<Buffer | null> {
  // ...
  const mp3 = await openai.audio.speech.create({
    /* ... */
  });
  // ...
  const buffer = Buffer.from(await mp3.arrayBuffer());
  return buffer;
  // ...
}