AI Voice Translator Mini App Overview (AI Voice Example)
This Mini-App is a translator that takes in a voice input, translates it and reads it out loud in the selected language.
Features
- Simple UI: A button that records while you hold it down and a language selector. What else do you need?
- Local Model: The mini-app uses a small multilingual Whisper speech-to-text model to transcribe the voice input.
- OpenAI Cloud Models: To translate the text and generate the voice output, we use OpenAI's API.
- AnalyticsKit Integration: If AnalyticsKit is selected during project generation, events and errors related to the user interacting with the AI Translator mini-app will automatically be tracked with aikit as the source (both client-side and server-side).
Demo
Here's a demo of the AI Voice Translator Mini-App in action:
Implementation
Overview
Objects and Their Responsibilities
Here are the objects that this mini-app uses and a short summary of their responsibilities:
- AIVoiceExampleViewModel: This view model handles interaction with the user interface and holds the functions that are called on state changes (e.g. detectedAudioTranscriptionUpdate).
- VoiceRecordingViewModel: This view model is responsible for handling user interactions related to voice recording (e.g. shouldStartRecording).
- VoiceRecordingManager: This manager is directly responsible for recording the voice input and calling the VoiceTranscriptionModel to transcribe the audio.
- VoiceTranscriptionModel: This model is responsible for transcribing the audio using the bundled Whisper model.
- GeneralAudioModel: This model is responsible for playing audio files.
- DB: This object is responsible for handling communication with the backend and authentication (FirebaseBackend.swift), as well as interacting with BackendKit functions (BackendFunctions.swift).
Why so many? Because object-oriented programming and ✨abstractions✨.
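If you'd like a bird's-eye view before diving into the details, here is a condensed sketch of how these objects reference each other, pieced together from the code excerpts further down (bodies omitted):
import SwiftUI
import AVFoundation

// Ownership chain, condensed from the excerpts below (bodies omitted)
struct AIVoiceExampleView: View {
    @StateObject private var vm = AIVoiceExampleViewModel()               // UI state + translation flow
    @StateObject private var voiceRecordingVM = VoiceRecordingViewModel() // recording-related interactions
    var body: some View { /* ... */ }
}

class VoiceRecordingViewModel: ObservableObject {
    @ObservedObject var voiceRecordingManager = VoiceRecordingManager()   // does the actual recording
    // also exposes generalAudioModel (playback) and currentAudioTranscription (latest transcription)
}

class VoiceRecordingManager: NSObject, ObservableObject, AVAudioRecorderDelegate {
    @ObservedObject var voiceTranscriptionModel = VoiceTranscriptionModel() // local Whisper transcription
}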
Recording the Voice Input and Transcribing it
In two words: User holds the button down → start recording → user lets go → stop recording & transcribe.
In more than two words:
When the user starts holding down the microphone button, the view tells the voice recording view model (VoiceRecordingViewModel) that it should start recording, which in turn calls the startRecording() function of VoiceRecordingManager.
The recording only starts if the user has granted us microphone access: we pass the recording call as a closure to the askUserFor(.microphone, executeIfGotAccess: {}) function. Read more about that in Request Permission to access the Camera, Microphone, Location, etc.
Once the user lets go of the button, the voice example view model (AIVoiceExampleViewModel) will stop the recording (it calls shouldStopRecording() of VoiceRecordingViewModel, which calls the stopRecording() function of VoiceRecordingManager).
Once the VoiceRecordingManager successfully stops the recording, it calls the transcribeAudio() function of the VoiceTranscriptionModel, which transcribes the audio using the local Whisper model.
struct AIVoiceExampleView: View {
    @StateObject private var vm = AIVoiceExampleViewModel()
    @StateObject private var voiceRecordingVM = VoiceRecordingViewModel()

    var body: some View {
        VStack {
            Button {
                // Translate button released -> call the function to translate the voice recording
                vm.releasedTranslateButton(voiceRecordingVM: voiceRecordingVM)
            } label: { /* ... */ }
            .buttonStyle(TransateButton())
            // ...
            // If the user presses for at least 0.2 seconds and gave us microphone access
            // -> call our voice recording function that will initiate recording
            .simultaneousGesture(
                LongPressGesture(minimumDuration: 0.2).onEnded { _ in
                    askUserFor(.microphoneAccess) {
                        voiceRecordingVM.shouldStartRecording()
                    } onDismiss: { /* ... */ }
                })
        }
    }
}
class VoiceRecordingViewModel: ObservableObject {
    /// The VoiceRecordingManager that will handle the recording
    @ObservedObject var voiceRecordingManager = VoiceRecordingManager()
    // ...

    func shouldStartRecording() {
        // ...
        voiceRecordingManager.startRecording()
        // ...
    }

    func shouldStopRecording() {
        // ...
        voiceRecordingManager.stopRecording(success: true)
        // ...
    }
    // ...
}
class AIVoiceExampleViewModel: ObservableObject {
    // ...
    func releasedTranslateButton(voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
        voiceRecordingVM.shouldStopRecording()
        // ...
    }
    // ...
}
class VoiceRecordingManager: NSObject, ObservableObject, AVAudioRecorderDelegate {
    @ObservedObject var voiceTranscriptionModel = VoiceTranscriptionModel()
    // ...

    func startRecording() {
        // ...
        do {
            // ... set audio recording session
            audioRecorder.record()
        } catch {
            // ...
            // In case of an error, stop the recording
            stopRecording(success: false)
        }
    }

    // Will stop the recording. If success -> transcribe the audio, if not -> an error occurred
    func stopRecording(success: Bool) {
        // ...
        audioRecorder.stop()
        if success {
            // ...
            // Transcription is asynchronous, so run it in a Task
            Task {
                do {
                    let transcribedAudio = try await self.voiceTranscriptionModel.transcribeAudio(fromURL: self.getTempRecordingPath())
                    self.voiceTranscriptionModel.currentAudioTranscription = transcribedAudio
                    // ...
                } catch {
                    // ...
                }
            }
        }
    }
}
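VoiceTranscriptionModel itself isn't shown in the excerpts above, so here is a minimal sketch of its shape. The transcribeAudio(fromURL:) signature and the currentAudioTranscription property come from the code above; the WhisperModel wrapper, its loadFromBundle() initializer and its transcribe(audioURL:) method are hypothetical placeholders for however the bundled Whisper model is actually invoked:
import Foundation
import Combine

// Sketch of VoiceTranscriptionModel. `WhisperModel`, `loadFromBundle()` and
// `transcribe(audioURL:)` are hypothetical stand-ins, not the actual API.
class VoiceTranscriptionModel: ObservableObject {
    /// The latest transcription. Set by VoiceRecordingManager after a successful recording,
    /// and observed by the view through VoiceRecordingViewModel.
    @Published var currentAudioTranscription: String?

    /// Hypothetical wrapper around the Whisper weights shipped with the app.
    private let whisperModel = WhisperModel.loadFromBundle()

    /// Transcribes the recorded audio file at `url` and returns the recognized text.
    func transcribeAudio(fromURL url: URL) async throws -> String {
        // Run the (potentially slow) speech-to-text pass off the main thread.
        try await whisperModel.transcribe(audioURL: url)
    }
}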
Translating the Text and Playing it out Loud
Once we detect an update in the voiceRecordingVM.currentAudioTranscription variable (and it isn't nil, meaning we have a new transcription), we send that transcription to the backend to translate it into the selected language.
The backend then sends the translation back to the client, which will either read it out loud or show it as an in-app notification (currently hardcoded to always read it out loud).
import FirebaseKit

struct AIVoiceExampleView: View {
    @EnvironmentObject var db: DB // FirebaseKit DB object
    @StateObject private var vm = AIVoiceExampleViewModel()
    @StateObject private var voiceRecordingVM = VoiceRecordingViewModel()

    var body: some View {
        VStack {
            // ... microphone button
            Menu {
                Picker("Will translate to:", selection: $vm.selectedOutputLanguage) {
                    ForEach(languages, id: \.self) {
                        Text($0)
                        // ...
                    }
                }
            } label: {
                HStack {
                    Text("Will translate to \(vm.selectedOutputLanguage)")
                    // ...
                }
                // ...
            }
        }
        .onReceive(voiceRecordingVM.$currentAudioTranscription) { transcription in
            vm.detectedAudioTranscriptionUpdate(db: db, voiceRecordingVM: voiceRecordingVM)
        }
    }
}
class AIVoiceExampleViewModel: ObservableObject {
    /// The selected output language for the translation
    @Published var selectedOutputLanguage = "Russian"

    @MainActor func detectedAudioTranscriptionUpdate(db: DB, voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
        // The backend call is asynchronous, so run it in a Task
        Task {
            if let result = await db.processTextWithAI(
                text:
                    "Translate the following text into \(self.selectedOutputLanguage). Make it sound as natural as possible: \(recordedTranscription)",
                readResultOutLoud: true)
            {
                // Read result out loud -> will return an audio reading out the translation, so play it
                // If not, will return the translation as a message, which we will show as an in-app notification
                if let audio = result.audio {
                    voiceRecordingVM.generalAudioModel.playAudio(base64Source: audio)
                } else {
                    showInAppNotification(/* ... */)
                }
            } else { /* ... */ }
        }
        // ...
    }
}
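GeneralAudioModel is only called above, never shown. A minimal version of it could look like this, assuming the base64 string is the mp3 audio returned by the backend (the class shape, audio session handling and error logging here are illustrative, not the actual implementation):
import AVFoundation
import Combine

/// Plays audio data (here: the base64-encoded mp3 returned by the backend).
class GeneralAudioModel: ObservableObject {
    private var audioPlayer: AVAudioPlayer?

    /// Decodes a base64-encoded mp3 string and plays it.
    func playAudio(base64Source: String) {
        guard let audioData = Data(base64Encoded: base64Source) else {
            print("[GeneralAudioModel] Couldn't decode base64 audio")
            return
        }
        do {
            // Route playback to the speaker (we were just recording from the mic)
            try AVAudioSession.sharedInstance().setCategory(.playback)
            try AVAudioSession.sharedInstance().setActive(true)

            audioPlayer = try AVAudioPlayer(data: audioData)
            audioPlayer?.play()
        } catch {
            print("[GeneralAudioModel] Couldn't play audio: \(error)")
        }
    }
}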
We access the backend via a dedicated processTextWithAI() function, defined as an extension of the DB object in BackendFunctions.swift, which sends the text to the backend and returns the result.
// front-end
extension DB {
    public struct TextAnalysisResultType: Codable {
        public let message: String // the translated message
        public let audio: String? // base64 mp3 audio of the translated message (if read out loud)

        // init using raw data from the endpoint
        public init(from rawData: [String: Any]) throws { /* ... */ }
    }

    // text: text to prompt GPT with
    // readResultOutLoud: should the backend also send a TTS audio mp3 of the response?
    public func processTextWithAI(
        text: String,
        readResultOutLoud: Bool
    ) async -> TextAnalysisResultType? { }
}
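The body of processTextWithAI() is omitted above. Here is a minimal sketch of what it might do, assuming the DB extension calls the analyzeTextContents endpoint directly through the Firebase Functions SDK (the actual BackendFunctions.swift wrapper may differ):
// front-end (sketch, not the actual BackendFunctions.swift implementation)
import FirebaseFunctions

extension DB {
    public func processTextWithAI(
        text: String,
        readResultOutLoud: Bool
    ) async -> TextAnalysisResultType? {
        do {
            // Keys match what the analyzeTextContents endpoint (below) reads from request.data
            let payload: [String: Any] = [
                "text": text,
                "readOutLoud": readResultOutLoud,
            ]
            let response = try await Functions.functions()
                .httpsCallable("analyzeTextContents")
                .call(payload)

            guard let rawData = response.data as? [String: Any] else { return nil }
            return try TextAnalysisResultType(from: rawData)
        } catch {
            // On any error (network, auth, decoding), just return nil to the caller
            return nil
        }
    }
}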
The backend endpoint will return a result of type TextAnalysisResultType:
type TextAnalysisResultType = {
  message: string;
  audio: string | null;
};

// back-end
import * as AI from "./AIKit/AI";
// ...

export const analyzeTextContents = onCall(async (request) => {
  // ...
  const text = request.data?.text as string | null;
  const readOutLoud = (request.data?.readOutLoud as boolean | undefined) ?? false;
  // ...
  const textAnalysisResult = await AI.accessGPTChat({ text });
  // ...
  if (!readOutLoud) {
    const result: TextAnalysisResultType = {
      message: textAnalysisResult,
      audio: null,
    };
    // ...
    return result;
  }
  // ...
  // If we should read out loud, use OpenAI's TTS to convert the text to audio
  const audioBufferResult = await AI.convertTextToMp3Base64Audio(
    textAnalysisResult
  );
  // ...
  const result: TextAnalysisResultType = {
    message: textAnalysisResult,
    audio: audioBufferResult.toString("base64"),
  };
  // ...
  return result;
});
The endpoint above uses two pre-made AIKit functions, accessGPTChat() and convertTextToMp3Base64Audio(), to get the translation and convert it to audio.
- accessGPTChat() sends the text to the GPT-4o model and returns an answer (we provide the prompt from the client side).
- convertTextToMp3Base64Audio() takes text as input, converts it to audio using OpenAI's TTS (Text-To-Speech) model, and returns the audio as an mp3 buffer, which the endpoint encodes to a base64 string.
// back-end
import OpenAI from "openai";
import { openAIApiKey } from "../config";

const openai = new OpenAI({
  apiKey: openAIApiKey,
});

export type GPTChatMessage = {
  role: "user" | "assistant";
  content: string;
};

export async function accessGPTChat({
  text,
  previousChatMessages = [],
}: {
  text: string;
  previousChatMessages?: GPTChatMessage[];
}): Promise<string | null> {
  const response = await openai.chat.completions.create({
    /* ... */
  });
  return response.choices[0].message.content;
}

export async function convertTextToMp3Base64Audio(
  text: string
): Promise<Buffer | null> {
  // ...
  const mp3 = await openai.audio.speech.create({
    /* ... */
  });
  // ...
  const buffer = Buffer.from(await mp3.arrayBuffer());
  return buffer;
  // ...
}