AI Voice Translator Mini App Overview (AI Voice Example)
This Mini-App is a translator that takes in a voice input, translates it and reads it out loud in the selected language.
Features
- Simple UI: A button that records while you hold it down and a language selector. What else do you need?
- Local Model: The mini-app uses a small multilingual Whisper speech-to-text model to transcribe the voice input.
- OpenAI Cloud Models: To translate the text and generate the voice output, we use OpenAI's API.
- AnalyticsKit Integration: If AnalyticsKit is selected during project generation, events and errors related to the user interacting with the AI Translator mini-app will automatically be tracked with aikit as the source (both client-side and server-side).
Demo
Here's a demo of the AI Voice Translator Mini-App in action:
Implementation
Overview
Objects and Their Responsibilities
Here are the objects that this mini-app uses and a short summary of their responsibilities:
- AIVoiceExampleViewModel: This view model handles interaction with the user interface and holds the functions that are called on state changes (e.g. detectedAudioTranscriptionUpdate).
- VoiceRecordingViewModel: This view model is responsible for handling user interactions related to voice recording (e.g. shouldStartRecording).
- VoiceRecordingManager: This manager is directly responsible for recording the voice input and calling the VoiceTranscriptionModel to transcribe the audio.
- VoiceTranscriptionModel: This model is responsible for transcribing the audio using the bundled Whisper model.
- GeneralAudioModel: This model is responsible for playing audio files.
- DB: This object is responsible for handling communication with the backend and authentication (FirebaseBackend.swift), as well as interacting with BackendKit functions (BackendFunctions.swift).
Why so many? Because object-oriented programming and ✨abstractions✨.
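If you'd like a bird's-eye view before diving into the details, here is a condensed sketch of how these objects reference each other, pieced together from the code excerpts further down (bodies omitted):
import SwiftUI
import AVFoundation

// Ownership chain, condensed from the excerpts below (bodies omitted)
struct AIVoiceExampleView: View {
    @StateObject private var vm = AIVoiceExampleViewModel()               // UI state + translation flow
    @StateObject private var voiceRecordingVM = VoiceRecordingViewModel() // recording-related interactions
    var body: some View { /* ... */ }
}

class VoiceRecordingViewModel: ObservableObject {
    @ObservedObject var voiceRecordingManager = VoiceRecordingManager()   // does the actual recording
    // also exposes generalAudioModel (playback) and currentAudioTranscription (latest transcription)
}

class VoiceRecordingManager: NSObject, ObservableObject, AVAudioRecorderDelegate {
    @ObservedObject var voiceTranscriptionModel = VoiceTranscriptionModel() // local Whisper transcription
}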
Recording the Voice Input and Transcribing it
In two words: User holds the button down → start recording → user lets go → stop recording & transcribe.
In more than two words:
When the user starts holding down the microphone button, the view tells the voice recording view model (VoiceRecordingViewModel) that it should start recording, which in turn calls the startRecording() function of VoiceRecordingManager.
The recording only starts if the user has granted us microphone access: we pass the recording call as a closure to the askUserFor(.microphone, executeIfGotAccess: {}) function. Read more about that in Request Permission to access the Camera, Microphone, Location, etc.
Once the user lets go of the button, the voice example view model (AIVoiceExampleViewModel) will stop the recording (it calls shouldStopRecording() of VoiceRecordingViewModel, which calls the stopRecording() function of VoiceRecordingManager).
Once the VoiceRecordingManager successfully stops the recording, it calls the transcribeAudio() function of the VoiceTranscriptionModel, which transcribes the audio using the local Whisper model.
struct AIVoiceExampleView: View {
    @StateObject private var vm = AIVoiceExampleViewModel()
    @StateObject private var voiceRecordingVM = VoiceRecordingViewModel()

    var body: some View {
        VStack {
            Button {
                // Translate button released -> call the function to translate the voice recording
                vm.releasedTranslateButton(voiceRecordingVM: voiceRecordingVM)
            } label: { /* ... */ }
            .buttonStyle(TransateButton())
            // ...
            // If the user presses for at least 0.2 seconds and gave us microphone access
            // -> call our voice recording function that will initiate recording
            .simultaneousGesture(
                LongPressGesture(minimumDuration: 0.2).onEnded { _ in
                    askUserFor(.microphoneAccess) {
                        voiceRecordingVM.shouldStartRecording()
                    } onDismiss: { /* ... */ }
                })
        }
    }
}
class VoiceRecordingViewModel: ObservableObject {
    /// The VoiceRecordingManager that will handle the recording
    @ObservedObject var voiceRecordingManager = VoiceRecordingManager()
    // ...

    func shouldStartRecording() {
        // ...
        voiceRecordingManager.startRecording()
        // ...
    }

    func shouldStopRecording() {
        // ...
        voiceRecordingManager.stopRecording(success: true)
        // ...
    }
    // ...
}
class AIVoiceExampleViewModel: ObservableObject {
    // ...
    func releasedTranslateButton(voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
        voiceRecordingVM.shouldStopRecording()
        // ...
    }
    // ...
}
class VoiceRecordingManager: NSObject, ObservableObject, AVAudioRecorderDelegate {
    @ObservedObject var voiceTranscriptionModel = VoiceTranscriptionModel()
    // ...

    func startRecording() {
        // ...
        do {
            // ... set audio recording session
            audioRecorder.record()
        } catch {
            // ...
            // In case of an error, stop the recording
            stopRecording(success: false)
        }
    }

    // Will stop the recording. If success -> transcribe the audio, if not -> an error occurred
    func stopRecording(success: Bool) {
        // ...
        audioRecorder.stop()
        if success {
            // ...
            // Transcription is asynchronous, so run it in a Task
            Task {
                do {
                    let transcribedAudio = try await self.voiceTranscriptionModel.transcribeAudio(fromURL: self.getTempRecordingPath())
                    self.voiceTranscriptionModel.currentAudioTranscription = transcribedAudio
                    // ...
                } catch {
                    // ...
                }
            }
        }
    }
}
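VoiceTranscriptionModel itself isn't shown in the excerpts above, so here is a minimal sketch of its shape. The transcribeAudio(fromURL:) signature and the currentAudioTranscription property come from the code above; the WhisperModel wrapper, its loadFromBundle() initializer and its transcribe(audioURL:) method are hypothetical placeholders for however the bundled Whisper model is actually invoked:
import Foundation
import Combine

// Sketch of VoiceTranscriptionModel. `WhisperModel`, `loadFromBundle()` and
// `transcribe(audioURL:)` are hypothetical stand-ins, not the actual API.
class VoiceTranscriptionModel: ObservableObject {
    /// The latest transcription. Set by VoiceRecordingManager after a successful recording,
    /// and observed by the view through VoiceRecordingViewModel.
    @Published var currentAudioTranscription: String?

    /// Hypothetical wrapper around the Whisper weights shipped with the app.
    private let whisperModel = WhisperModel.loadFromBundle()

    /// Transcribes the recorded audio file at `url` and returns the recognized text.
    func transcribeAudio(fromURL url: URL) async throws -> String {
        // Run the (potentially slow) speech-to-text pass off the main thread.
        try await whisperModel.transcribe(audioURL: url)
    }
}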
Translating the Text and Playing it out Loud
Once we detect an update in the voiceRecordingVM.currentAudioTranscription variable (and it isn't nil, meaning we have a new transcription), we send that transcription to the backend to translate it into the selected language.
The backend then sends the translation back to the client, which will either read it out loud or show it as an in-app notification (currently hardcoded to always read it out loud).
import FirebaseKit

struct AIVoiceExampleView: View {
    @EnvironmentObject var db: DB // FirebaseKit DB object
    @StateObject private var vm = AIVoiceExampleViewModel()
    @StateObject private var voiceRecordingVM = VoiceRecordingViewModel()

    var body: some View {
        VStack {
            // ... microphone button
            Menu {
                Picker("Will translate to:", selection: $vm.selectedOutputLanguage) {
                    ForEach(languages, id: \.self) {
                        Text($0)
                        // ...
                    }
                }
            } label: {
                HStack {
                    Text("Will translate to \(vm.selectedOutputLanguage)")
                    // ...
                }
                // ...
            }
        }
        .onReceive(voiceRecordingVM.$currentAudioTranscription) { transcription in
            vm.detectedAudioTranscriptionUpdate(db: db, voiceRecordingVM: voiceRecordingVM)
        }
    }
}
class AIVoiceExampleViewModel: ObservableObject {
    /// The selected output language for the translation
    @Published var selectedOutputLanguage = "Russian"

    @MainActor func detectedAudioTranscriptionUpdate(db: DB, voiceRecordingVM: VoiceRecordingViewModel) {
        // ...
        // The backend call is asynchronous, so run it in a Task
        Task {
            if let result = await db.processTextWithAI(
                text:
                    "Translate the following text into \(self.selectedOutputLanguage). Make it sound as natural as possible: \(recordedTranscription)",
                readResultOutLoud: true)
            {
                // Read result out loud -> will return an audio reading out the translation, so play it
                // If not, will return the translation as a message, which we will show as an in-app notification
                if let audio = result.audio {
                    voiceRecordingVM.generalAudioModel.playAudio(base64Source: audio)
                } else {
                    showInAppNotification(/* ... */)
                }
            } else { /* ... */ }
        }
        // ...
    }
}
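GeneralAudioModel is only called above, never shown. A minimal version of it could look like this, assuming the base64 string is the mp3 audio returned by the backend (the class shape, audio session handling and error logging here are illustrative, not the actual implementation):
import AVFoundation
import Combine

/// Plays audio data (here: the base64-encoded mp3 returned by the backend).
class GeneralAudioModel: ObservableObject {
    private var audioPlayer: AVAudioPlayer?

    /// Decodes a base64-encoded mp3 string and plays it.
    func playAudio(base64Source: String) {
        guard let audioData = Data(base64Encoded: base64Source) else {
            print("[GeneralAudioModel] Couldn't decode base64 audio")
            return
        }
        do {
            // Route playback to the speaker (we were just recording from the mic)
            try AVAudioSession.sharedInstance().setCategory(.playback)
            try AVAudioSession.sharedInstance().setActive(true)

            audioPlayer = try AVAudioPlayer(data: audioData)
            audioPlayer?.play()
        } catch {
            print("[GeneralAudioModel] Couldn't play audio: \(error)")
        }
    }
}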
We access the backend via a dedicated processTextWithAI() function, defined as an extension of the DB object in BackendFunctions.swift, which sends the text to the backend and returns the result.
// front-end
extension DB {
    public struct TextAnalysisResultType: Codable {
        public let message: String // the translated message
        public let audio: String? // base64 mp3 audio of the translated message (if read out loud)

        // init using raw data from the endpoint
        public init(from rawData: [String: Any]) throws { /* ... */ }
    }

    // text: text to prompt GPT with
    // readResultOutLoud: should the backend also send a TTS audio mp3 of the response?
    public func processTextWithAI(
        text: String,
        readResultOutLoud: Bool
    ) async -> TextAnalysisResultType? { }
}
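The body of processTextWithAI() is omitted above. Here is a minimal sketch of what it might do, assuming the DB extension calls the analyzeTextContents endpoint directly through the Firebase Functions SDK (the actual BackendFunctions.swift wrapper may differ):
// front-end (sketch, not the actual BackendFunctions.swift implementation)
import FirebaseFunctions

extension DB {
    public func processTextWithAI(
        text: String,
        readResultOutLoud: Bool
    ) async -> TextAnalysisResultType? {
        do {
            // Keys match what the analyzeTextContents endpoint (below) reads from request.data
            let payload: [String: Any] = [
                "text": text,
                "readOutLoud": readResultOutLoud,
            ]
            let response = try await Functions.functions()
                .httpsCallable("analyzeTextContents")
                .call(payload)

            guard let rawData = response.data as? [String: Any] else { return nil }
            return try TextAnalysisResultType(from: rawData)
        } catch {
            // On any error (network, auth, decoding), just return nil to the caller
            return nil
        }
    }
}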
The backend endpoint will return a result of type TextAnalysisResultType:
type TextAnalysisResultType = {
  message: string;
  audio: string | null;
};

// back-end
import * as AI from "./AIKit/AI";
// ...

export const analyzeTextContents = onCall(async (request) => {
  // ...
  const text = request.data?.text as string | null;
  const readOutLoud = (request.data?.readOutLoud as boolean | undefined) ?? false;
  // ...
  const textAnalysisResult = await AI.accessGPTChat({ text });
  // ...
  if (!readOutLoud) {
    const result: TextAnalysisResultType = {
      message: textAnalysisResult,
      audio: null,
    };
    // ...
    return result;
  }
  // ...
  // If we should read out loud, use OpenAI's TTS to convert the text to audio
  const audioBufferResult = await AI.convertTextToMp3Base64Audio(
    textAnalysisResult
  );
  // ...
  const result: TextAnalysisResultType = {
    message: textAnalysisResult,
    audio: audioBufferResult.toString("base64"),
  };
  // ...
  return result;
});
The endpoint above uses two pre-made AIKit functions, accessGPTChat() and convertTextToMp3Base64Audio(), to get the translation and convert it to audio.
- accessGPTChat() sends the text to the GPT-4o model and returns an answer (we provide the prompt from the client side).
- convertTextToMp3Base64Audio() takes text as input, converts it to audio using OpenAI's TTS (Text-To-Speech) model, and returns the audio as an mp3 buffer, which the endpoint encodes to a base64 string.
// back-end
import OpenAI from "openai";
import { openAIApiKey } from "../config";

const openai = new OpenAI({
  apiKey: openAIApiKey,
});

export type GPTChatMessage = {
  role: "user" | "assistant";
  content: string;
};

export async function accessGPTChat({
  text,
  previousChatMessages = [],
}: {
  text: string;
  previousChatMessages?: GPTChatMessage[];
}): Promise<string | null> {
  const response = await openai.chat.completions.create({
    /* ... */
  });
  return response.choices[0].message.content;
}

export async function convertTextToMp3Base64Audio(
  text: string
): Promise<Buffer | null> {
  // ...
  const mp3 = await openai.audio.speech.create({
    /* ... */
  });
  // ...
  const buffer = Buffer.from(await mp3.arrayBuffer());
  return buffer;
  // ...
}