On-Device Real-Time AI Audio Filters with Stable Audio Open Small and the Switchboard SDK
Stable Audio Open Small is a newly open-sourced, 341-million-parameter text-to-audio model from Stability AI that runs entirely on Arm CPUs. It can generate up to ~11 seconds of high-quality stereo audio on a smartphone in under 8 seconds. It pairs naturally with the Switchboard SDK, our cross-platform audio pipeline framework geared towards real-time audio software development. Together, these technologies enable mobile developers to build advanced voice features, like generative sound effects, intelligent voice filters, and on-device audio processing, completely in real time on consumer devices. This combination unlocks powerful on-device audio intelligence without requiring cloud services.
Why On-Device Matters
Running audio intelligence on-device has key advantages for mobile voice applications. Latency is dramatically reduced when all processing happens locally: there's no network round-trip, so interactive voice features respond near-instantly. Privacy is enhanced because sensitive audio (e.g. personal voice notes or calls) never leaves the user's device. It also means offline availability, allowing features like voice filters or transcription to work even without internet access. For mobile developers, this translates to smoother user experiences: imagine a voice chat app applying effects with virtually zero delay, or a voice recorder that cleans up audio as you speak. These real-time interactions are only feasible when AI models run on the device itself, close to the source of the audio. Switchboard's design reflects this need: the SDK is built specifically for real-time audio software development, ensuring that audio pipelines execute with minimal latency on mobile hardware.
The Opportunity: Real-Time Voice Features in Consumer Apps
With on-device audio AI, a new wave of voice-driven user experiences is emerging. Consider messaging apps where users can send smart voice notes: as you record a voice message, the app could live-filter background noise, adjust volume levels, or even add fun sound effects, all in real time. Social platforms and camera apps are already popularizing voice changers and audio filters for video stories; on-device models make these effects instantaneous and more accessible. In voice chat for gaming or live streaming, real-time voice transformation (like changing your voice to a character or applying comic effects) can enhance user engagement. Even practical UX improvements are possible, such as automatically compressing and enhancing voice notes so they sound clear while using less bandwidth. The market opportunity is broad: voice interfaces are becoming mainstream in communication, entertainment, and assistive apps, and users now expect responsive, interactive audio features. By leveraging capable on-device models, developers can differentiate their apps with voice-controlled filters, dynamic soundtracks, personalized audio responses, and other novel features that respond immediately to the user's voice.
What Stable Audio Open Small Brings to the Table
Stable Audio Open Small is a breakthrough in making generative audio practical on mobile. It's a compact version of Stability AI's text-to-audio model (down from 1.1B to 341M parameters) optimized for mobile CPUs. Despite its smaller size, it preserves impressive output quality and adherence to prompts. Technically, the model produces 44.1 kHz stereo audio from a text description, up to around 11 seconds in length. It excels at generating short audio samples, sound effects, and musical clips (drum loops, instrument riffs, Foley effects, ambient textures, etc.), which makes it ideal for mobile apps that need on-demand audio snippets. Crucially, Stable Audio Open Small is engineered for speed and efficiency: it's "the fastest stereo text-to-audio model on the market", capable of mobile inference in just a few seconds for multi-second audio clips. In practice, its compact size and fast inference make it a "perfect fit for on-device deployment on Arm-powered smartphones and edge devices, where real-time generation and responsiveness matter." At 341 million parameters, the model packs into a few hundred megabytes with reduced-precision weights, tiny in comparison to typical cloud-scale AI models. And by leveraging Arm's optimized libraries (like KleidiAI), it runs efficiently on common mobile chipsets without requiring GPU acceleration. For developers, this means Stable Audio Open Small can be embedded directly into apps and run on a wide range of devices, from high-end phones to resource-constrained IoT gadgets, enabling AI audio features that were previously only possible with server-side processing.
How to Use Stable Audio Open Small with Switchboard
Integrating Stable Audio Open Small into a mobile app is straightforward with the Switchboard SDK. Switchboard lets you construct an audio graph (a pipeline of audio nodes) that can include sources (inputs), processors (effects or ML models), and sinks (outputs). In this architecture, Stable Audio Open Small can be encapsulated as a custom ML node inside the graph.
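To make the graph model concrete, here is a minimal pass-through pipeline before any ML is involved. This is a sketch: SBGainNode is an assumed placeholder for any built-in effect node, while the engine and graph wiring mirror the integration example later in this section.

import SwitchboardSDK

// Minimal mic → effect → speaker graph. SBGainNode stands in for any
// processor node; an ML node slots into the same position in the chain.
let engine = SBAudioEngine()
engine.microphoneEnabled = true

let gain = SBGainNode()  // assumed built-in effect node
let graph = SBAudioGraph()
graph.addNode(gain)
graph.connect(graph.inputNode, to: gain)   // source → processor
graph.connect(gain, to: graph.outputNode)  // processor → sink

engine.start(graph)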
The cleanest way to embed Stable Audio Open Small in a Switchboard graph is to convert the released PyTorch checkpoint to ONNX and load it with the SDK's ONNX extension. Switchboard already wraps ONNX Runtime and exposes three audio-centric nodes (ONNX.MLSource, ONNX.MLProcessor, and ONNX.MLSink), so any exported model drops into the graph like a normal effect or generator. The extension streams audio buffers through ONNX Runtime in real time, which can keep per-buffer latency below twenty milliseconds on modern mobile CPUs.
Step 1. Export the model to ONNX
import torch
from stable_audio_tools import get_pretrained_model  # pip install stable-audio-tools

# Download the released checkpoint from Hugging Face.
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-small")
model.eval()

# Simplified single-graph export. In practice the pipeline's components
# (text encoder, diffusion transformer, autoencoder decoder) are usually
# exported as separate ONNX graphs and chained at runtime.
dummy = torch.randint(0, 1000, (1, 64))  # placeholder prompt token ids
torch.onnx.export(
    model,
    (dummy,),
    "stable_audio_open_small.onnx",
    input_names=["prompt_ids"],
    output_names=["audio_pcm"],
    dynamic_axes={"prompt_ids": {0: "batch"}, "audio_pcm": {0: "batch"}},
    opset_version=17,
)
Optionally convert the exported graph to ORT format for mobile with python -m onnxruntime.tools.convert_onnx_models_to_ort stable_audio_open_small.onnx, which writes an optimized .ort file next to the input. A separate float16 conversion pass (e.g. with the onnxconverter-common package) can roughly halve the file size.
Step 2. Wire the model into an audio graph
import SwitchboardSDK

let engine = SBAudioEngine()
engine.microphoneEnabled = true  // capture the live mic

// Load the exported model from the app bundle.
let modelPath = Bundle.main.path(
    forResource: "stable_audio_open_small", ofType: "onnx")!

// Wrap the model as an ONNX processor node (the ONNX.MLProcessor described above).
let genNode = ONNXProcessorNode(modelPath: modelPath,
                                inputFormat: .pcm,
                                outputFormat: .pcm,
                                blockSize: 512)  // ≈12 ms at 44.1 kHz

let graph = SBAudioGraph()
graph.addNode(genNode)
graph.connect(graph.inputNode, to: genNode)   // mic → model
graph.connect(genNode, to: graph.outputNode)  // model → speaker

engine.start(graph)  // begin real-time processing
In this pipeline, the device's microphone feeds audio into the Stable Audio Open Small node, which processes or transforms the audio, and the resulting sound is sent to the device speaker in real time. (Switchboard's engine automatically connects the physical mic to the graph's inputNode and the phone's speaker to the outputNode.) Depending on how you configure the model, the Stable Audio node could, for example, apply an AI noise filter, perform voice style conversion, or even generate a completely new audio stream based on the input. The key is that with Switchboard, you can treat the ML model like any other audio effect in the signal chain. The SDK handles buffering, threading, and audio I/O under the hood, letting developers focus on high-level logic (like feeding in prompts or switching effects) without delving into low-level audio processing or DSP.
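Switching behavior at runtime can then be as simple as updating the node's conditioning or toggling it in the chain. The snippet below is a hypothetical sketch: setParameter, the "prompt" key, and the bypass flag are assumed stand-ins for whatever controls your ONNX wrapper exposes, not documented Switchboard API.

// Hypothetical controls; a real wrapper exposes its own API for
// conditioning text and enable/disable state.
genNode.setParameter("prompt", value: "subtle rain ambience")  // assumed setter
genNode.bypass = false  // effect active: model output feeds the chain

// Later, when the user turns the filter off:
genNode.bypass = true   // audio passes through the node untouched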
Example Use Case: Smart Voice Messaging
To illustrate the possibilities, imagine a smart voice messaging feature in a chat application. Normally, when users record a voice note, it's sent as-is. But using Stable Audio Open Small with Switchboard, we can enhance this experience in real time. For instance, as the user records their voice message, the app could live-filter the audio to remove noise and optimize clarity. Simultaneously, it might apply a creative voice filter, perhaps making the voice sound like a musical instrument or adding a subtle background ambience to match the message's mood. Stable Audio Open Small is well-suited for generating such background audio or sound effects on the fly (e.g. a quick “ambient cafe noise” bed under a voice note to indicate atmosphere). The Switchboard graph for this could combine multiple nodes: a noise suppression node, the Stable Audio generative node, and a mixer to blend the original voice with any generated effects. All of this would happen locally and in real-time, giving the user a preview of their augmented voice note as they record it.
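A sketch of that graph follows, under stated assumptions: SBNoiseFilterNode and SBMixerNode are assumed names for whatever noise-suppression and mixer nodes your Switchboard build provides, and ONNXSourceNode mirrors the ONNX.MLSource wrapper mentioned earlier.

import SwitchboardSDK

let engine = SBAudioEngine()
engine.microphoneEnabled = true

let modelPath = Bundle.main.path(
    forResource: "stable_audio_open_small", ofType: "onnx")!
let denoise = SBNoiseFilterNode()                    // assumed noise-suppression node
let ambience = ONNXSourceNode(modelPath: modelPath)  // generates the background bed
let mixer = SBMixerNode()                            // assumed mixer node

let graph = SBAudioGraph()
graph.addNode(denoise)
graph.addNode(ambience)
graph.addNode(mixer)
graph.connect(graph.inputNode, to: denoise)  // mic → noise suppression
graph.connect(denoise, to: mixer)            // clean voice → mixer
graph.connect(ambience, to: mixer)           // generated ambience → mixer
graph.connect(mixer, to: graph.outputNode)   // blended preview → speaker

engine.start(graph)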
As a concrete example, consider a voice memo app that implements intelligent compression of voice messages. When a user speaks for, say, 30 seconds, the app could use on-device speech-to-text to transcribe the content, then use Stable Audio Open Small to regenerate a concise audio summary or highlight reel from that text. This would drastically reduce the message length while preserving the key information, effectively using AI to compress the audio. With the Switchboard SDK, one could set up a pipeline where the steps are: microphone input → speech-to-text node (e.g. a Whisper node) → text summarization (in-app logic) → Stable Audio Open Small node (as a source node generating audio from the summary text) → output/recording. The end result is a short, synthesized voice note that the app can play back or send, created entirely on-device in real time.
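A sketch of that chain, with the same caveats: WhisperNode, SBRecorderNode, the onTranscript callback, and the summarize(_:) helper are all hypothetical stand-ins; only the graph-wiring pattern reflects the Switchboard usage shown earlier.

let graph = SBAudioGraph()
let stt = WhisperNode()          // assumed on-device speech-to-text node
let recorder = SBRecorderNode()  // assumed recorder node for the final note
graph.addNode(stt)
graph.addNode(recorder)
graph.connect(graph.inputNode, to: stt)  // mic → transcription

stt.onTranscript = { transcript in       // assumed callback
    let summary = summarize(transcript)  // in-app summarization logic
    // Render a short clip from the summary text and capture it,
    // replacing the raw 30-second recording.
    let source = ONNXSourceNode(modelPath: modelPath)  // modelPath as in Step 2
    source.setParameter("prompt", value: summary)      // assumed prompt setter
    graph.addNode(source)
    graph.connect(source, to: recorder)  // generated audio → recording
}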
Performance and Suitability
One of the most compelling aspects of Stable Audio Open Small is its performance profile on mobile hardware. According to Stability AI, the model generates audio faster than real time on a modern smartphone: ~11 seconds of audio in under 8 seconds, which works out to roughly 1.4× real-time speed (11 ÷ 8 ≈ 1.4). With further optimizations (quantized weights, efficient runtimes, etc.), per-inference latency can be pushed lower still for very short snippets on recent devices such as the Google Pixel 7. The model's footprint also matters: at ~341 million parameters, Stable Audio Open Small packs into a few hundred megabytes with reduced-precision weights, which is very manageable for a mobile app, and loading it into memory won't strain most phones. Importantly, it doesn't require a GPU or specialized accelerator: the CPU-only design means it can run on "Arm-powered smartphones without heavy hardware requirements". Tests by Stability AI and Arm showed the model running comfortably on standard phone chipsets, aided by Arm's optimized libraries. In one demo, the team achieved roughly 7 seconds of inference time on a phone for about 11 seconds of audio output. This level of performance opens the door for real-time streaming applications.

Developers can trust that Stable Audio Open Small will not only fit on users' devices but also perform well across a wide range of mobile hardware, from flagship phones to mid-tier devices and tablets. This broad suitability is crucial for consumer apps, which need to serve users with varying device capabilities. With on-device processing, there's also a benefit in consistency and reliability: the app's audio features work in any environment (no dependence on network) and with predictable latency. In summary, Stable Audio Open Small delivers a rare combination of speed, size, and quality that makes truly real-time mobile audio AI feasible.
Developer Takeaways
Stable Audio Open Small and the Switchboard SDK together represent a breakthrough for mobile developers looking to innovate with voice and audio. We now have a production-ready, open generative audio model that can live inside a mobile app, producing sound on the fly, and a robust audio engine to seamlessly integrate it into real-time pipelines. This empowers developers to build features that were previously confined to cloud services or high-end hardware: from instantaneous voice filters and personalized soundtracks to intelligent voice notes that are cleaned up, compressed, or even creatively transformed in real time. The latency and privacy benefits of on-device processing can greatly improve user experience, making voice interactions feel snappier and more secure. And because both the model and the SDK are designed for efficiency, these advanced audio features can reach users across many device types, not just those with the latest phones. The key takeaway is that real-time audio AI on mobile is here: it's fast, accessible, and can be integrated with only modest effort thanks to Switchboard's developer-friendly framework. If you're building a mobile app that deals with voice or sound, now is the time to experiment with this technology. The combination of Stable Audio Open Small's on-device generative power and Switchboard's real-time audio pipeline can give your app a cutting-edge audio experience that sets it apart.
Want to see what else we're building? Check out Switchboard and Synervoz.