Music AI in the Browser with ONNX: Neural Synthesis and DDSP Explained
What Does It Mean to Use ONNX for Music in the Browser?
Running ONNX for music in the browser means executing machine learning models directly in the browser, with no installation, to power AI instrument synthesis, voice generation, stem separation, and more. Until the early 2020s, "making music with AI" usually meant setting up a Python environment and connecting to a GPU server. The combination of ONNX Runtime Web, WebGPU, and WebAssembly has changed that. Today, opening a WebGPU-capable browser such as Chrome is enough to get inference speeds that approach those of native applications.
This article covers everything in one place: how ONNX runs in the browser, the principles behind neural synthesis and DDSP, what WebGPU acceleration actually delivers, and tools you can use right now. Whether you're a music producer, a developer, or a researcher, you'll find something here worth digging into.
What Is ONNX — and Why Does Music AI Use It?
ONNX (Open Neural Network Exchange) is an open model format co-developed by Microsoft and Meta. It lets you convert a model trained in PyTorch or TensorFlow into a .onnx file and run inference on virtually any runtime or device. There are three main reasons ONNX has become the go-to format for music AI:
- Framework-agnostic: Models like Demucs (stem separation, built in PyTorch) and Basic Pitch (audio-to-MIDI, built in TensorFlow) can both run on the same ONNX Runtime.
- ONNX Runtime Web: By switching between a WebAssembly backend and a WebGPU backend, it delivers optimized inference on everything from CPU-only machines to GPU-equipped ones.
- Practical model sizes: Quantization (dynamic INT8, or conversion to FP16) can shrink a Demucs inference model from hundreds of megabytes to tens of megabytes, small enough to download in a browser.
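To make the size argument concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only, and the 80-million-parameter figure is a hypothetical model chosen for illustration, not a measured Demucs size.

```javascript
// Bytes used per stored weight for each numeric format.
const BYTES_PER_WEIGHT = { fp32: 4, fp16: 2, int8: 1 };

// Rough size of the weights alone, in megabytes (1 MB = 1e6 bytes here).
// Ignores graph metadata and any layers a quantizer leaves in FP32.
function estimateWeightMegabytes(parameterCount, format) {
  const bytes = BYTES_PER_WEIGHT[format];
  if (bytes === undefined) {
    throw new Error(`Unknown format: ${format}`);
  }
  return (parameterCount * bytes) / 1e6;
}

// Hypothetical 80-million-parameter model, used only for illustration.
const params = 80e6;
for (const format of ['fp32', 'fp16', 'int8']) {
  console.log(format, estimateWeightMegabytes(params, format).toFixed(0), 'MB');
}
```

Going from FP32 to INT8 gives at most a 4× reduction in weight storage, so real savings depend on how many layers the quantizer can actually convert.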
ONNX Runtime Web Backend Comparison
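ONNX Runtime Web exposes several execution providers, each with different speed and compatibility trade-offs. Four matter most here.
- cpu: The most compatible path, but the slowest. In current onnxruntime-web builds this name resolves to the WebAssembly backend rather than to a separate pure-JavaScript engine. Separating a 60-second track can take 3–5 minutes.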
- wasm (WebAssembly): Uses the CPU at near-C++ speeds. Enabling SIMD can yield a 2–3× speedup.
- webgl: Uses GPU shaders. Was the primary GPU option before WebGPU, but can produce accuracy issues in some cases.
- webgpu: Uses the latest GPU API, reaching 50–80% of native inference speed (more on this below).
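One way to turn this comparison into code is a small feature check that builds the provider list at runtime. This is a minimal sketch: `pickExecutionProviders` is a hypothetical helper, not part of ONNX Runtime Web, and the presence of `navigator.gpu` only shows that the API is exposed, not that a usable GPU adapter will be returned.

```javascript
// Returns an ordered execution-provider list for ONNX Runtime Web.
// `nav` is a navigator-like object, passed in so the function is easy to test.
function pickExecutionProviders(nav) {
  const providers = [];
  // navigator.gpu is only defined where the WebGPU API is exposed.
  if (nav && 'gpu' in nav && nav.gpu) {
    providers.push('webgpu');
  }
  // WebAssembly is the broadly available CPU path, so it always goes last.
  providers.push('wasm');
  return providers;
}

// In a page you would call pickExecutionProviders(navigator).
console.log(pickExecutionProviders({ gpu: {} }));
console.log(pickExecutionProviders({}));
```

The returned array can be passed as `executionProviders` when creating an inference session.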
How Neural Synthesis Works
Neural synthesis refers to using a neural network to directly generate and control audio waveforms, bypassing traditional signal-generation algorithms. Where a classic synthesizer follows an oscillator → filter → amplifier signal chain, a neural synth takes a more direct route: MIDI note / pitch / velocity → network → waveform.
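The front end of that route is plain arithmetic: a MIDI note number becomes a frequency before it reaches the network. A minimal sketch, assuming standard 12-tone equal temperament with A4 = 440 Hz; the linear velocity mapping is an illustrative assumption, since real models may condition on loudness differently.

```javascript
// Equal-temperament conversion: MIDI note 69 is A4.
function midiNoteToHz(note, a4Hz = 440) {
  return a4Hz * Math.pow(2, (note - 69) / 12);
}

// Map MIDI velocity (0-127) to a 0-1 gain. Linear mapping is an assumption.
function velocityToGain(velocity) {
  const clamped = Math.min(127, Math.max(0, velocity));
  return clamped / 127;
}

console.log(midiNoteToHz(60).toFixed(2)); // middle C
console.log(velocityToGain(100).toFixed(3));
```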
Key Neural Synthesis Architectures
Here's a rundown of the main architectures in active use today.
- WaveNet (Google DeepMind, 2016): Generates waveforms sample-by-sample using autoregressive prediction. Exceptional quality, but so computationally heavy it originally required TPUs for real-time output. Now primarily used for text-to-speech.
- DDSP (Differentiable Digital Signal Processing): A hybrid approach that embeds classic signal-processing components, such as oscillators and filters, inside a neural network (covered in depth below).
- NSynth (Magenta): A WaveNet-based autoencoder that learns a latent "timbre space" and interpolates between sounds.
- MIDI-DDSP (Magenta, 2022): Takes MIDI as input, predicts physical performance parameters per instrument, and synthesizes the final audio with DDSP.
- EnCodec / DAC: Neural audio codecs that work with latent representations. Used as the foundation for generative models like MusicGen.
What Is DDSP? Bridging Physical Modeling and AI
DDSP (Differentiable Digital Signal Processing) was introduced in 2020 by Google's Magenta team and is arguably one of the most significant advances in music AI. The core idea is to implement existing signal processing operations — FFT, additive synthesis, subtractive filtering — as differentiable functions that can be embedded inside a neural network.
How DDSP Works: Step by Step
- Encode: Extract fundamental frequency (F0), loudness, and a latent vector from an input audio signal (e.g., a violin recording).
- Decode: A small MLP predicts the harmonic amplitude series for additive synthesis and the coefficients for a noise filter, based on F0, loudness, and the latent vector.
- Synthesize: Differentiable additive and noise synthesizers generate a waveform from the predicted parameters.
- Compute loss: The difference between the generated and target audio (spectral loss) is backpropagated to optimize the network.
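The synthesize step is ordinary DSP once the parameters are known. The sketch below shows additive synthesis for a single frame with constant controls. It is a simplification: a real DDSP synthesizer interpolates the parameters between frames, accumulates phase from a time-varying F0, and adds a filtered-noise branch. The function name and signature are hypothetical.

```javascript
// Additive (harmonic) synthesis for one stretch of constant controls.
// f0Hz: fundamental frequency; harmonicAmps[k] scales harmonic number k + 1.
function synthesizeHarmonics(f0Hz, harmonicAmps, sampleRate, numSamples) {
  const out = new Float32Array(numSamples);
  const nyquist = sampleRate / 2;
  for (let k = 0; k < harmonicAmps.length; k++) {
    const freq = f0Hz * (k + 1);
    if (freq >= nyquist) break; // higher harmonics would alias, so stop here
    const phaseStep = (2 * Math.PI * freq) / sampleRate;
    for (let n = 0; n < numSamples; n++) {
      out[n] += harmonicAmps[k] * Math.sin(phaseStep * n);
    }
  }
  return out;
}

// 10 ms of a 440 Hz tone with three decaying harmonics at 44.1 kHz.
const frame = synthesizeHarmonics(440, [0.6, 0.3, 0.1], 44100, 441);
console.log(frame.length);
```

In training, every operation here is differentiable, which is what lets the spectral loss flow back to the network that predicts `harmonicAmps`.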
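The biggest practical advantage of DDSP is model size. Because the synthesizer itself encodes strong assumptions about how sound is structured, the network has far less to learn: quality that previously required millions of parameters (à la WaveNet) is reachable with much smaller networks, and the original paper reports usable results from a model with roughly 240,000 parameters. That's a key reason browser-based inference is feasible at all.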
What DDSP Can and Can't Do
DDSP excels at timbre transfer and time-stretching for sustained instruments — violin, flute, voice — but it's not well-suited to percussive sounds like drums. It also requires training a separate model for each instrument class, so it's best understood as a highly specialized synthesizer rather than a universal audio generator.
How WebGPU Changed DDSP and ONNX Inference
WebGPU launched as a stable feature in Chrome in 2023, succeeding WebGL as the browser's GPU API. Its impact on music AI inference is best illustrated with concrete numbers.
- Demucs (stem separation): Separating a 60-second track took roughly 4 minutes with WebAssembly (CPU only); with WebGPU it takes about 40–60 seconds, a 4–6× speedup that brings processing to roughly real time.
- Basic Pitch (audio → MIDI): Converting 5 seconds of audio took ~15 seconds on WebAssembly; WebGPU brings that down to 3–4 seconds.
- MIDI-DDSP real-time synthesis: Real-time performance was not achievable with WebGL, but WebGPU enables low-latency operation at 64–128 sample buffers.
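Two numbers are easy to confuse when reading benchmarks like these: the speedup relative to a baseline, and the real-time factor (processing time divided by audio duration, where values below 1 mean faster than real time). A small sketch using the Demucs figures quoted above:

```javascript
// How many times faster the accelerated run is than the baseline.
function speedup(baselineSeconds, acceleratedSeconds) {
  return baselineSeconds / acceleratedSeconds;
}

// Processing time per second of audio; below 1 means faster than real time.
function realTimeFactor(processingSeconds, audioSeconds) {
  return processingSeconds / audioSeconds;
}

const audioSeconds = 60;
const wasmSeconds = 240; // "roughly 4 minutes"
for (const webgpuSeconds of [60, 40]) {
  console.log(
    `speedup ${speedup(wasmSeconds, webgpuSeconds)}x,`,
    `real-time factor ${realTimeFactor(webgpuSeconds, audioSeconds).toFixed(2)}`
  );
}
```

With these inputs the WebGPU run is 4× to 6× faster than WebAssembly and sits at or slightly below a real-time factor of 1.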
WebGPU + ONNX Runtime Web: Code Overview
For developers, here's the basic setup for enabling the WebGPU backend in ONNX Runtime Web.
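```javascript
import * as ort from 'onnxruntime-web';
// Some onnxruntime-web releases ship WebGPU in the 'onnxruntime-web/webgpu' entry point.

ort.env.wasm.wasmPaths = '/ort/'; // directory that serves the runtime's .wasm files

const session = await ort.InferenceSession.create('/model.onnx', {
  executionProviders: ['webgpu', 'wasm'], // prefer WebGPU, fall back to wasm
});

// audioData: Float32Array of mono samples; the [1, 1, N] shape must match the model
const input = new ort.Tensor('float32', audioData, [1, 1, audioData.length]);
// Look up the model's real input name instead of assuming it is called "input"
const output = await session.run({ [session.inputNames[0]]: input });
```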
Simply listing 'webgpu' first in executionProviders is enough to enable GPU inference on supported browsers. Browsers without WebGPU support automatically fall back to wasm, so compatibility is maintained.
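The snippet above assumes `audioData` is already a mono `Float32Array`. Decoded audio usually arrives as separate channels, so a downmix step is often needed first. A minimal sketch, assuming the model wants mono input shaped `[1, 1, samples]`; both helper names are hypothetical.

```javascript
// Average N equal-length channels into one mono Float32Array.
function downmixToMono(channels) {
  if (channels.length === 0) {
    throw new Error('Need at least one channel');
  }
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    if (channel.length !== length) {
      throw new Error('Channels must have equal length');
    }
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i];
    }
  }
  for (let i = 0; i < length; i++) {
    mono[i] /= channels.length;
  }
  return mono;
}

// Tensor shape in the order [batch, channels, samples].
function monoTensorDims(mono) {
  return [1, 1, mono.length];
}

const left = Float32Array.of(1, 0, 0.5);
const right = Float32Array.of(0, 1, 0.5);
const mono = downmixToMono([left, right]);
console.log(Array.from(mono), monoTensorDims(mono));
```

In a page, the channel arrays would typically come from `audioBuffer.getChannelData(c)` on a decoded `AudioBuffer`.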
MIDI-DDSP: Browser AI That Generates Realistic Instruments from MIDI
Magenta's MIDI-DDSP is a practical model that combines DDSP with neural synthesis. Feed it MIDI notes, and it generates audio waveforms in real time — complete with instrument-specific performance nuances like vibrato, attack variation, and dynamic shaping. Unlike SF2 soundfonts or conventional samplers, MIDI-DDSP physically "performs" the notes, so there's none of the mechanical repetition you get from looped samples.
MIDI-DDSP Browser Implementation Flow
- Retrieve note events (pitch, velocity, start/end time) from a piano roll or MIDI file.
- The ONNX model ("note_expression_control" network) predicts expressive performance parameters for each note.
- The DDSP decoder generates harmonic amplitude series and noise coefficients from those parameters.
- DSP synthesis runs inside a Web Audio API AudioWorklet, outputting audio in real time.
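The first step of this flow can be sketched as turning note events into frame-rate control curves. This is an illustrative simplification, not the MIDI-DDSP implementation: the function name, the monophonic overwrite rule, the linear velocity mapping, and the 100 Hz frame rate in the example are all assumptions.

```javascript
// Convert note events into per-frame pitch and gain controls.
// notes: [{ pitch, velocity, start, end }] with times in seconds.
function notesToFrames(notes, frameRate, durationSeconds) {
  const numFrames = Math.ceil(durationSeconds * frameRate);
  const f0Hz = new Float32Array(numFrames); // 0 means no note is sounding
  const gain = new Float32Array(numFrames);
  for (const note of notes) {
    const first = Math.max(0, Math.round(note.start * frameRate));
    const last = Math.min(numFrames, Math.round(note.end * frameRate));
    const hz = 440 * Math.pow(2, (note.pitch - 69) / 12);
    for (let i = first; i < last; i++) {
      f0Hz[i] = hz; // later notes overwrite earlier ones (monophonic)
      gain[i] = note.velocity / 127;
    }
  }
  return { f0Hz, gain };
}

const controls = notesToFrames(
  [{ pitch: 69, velocity: 127, start: 0, end: 0.5 }],
  100, // frames per second, chosen for illustration
  1
);
console.log(controls.f0Hz.length, controls.f0Hz[0], controls.f0Hz[50]);
```

In the real model, the expression network would refine these flat curves with vibrato, attack shape, and dynamics before the DDSP decoder sees them.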
LA Studio has implemented MIDI-DDSP as a browser-based plugin instrument you can try right now. You can synthesize violin, flute, trumpet, and other orchestral instruments in real time via a piano roll — all DDSP-powered — just by opening the editor. No installation or setup required.
The Other Web APIs That Make Browser Music AI Possible
ONNX and WebGPU don't work alone. Browser music AI relies on several Web APIs working together.
- Web Audio API: Handles audio decoding, buffering, and routing. AudioWorklet moves DSP processing off the main thread.
- WebAssembly SIMD: Accelerates CPU-side FFT and matrix multiplication using SIMD instructions. DDSP's additive synthesis stage is often implemented in WASM.
- SharedArrayBuffer + Atomics: Enables zero-copy buffer sharing between the inference thread and the audio thread, minimizing latency.
- File System Access API: Caches local ONNX model files to avoid re-downloading them on each session.
- Web Workers: Offloads heavy inference to a background thread so the UI stays responsive.
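The SharedArrayBuffer + Atomics item is the least obvious of these, so here is a minimal sketch of a single-producer, single-consumer ring buffer of the kind used to pass samples from an inference worker to the audio thread. `SharedRingBuffer` is a hypothetical class. In a real app, each thread would build its own typed-array views over the same two buffers after receiving them via `postMessage`, and browsers only expose `SharedArrayBuffer` on cross-origin isolated pages.

```javascript
// Single-producer, single-consumer ring buffer over shared memory.
class SharedRingBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    // Two Int32 slots hold the read and write indices.
    this.indexBuffer = new SharedArrayBuffer(2 * Int32Array.BYTES_PER_ELEMENT);
    this.sampleBuffer = new SharedArrayBuffer(capacity * Float32Array.BYTES_PER_ELEMENT);
    this.indices = new Int32Array(this.indexBuffer); // [0] = read, [1] = write
    this.samples = new Float32Array(this.sampleBuffer);
  }

  // Producer side (e.g. the inference worker). Returns the number written.
  push(input) {
    const read = Atomics.load(this.indices, 0);
    const write = Atomics.load(this.indices, 1);
    // One slot stays empty so that read === write always means "empty".
    const free = (read - write - 1 + this.capacity) % this.capacity;
    const count = Math.min(free, input.length);
    for (let i = 0; i < count; i++) {
      this.samples[(write + i) % this.capacity] = input[i];
    }
    // Publish the new write index only after the samples are in place.
    Atomics.store(this.indices, 1, (write + count) % this.capacity);
    return count;
  }

  // Consumer side (e.g. the audio thread). Returns the number read.
  pop(output) {
    const read = Atomics.load(this.indices, 0);
    const write = Atomics.load(this.indices, 1);
    const available = (write - read + this.capacity) % this.capacity;
    const count = Math.min(available, output.length);
    for (let i = 0; i < count; i++) {
      output[i] = this.samples[(read + i) % this.capacity];
    }
    Atomics.store(this.indices, 0, (read + count) % this.capacity);
    return count;
  }
}

const ring = new SharedRingBuffer(8);
ring.push(Float32Array.of(0.1, 0.2, 0.3));
const block = new Float32Array(2);
console.log(ring.pop(block), Array.from(block));
```

Neither side blocks or allocates, which is what makes this pattern usable from an audio callback.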
Browser-Based ONNX Music Tools You Can Try Today
Here are practical tools worth knowing about — no theory required, just open and use.
- LA Studio (la-studio.cc): A browser DAW built on ONNX Runtime Web and WebGPU. Features include Demucs-powered AI stem separation, Basic Pitch-based BPM and key detection, and MIDI-DDSP neural instrument synthesis — all free, no account needed.
- Magenta Studio (browser version): A suite of MIDI generation and continuation tools from Google's Magenta team, running on TensorFlow.js.
- Basic Pitch (Spotify): A web app that converts audio files to MIDI in the browser, with accuracy that rivals desktop software. The open-source package also ships the model in ONNX format.
- ONNX Model Zoo (music models): Quantized ONNX versions of MusicGen, EnCodec, and Demucs are publicly available for integration into your own web projects.
Developer Notes: Building a Browser Music App with ONNX
Key considerations if you're building your own ONNX-based browser music application.
Model Conversion and Optimization
- Export your PyTorch model with torch.onnx.export(), then clean up unnecessary nodes with onnxsim.
- Apply quantize_dynamic() from onnxruntime.quantization for INT8 quantization, which typically reduces model size by 50–75%.
- Make sure dynamic axes (batch size, sequence length) are correctly defined; missing these is a common cause of browser inference failures.
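The export itself happens in Python, but the dynamic-axes problem shows up on the JavaScript side as a shape mismatch at run time. A hedged sketch of a pre-flight check follows. `shapeMatches` is a hypothetical helper, and the declared shape is written by hand here rather than read from any ONNX Runtime API.

```javascript
// declaredDims mixes fixed sizes (numbers) with dynamic axes (string names).
// A dynamic axis accepts any positive size; a fixed axis must match exactly.
function shapeMatches(declaredDims, actualDims) {
  if (declaredDims.length !== actualDims.length) {
    return false;
  }
  return declaredDims.every((declared, i) =>
    typeof declared === 'string' ? actualDims[i] > 0 : declared === actualDims[i]
  );
}

// Exported with a dynamic sample axis: any clip length is accepted.
console.log(shapeMatches(['batch', 1, 'samples'], [1, 1, 44100]));
// Exported with the trace-time length baked in: other lengths are rejected.
console.log(shapeMatches([1, 1, 16000], [1, 1, 44100]));
```

Running a check like this before `session.run()` turns an opaque runtime failure into a clear error message about which axis was exported as fixed.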
Audio Pipeline Design
- Confirm the expected sample rate upfront: 16 kHz for most speech models, 44.1/48 kHz for music models. A mismatch is the single biggest source of quality degradation.
- Avoid running ONNX inference inside AudioWorkletProcessor.process(). It is a synchronous callback that must return within one render quantum, so it cannot await an inference call. Instead, run async inference in a Web Worker and share results via SharedArrayBuffer.
- For low-latency performance, use 128-sample buffers (~3 ms at 44.1 kHz). For quality-first processing, batch in 4096-sample chunks.
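The buffer-size trade-off is simple arithmetic, and the batching side can be sketched in a few lines. Both helpers below are hypothetical names; zero-padding the final chunk is one common choice, not the only one.

```javascript
// Duration of one buffer in milliseconds.
function bufferLatencyMs(bufferSize, sampleRate) {
  return (bufferSize / sampleRate) * 1000;
}

// Split a signal into fixed-size chunks, zero-padding the last one.
function splitIntoChunks(samples, chunkSize) {
  const chunks = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    const chunk = new Float32Array(chunkSize); // zero-filled, so the tail is padded
    chunk.set(samples.subarray(start, start + chunkSize));
    chunks.push(chunk);
  }
  return chunks;
}

console.log(bufferLatencyMs(128, 44100).toFixed(2), 'ms');
console.log(bufferLatencyMs(4096, 44100).toFixed(2), 'ms');
console.log(splitIntoChunks(new Float32Array(10000), 4096).length, 'chunks');
```

At 44.1 kHz a 4096-sample chunk adds roughly 93 ms of buffering, which is fine for offline analysis but audible in a live instrument.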
Conclusion: Where Browser Music AI Stands — and Where It's Headed
The combination of ONNX Runtime Web, WebGPU, and DDSP is dismantling the assumption that serious AI audio processing requires cloud infrastructure or native apps. As of 2025, stem separation, pitch correction, neural instrument synthesis, and audio-to-MIDI conversion all run at practical speeds entirely in the browser. There's a privacy benefit too — your audio never leaves your device. As WebGPU adoption grows and quantization techniques improve further, it's not hard to imagine neural synthesizers becoming a standard feature in browser-based DAWs in the near future.
If you want to see this in action right now, head to LA Studio's editor to try MIDI-DDSP neural synthesis and AI stem separation. No installation, completely free — experience the cutting edge of browser music AI firsthand.
Frequently Asked Questions
Q. Do I need a server to run an ONNX model in the browser?
A. No. With ONNX Runtime Web, you can host your .onnx model file on any static host (GitHub Pages, S3, etc.) and all inference runs entirely inside the user's browser. No server-side GPU or inference endpoint is required.
Q. Can I still use ONNX music AI in browsers that don't support WebGPU?
A. Yes. The runtime automatically falls back to the WebAssembly backend when WebGPU isn't available. It'll be slower, but it works in Firefox and Safari. Safari's WebGPU support has been progressing since 2024, so fast inference across a wider range of browsers is on the horizon.
Q. How is DDSP different from a sampler (SF2/SFZ)?
A. A sampler maps pre-recorded audio clips to velocity and pitch ranges and plays them back. DDSP has the model predict physical performance parameters and synthesizes the sound mathematically — which means it can generate expressive nuances like vibrato and breath continuously. The result is that the same MIDI note sounds subtly different each time, giving it an organic, "played" quality that samplers can't easily replicate.
Q. Does browser-based ONNX music AI work on mobile devices?
A. It can, but mobile memory and GPU performance are significantly more limited than on a desktop. Large models like Demucs may run very slowly or crash due to memory constraints on a phone. Lighter models like Basic Pitch can run at usable speeds on recent high-end smartphones.
Q. Which instruments does MIDI-DDSP support?
A. Magenta's official MIDI-DDSP models cover orchestral instruments including violin, flute, clarinet, and trumpet. Each instrument requires its own ONNX model, trained on that instrument's specific harmonic character and performance style. Piano is a struck instrument and tends to be handled by a different architecture — samplers are generally a better fit than DDSP for piano synthesis.