← Notes

How DANSE actually works

I spent four years on this algorithm. Here is the version I wish someone had given me on day one: the problem it solves, the one idea that makes it work, and enough Python to convince yourself. A companion notebook runs the whole thing in about sixty lines.

Many microphones, no central machine

Picture a living room full of devices: a hearing aid, a phone, a smart speaker, a laptop. Each carries microphones and hears the same conversation from a different spot. Together they form a wireless acoustic sensor network (WASN), and more microphones over a wider area mean cleaner speech than any single device gets alone.

Two side-by-side networks of devices. Left, labelled centralized: every device connects with a red link to a central cylinder (the fusion center). Right, labelled distributed: devices connect to each other directly with dashed wireless links and no center.
Two ways to use all those microphones. On the left, every node pipes raw audio into a single fusion center. On the right, nodes trade only small fused signals, with no center left to fail.

The obvious approach ships every signal to one machine that runs a single big filter. That centralized scheme works on paper, but it floods one device with raw audio, pools everyone's microphone data in one place, and fails completely if that device drops off. Distributed processing avoids all of this: each node filters its own microphones and shares only a small compressed signal. The question DANSE answers is whether a network of nodes, each sending almost nothing, can still match the all-knowing central machine. For a large class of problems, it can, exactly.

The nodes do not even want the same thing. A binaural hearing aid must preserve spatial cues, so the left and right ears want different enhanced signals. This is node-specific estimation: every node estimates the clean speech at its own reference microphone.

The benchmark: the centralized Wiener filter

Stack every microphone signal into one long vector y\mathbf{y}. If one machine saw all of y\mathbf{y}, the best linear estimator of node kk's target, in the mean-square-error sense, is the multichannel Wiener filter:

Wk=Ryy1RssEk \mathbf{W}_k = \mathbf{R}_{yy}^{-1} \mathbf{R}_{ss} \mathbf{E}_k

Here Ryy\mathbf{R}_{yy} is the covariance of the noisy signals, Rss\mathbf{R}_{ss} that of the speech-only part, and Ek\mathbf{E}_k picks out node kk's reference microphone. You recover Rss=RyyRnn\mathbf{R}_{ss} = \mathbf{R}_{yy} - \mathbf{R}_{nn} by estimating Ryy\mathbf{R}_{yy} during speech and Rnn\mathbf{R}_{nn} during silence, gated by a voice activity detector. That is the bar DANSE has to clear.

The DANSE idea, and the one trick

DANSE gets each node to that same filter without anyone assembling y\mathbf{y}. Each node compresses its microphones into a tiny fused signal (one channel in the simplest case) and broadcasts it; in return it receives one fused channel from every other node. It then solves a local Wiener filter on its own microphones plus those fused channels, broadcasts again, and repeats, one node at a time.

Top: five nodes in a ring, with the updating node circled moving from node 1 to node 2 to node 3 across iterations, while all nodes exchange fused signals. Bottom: a zoom into one node showing local sensor signals and incoming fused signals entering an update-filter block that produces both the desired signal estimate and the fused signal broadcast out.
Nodes update one at a time, round-robin. Inside a node, a single filter does two jobs: it estimates that node's own signal, and its top block fuses the local microphones into the one channel the node broadcasts.

Everything hinges on how a node compresses its microphones. The trick is almost too simple: when a node solves its local filter, that filter splits into a block for its own microphones and a block for the fused signals. The fusion matrix is just that first block, reused. The coefficients a node uses to weigh its microphones for itself are exactly the ones it uses to compress them for everyone else. At the fixed point this discards nothing the others needed, so the local filters land on the centralized solution.

Watching it converge in Python

Everything below runs at one STFT frequency bin, which turns the convolutive mixture into plain matrix algebra. Set up a four-node network, twelve microphones, one speaker and two noise sources, then solve the centralized benchmark:

import numpy as np
rng = np.random.default_rng(0)

Mk = [3, 3, 3, 3]                 # mics per node
K, M = len(Mk), sum(Mk)           # 4 nodes, 12 mics
off = np.concatenate([[0], np.cumsum(Mk)])
Qs, Qn = 1, 2                     # 1 speaker, 2 noise sources
D = Q = Qs                        # 1 desired + 1 fused channel per node

def cplx(shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

A, B = cplx((M, Qs)), cplx((M, Qn))       # speaker / noise transfer functions
Rss = A @ A.conj().T
Rnn = B @ B.conj().T + 0.05 * np.eye(M)   # plus sensor self-noise
Ryy = Rss + Rnn

def Sel(k):                                # selector: node k's own mics
    s = np.zeros((M, Mk[k])); s[off[k]:off[k+1]] = np.eye(Mk[k]); return s
def Ref(k):                                # selector: node k's reference mic
    e = np.zeros((M, D)); e[off[k]:off[k]+D] = np.eye(D); return e

W_cen = [np.linalg.solve(Ryy, Rss @ Ref(k)) for k in range(K)]   # benchmark

Now DANSE itself. Each node builds its observation (own mics plus the fused channels from everyone else), solves a local filter, and reuses the top block as its fusion matrix:

P = [cplx((Mk[k], Q)) for k in range(K)]   # fusion matrices, random init
W_eq = [None] * K                          # each node's filter, seen globally

def stacked(u):                            # node u: own mics + others' fused channels
    cols = [Sel(u)] + [Sel(j) @ P[j] for j in range(K) if j != u]
    return np.hstack(cols)

u = 0
for _ in range(40):
    T  = stacked(u)
    Wt = np.linalg.solve(T.conj().T @ Ryy @ T, T.conj().T @ (Rss @ Ref(u)))
    P[u]    = Wt[:Mk[u]]                    # the trick: reuse the top block as fusion
    W_eq[u] = T @ Wt
    u = (u + 1) % K
A log-scale plot of relative cost gap to the centralized MWF against DANSE iteration, with one staircase-shaped curve per node, all descending from roughly 1e-1 to below 1e-11 over forty iterations.
Each node's filter converging to its centralized optimum. The staircase comes from round-robin updating: a node only steps down on its own turn.

Within a few rounds every node sits on the centralized optimum, while the link only ever carried one channel per node instead of three. That compression, with no loss of optimality, is the whole point.

What the clean demo skips

The demo uses the true covariances and a static room. Real deployments are messier, and most of my thesis was about the mess:

  • You have to estimate the statistics from data, gated by an imperfect voice activity detector and an exponential forgetting factor whose value trades tracking speed against noise.
  • Updating a fusion matrix changes the signal others are averaging, so the practical fix keeps averaging with a long forgetting factor and never resets.
  • Sequential, one-node-at-a-time updating is slow. Updating all nodes at once is faster but needs a relaxation factor to stay stable.
  • Chained filters add latency, which a weighted overlap-add framework keeps to one frame per stage.

The assumptions, stated honestly

The clean result leans on three assumptions, and much of the literature is the story of removing them: every node hears the same sources, the network is fully connected, and clocks are perfectly synchronized. GEVD-DANSE relaxes the rank condition at low signal-to-noise ratio. TI-DANSE and its faster successor drop full connectivity for arbitrary, changing topologies, which was the core of my own work. And sampling-rate-offset estimation and compensation inside DANSE handles the moment two cheap oscillators drift apart.

Where to go from here

The runnable notebook is the fastest way in: open it in Google Colab or download the .ipynb to run it locally. For the full theory, see Bertrand and Moonen's original DANSE papers (IEEE Transactions on Signal Processing, 2010), the relevant chapters of my thesis, and the linked publications.