How DANSE actually works
I spent four years on this algorithm. Here is the version I wish someone had given me on day one: the problem it solves, the one idea that makes it work, and enough Python to convince yourself. A companion notebook runs the whole thing in about sixty lines.
Many microphones, no central machine
Picture a living room full of devices: a hearing aid, a phone, a smart speaker, a laptop. Each carries microphones and hears the same conversation from a different spot. Together they form a wireless acoustic sensor network (WASN), and more microphones over a wider area mean cleaner speech than any single device gets alone.
The obvious approach ships every signal to one machine that runs a single big filter. That centralized scheme works on paper, but it floods one device with raw audio, pools everyone's microphone data in one place, and fails completely if that device drops off. Distributed processing avoids all of this: each node filters its own microphones and shares only a small compressed signal. The question DANSE answers is whether a network of nodes, each sending almost nothing, can still match the all-knowing central machine. For a large class of problems, it can, exactly.
The nodes do not even want the same thing. A binaural hearing aid must preserve spatial cues, so the left and right ears want different enhanced signals. This is node-specific estimation: every node estimates the clean speech at its own reference microphone.
The benchmark: the centralized Wiener filter
Stack every microphone signal into one long vector
Here
The DANSE idea, and the one trick
DANSE gets each node to that same filter without anyone assembling
Everything hinges on how a node compresses its microphones. The trick is almost too simple: when a node solves its local filter, that filter splits into a block for its own microphones and a block for the fused signals. The fusion matrix is just that first block, reused. The coefficients a node uses to weigh its microphones for itself are exactly the ones it uses to compress them for everyone else. At the fixed point this discards nothing the others needed, so the local filters land on the centralized solution.
Watching it converge in Python
Everything below runs at one STFT frequency bin, which turns the convolutive mixture into plain matrix algebra. Set up a four-node network, twelve microphones, one speaker and two noise sources, then solve the centralized benchmark:
import numpy as np
rng = np.random.default_rng(0)
Mk = [3, 3, 3, 3] # mics per node
K, M = len(Mk), sum(Mk) # 4 nodes, 12 mics
off = np.concatenate([[0], np.cumsum(Mk)])
Qs, Qn = 1, 2 # 1 speaker, 2 noise sources
D = Q = Qs # 1 desired + 1 fused channel per node
def cplx(shape):
return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
A, B = cplx((M, Qs)), cplx((M, Qn)) # speaker / noise transfer functions
Rss = A @ A.conj().T
Rnn = B @ B.conj().T + 0.05 * np.eye(M) # plus sensor self-noise
Ryy = Rss + Rnn
def Sel(k): # selector: node k's own mics
s = np.zeros((M, Mk[k])); s[off[k]:off[k+1]] = np.eye(Mk[k]); return s
def Ref(k): # selector: node k's reference mic
e = np.zeros((M, D)); e[off[k]:off[k]+D] = np.eye(D); return e
W_cen = [np.linalg.solve(Ryy, Rss @ Ref(k)) for k in range(K)] # benchmark
Now DANSE itself. Each node builds its observation (own mics plus the fused channels from everyone else), solves a local filter, and reuses the top block as its fusion matrix:
P = [cplx((Mk[k], Q)) for k in range(K)] # fusion matrices, random init
W_eq = [None] * K # each node's filter, seen globally
def stacked(u): # node u: own mics + others' fused channels
cols = [Sel(u)] + [Sel(j) @ P[j] for j in range(K) if j != u]
return np.hstack(cols)
u = 0
for _ in range(40):
T = stacked(u)
Wt = np.linalg.solve(T.conj().T @ Ryy @ T, T.conj().T @ (Rss @ Ref(u)))
P[u] = Wt[:Mk[u]] # the trick: reuse the top block as fusion
W_eq[u] = T @ Wt
u = (u + 1) % K
Within a few rounds every node sits on the centralized optimum, while the link only ever carried one channel per node instead of three. That compression, with no loss of optimality, is the whole point.
What the clean demo skips
The demo uses the true covariances and a static room. Real deployments are messier, and most of my thesis was about the mess:
- You have to estimate the statistics from data, gated by an imperfect voice activity detector and an exponential forgetting factor whose value trades tracking speed against noise.
- Updating a fusion matrix changes the signal others are averaging, so the practical fix keeps averaging with a long forgetting factor and never resets.
- Sequential, one-node-at-a-time updating is slow. Updating all nodes at once is faster but needs a relaxation factor to stay stable.
- Chained filters add latency, which a weighted overlap-add framework keeps to one frame per stage.
The assumptions, stated honestly
The clean result leans on three assumptions, and much of the literature is the story of removing them: every node hears the same sources, the network is fully connected, and clocks are perfectly synchronized. GEVD-DANSE relaxes the rank condition at low signal-to-noise ratio. TI-DANSE and its faster successor drop full connectivity for arbitrary, changing topologies, which was the core of my own work. And sampling-rate-offset estimation and compensation inside DANSE handles the moment two cheap oscillators drift apart.
Where to go from here
The runnable notebook is the fastest way in: open it in Google Colab or download the .ipynb to run it locally. For the full theory, see Bertrand and Moonen's original DANSE papers (IEEE Transactions on Signal Processing, 2010), the relevant chapters of my thesis, and the linked publications.