Decision Records#
This section documents the key research and development decisions made during the project.
We track decisions directly on this page for now. When formal decision records
are added, they will live under docs/explanations/decisions/.
Audio Issue Categorization#
We categorize audio quality issues at two levels:
- File Level: issues affecting the entire audio file.
  - Out-of-distribution audio
  - Duplicate files
  - Silent audio
- Segment (Audio) Level: issues affecting parts of the audio.
  - Looping artifacts
  - Gaussian noise
  - White noise
  - Background noise from other sources
To simulate these issues, we mix noise into the original audio using a Signal-to-Noise Ratio (SNR) mixer adapted from our internal experiments.
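As a rough illustration of SNR-based mixing, the sketch below scales a noise signal so that the mixture reaches a requested SNR in decibels. It is not the project's internal mixer: the function name `mix_at_snr`, the tiling of short noise clips, and the 440 Hz test tone are all assumptions made for this example.

```python
import numpy as np


def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `signal`, scaled so the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; both inputs are 1-D float arrays.
    """
    if len(noise) < len(signal):
        # Repeat the noise if it is shorter than the signal (an assumption).
        repeats = int(np.ceil(len(signal) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(signal)]

    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    if signal_power == 0 or noise_power == 0:
        raise ValueError("signal and noise must both have non-zero power")

    # SNR_dB = 10 * log10(P_signal / P_noise); solve for the required noise power.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    gain = np.sqrt(target_noise_power / noise_power)
    return signal + gain * noise


# Example: corrupt a 1-second 440 Hz tone with Gaussian white noise at 10 dB SNR.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
white = rng.standard_normal(sr)
noisy = mix_at_snr(clean, white, snr_db=10.0)
```

Lower `snr_db` values produce more heavily corrupted audio. Background noise recorded from other sources could be passed in place of `white` in the same way.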
Embedding Pooling Strategy#
Because SelfClean processes each sample independently and does not model temporal structure, we currently treat any detected issue (e.g., white noise) as a file-level problem, even when it affects only part of the file. Future work could localize issues within individual segments.
To aggregate information across time steps, we implemented three pooling strategies for embedding generation:
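    # emb has shape (batch, tokens, dim)
    if pool == "CLS":  # Use the class token (first position)
        emb = emb[:, 0, :]
    elif pool == "Mean":  # Average over the token dimension
        emb = emb.mean(dim=1)
    elif pool == "Reshape":  # Flatten all tokens of each sample into one vector
        emb = emb.reshape(emb.shape[0], -1)
    else:
        raise ValueError(f"Unknown pooling strategy: {pool}")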
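To make the trade-off between the three strategies concrete, here is a small self-contained sketch that prints the output shape of each one. It uses NumPy as a stand-in for the tensor library, assumes embeddings of shape (batch, tokens, dim), and assumes that flattening is meant to happen per sample. The function name `pool_embeddings` and the example sizes are hypothetical.

```python
import numpy as np


def pool_embeddings(emb: np.ndarray, pool: str) -> np.ndarray:
    """Pool token embeddings of shape (batch, tokens, dim) into one vector per sample."""
    if pool == "CLS":
        # Keep only the first (class) token.
        return emb[:, 0, :]
    if pool == "Mean":
        # Average over the token axis.
        return emb.mean(axis=1)
    if pool == "Reshape":
        # Concatenate all tokens of each sample into one long vector.
        return emb.reshape(emb.shape[0], -1)
    raise ValueError(f"unknown pooling strategy: {pool!r}")


# Hypothetical sizes: 4 clips, 50 tokens per clip, 768-dimensional tokens.
emb = np.zeros((4, 50, 768))
for name in ("CLS", "Mean", "Reshape"):
    print(name, pool_embeddings(emb, name).shape)
```

With these sizes, "CLS" and "Mean" both yield shape (4, 768), while "Reshape" yields (4, 38400). The flattened embedding grows with the number of tokens, so it only produces comparable vectors when every clip has the same length.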
Datasets Used#
| Dataset | Classes | Samples | Duration (s) | Sampling Rate (Hz) |
|---|---|---|---|---|
| ARCA23K | – | 17,979 | 7.92 | 44100 |
| AudioSet20K | 527 | 39,436 | 9.89 | 32000 |
| Pianos | 8 | 668 | 4.86 | 16000 |
| WMMS | 31 | 1,695 | 10.42 | 16000 |
| GTZAN | 10 | 930 | 30.02 | 22050 |