Decision Records

This section documents the key research and development decisions made during the project.

We track decisions directly on this page for now. When formal decision records are added, they will live under docs/explanations/decisions/.

Audio Issue Categorization

We categorize audio quality issues at two levels:

File Level: issues affecting the entire audio file.
- Out-of-distribution audio
- Duplicate files
- Silent audio

Segment (Audio) Level: issues affecting parts of the audio.
- Looping artifacts
- Gaussian noise
- White noise
- Background noise from other sources

To simulate these issues, we mix noise into the original audio using a Signal-to-Noise Ratio (SNR) mixer adapted from our internal experiments.
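The idea behind SNR-based mixing can be sketched as follows. This is a minimal illustration, not the project's internal mixer: the function names `mix_at_snr` and `mean_power`, the 440 Hz test tone, and the 10 dB target are all hypothetical, and plain Python lists are used only to keep the sketch dependency-free. The noise is scaled so that the ratio of signal power to noise power matches the requested SNR in decibels.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared amplitudes) of a sample sequence."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Return signal + scaled noise such that the mix has the target SNR (dB)."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    signal_power = mean_power(signal)
    noise_power = mean_power(noise)
    if signal_power == 0 or noise_power == 0:
        raise ValueError("signal and noise must both have non-zero power")
    # SNR_dB = 10 * log10(P_signal / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]


# Hypothetical inputs: one second of a 440 Hz tone at 16 kHz plus Gaussian noise.
rng = random.Random(0)
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Recover the SNR actually achieved from the injected noise component.
injected = [m - c for m, c in zip(noisy, clean)]
achieved_snr_db = 10 * math.log10(mean_power(clean) / mean_power(injected))
print(f"achieved SNR: {achieved_snr_db:.2f} dB")
```

Lower `snr_db` values inject louder noise. At 0 dB the noise carries the same power as the signal, which is a convenient sanity check when choosing corruption levels.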

Embedding Pooling Strategy

Because SelfClean processes each sample independently and does not account for temporal structure, our current approach treats any detected issue (e.g., white noise) as a file-level problem. Future work could explore finer-grained localization of issues to individual segments.

To better aggregate temporal information, we implemented three pooling strategies for embedding generation:

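# emb: token embeddings of shape (batch, tokens, dim)
if pool == "CLS":  # Use the class token (first position)
    emb = emb[:, 0, :]
elif pool == "Mean":  # Average over the token axis
    emb = emb.mean(dim=1)
elif pool == "Reshape":  # Flatten all tokens; reshape(-1) also merges the batch axis
    emb = emb.reshape(-1)
else:
    raise ValueError(f"Unknown pooling strategy: {pool!r}")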
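As a rough summary of what each strategy produces, consider a single sample whose token embeddings form a matrix E with T rows (tokens) and D columns (embedding dimensions). The symbols E, T, D, and e are introduced here for illustration only, and the class token is assumed to be the first row, matching index 0 in the code:

```latex
\begin{align*}
\text{CLS:}     \quad & e = E_{1}
                      && e \in \mathbb{R}^{D} \\
\text{Mean:}    \quad & e = \frac{1}{T} \sum_{t=1}^{T} E_{t}
                      && e \in \mathbb{R}^{D} \\
\text{Reshape:} \quad & e = [E_{1};\, E_{2};\, \dots;\, E_{T}]
                      && e \in \mathbb{R}^{T \cdot D}
\end{align*}
```

CLS and Mean yield a fixed-size vector regardless of T, whereas the Reshape output grows with the number of tokens, so it is only comparable across files when all inputs produce the same T.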

Datasets Used

Dataset	Classes	Samples	Duration (s)	Sampling Rate (Hz)
ARCA23K	–	17,979	7.92	44,100
AudioSet20K	527	39,436	9.89	32,000
Pianos	8	668	4.86	16,000
WMMS	31	1,695	10.42	16,000
GTZAN	10	930	30.02	22,050
