Decision Records

This section documents the key research and development decisions made during the project.

We track decisions directly on this page for now. When formal decision records are added, they will live under docs/explanations/decisions/.

Audio Issue Categorization

We categorize audio quality issues at two levels:

  • File Level: issues affecting the entire audio file.

    • Out-of-distribution audio

    • Duplicate files

    • Silent audio

  • Segment (Audio) Level: issues affecting parts of the audio.

    • Looping artifacts

    • Gaussian noise

    • White noise

    • Background noise from other sources

To simulate these issues, we mix noise into the original audio using a Signal-to-Noise Ratio (SNR) mixer adapted from our internal experiments.
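As a rough illustration of SNR-based mixing, the sketch below scales a noise signal so that the mix reaches a requested SNR in decibels. It is not the project's internal mixer: the names `power` and `mix_at_snr` are hypothetical, and plain Python lists stand in for audio arrays.

```python
import math


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Add `noise` to `signal`, scaled so the mix has the requested SNR (dB).

    Hypothetical helper for illustration; assumes equal-length inputs.
    """
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise gain.
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(signal, noise)]
```

Lower `snr_db` values produce a noisier mix. At 0 dB the scaled noise carries the same power as the original signal.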

Embedding Pooling Strategy

Since SelfClean processes each sample independently and does not account for temporal structure, our current approach treats any detected issue (e.g., white noise) as a file-level problem. However, future work could explore more fine-grained localization of issues within segments.

To aggregate information across time steps, we implemented three pooling strategies for embedding generation:

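# emb is assumed to have shape (batch, tokens, dim)
if pool == "CLS":  # Use the class token -> (batch, dim)
    emb = emb[:, 0, :]
elif pool == "Mean":  # Average over tokens -> (batch, dim)
    emb = emb.mean(dim=1)
elif pool == "Reshape":  # Flatten all tokens per sample -> (batch, tokens * dim)
    emb = emb.reshape(emb.shape[0], -1)
else:
    raise ValueError(f"Unknown pooling strategy: {pool}")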
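To make the effect of each strategy on the output size concrete, here is a dependency-free sketch that pools the token embeddings of a single sample. The function name `pool_tokens` and the list-of-lists representation are assumptions for illustration; the project code operates on tensors instead.

```python
def pool_tokens(tokens, pool):
    """Pool one sample's token embeddings (a list of T vectors of length D).

    Hypothetical helper mirroring the three strategies on plain lists.
    """
    if pool == "CLS":  # keep only the first (class) token -> length D
        return list(tokens[0])
    if pool == "Mean":  # average each dimension over all tokens -> length D
        return [sum(col) / len(tokens) for col in zip(*tokens)]
    if pool == "Reshape":  # concatenate all tokens -> length T * D
        return [x for tok in tokens for x in tok]
    raise ValueError(f"unknown pooling strategy: {pool!r}")


# Three tokens with two dimensions each.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cls_emb = pool_tokens(tokens, "CLS")       # length 2
mean_emb = pool_tokens(tokens, "Mean")     # length 2
flat_emb = pool_tokens(tokens, "Reshape")  # length 6
```

"CLS" and "Mean" return a vector of fixed size regardless of how many tokens the sample has, whereas the size of the "Reshape" output grows with the token count, so all inputs must share the same number of tokens for the embeddings to be comparable.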

Datasets Used

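| Dataset     | Classes | Samples | Duration (s) | Sampling Rate (Hz) |
|-------------|---------|---------|--------------|--------------------|
| ARCA23K     |         | 17,979  | 7.92         | 44100              |
| AudioSet20K | 527     | 39,436  | 9.89         | 32000              |
| Pianos      | 8       | 668     | 4.86         | 16000              |
| WMMS        | 31      | 1,695   | 10.42        | 16000              |
| GTZAN       | 10      | 930     | 30.02        | 22050              |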