selfclean_audio.datasets

Submodules

base

Base audio dataset class providing common functionality for audio loading and preprocessing.

csem

duplicate_dataset

esc50

folder

gtzan

label_error_dataset

noisy

off_topic_dataset

utils

class selfclean_audio.datasets.BaseAudioDataset(root: str | None = None, convert_mono: bool = True, sample_rate: int = 44100, target_duration_sec: float | None = None)

Base class for audio datasets with common preprocessing functionality.

Provides standardized audio loading, mono conversion, resampling, and duration handling.

Initialize base audio dataset.

Parameters:
  • root – Root directory path for the dataset

  • convert_mono – Convert stereo audio to mono if True

  • sample_rate – Target sample rate for audio (will resample if needed)

  • target_duration_sec – Target duration in seconds (will pad/trim if specified)
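The mono conversion and duration handling described above can be pictured with a small, library-free sketch. It is not the class's actual implementation: the helper names `to_mono` and `fix_duration` are hypothetical, and channel averaging and zero-padding at the end of the clip are assumptions made for illustration.

```python
from __future__ import annotations


def to_mono(channels: list[list[float]]) -> list[float]:
    """Average all channels sample by sample (assumed mono strategy)."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def fix_duration(
    samples: list[float],
    sample_rate: int,
    target_duration_sec: float | None,
) -> list[float]:
    """Trim, or zero-pad at the end, so the clip lasts target_duration_sec."""
    if target_duration_sec is None:
        return samples  # no target duration: leave the clip untouched
    target_len = int(round(sample_rate * target_duration_sec))
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


# Toy stereo clip with three frames per channel.
stereo = [[0.5, 1.0, -0.5], [0.5, 0.0, 0.5]]
mono = to_mono(stereo)
padded = fix_duration(mono, sample_rate=4, target_duration_sec=1.0)
trimmed = fix_duration(mono, sample_rate=4, target_duration_sec=0.5)
```

With a toy sample rate of 4 Hz, a one-second target yields four samples and a half-second target yields two.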

class selfclean_audio.datasets.DuplicateDataset(dataset: Dataset, frac_error: float = 0.1, n_errors: int | None = None, duplicate_strategy: Literal['exact', 'noisy', 'cropped', 'mixed', 'combined'] = 'exact', noise_level: float = 0.05, crop_ratio_range: tuple[float, float] = (0.1, 0.25), random_state: int = 42, name: str | None = None, save_to_temp: bool = True, temp_dir: str | None = None)

Dataset that creates near-duplicates by appending modified versions to the original dataset. This follows the same approach as the image domain implementation.

Parameters:
  • dataset – Original clean dataset

  • frac_error – Fraction of samples to duplicate

  • n_errors – Exact number of duplicates (overrides frac_error)

  • duplicate_strategy – Type of duplicate to create

  • noise_level – Noise level for noisy duplicates

  • crop_ratio_range – Range of crop ratios for cropped duplicates

  • random_state – Random seed for reproducibility

  • name – Dataset name for logging
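The interplay of frac_error, n_errors, and random_state can be sketched as follows. This is an illustration under assumptions, not the library's code: `pick_error_indices` is a hypothetical helper, and truncating the fraction with `int()` and sampling without replacement are guesses.

```python
from __future__ import annotations

import random


def pick_error_indices(
    n_samples: int,
    frac_error: float = 0.1,
    n_errors: int | None = None,
    random_state: int = 42,
) -> list[int]:
    """Choose which samples receive a synthetic error (hypothetical helper)."""
    # n_errors takes precedence over frac_error, as the parameter list states.
    count = n_errors if n_errors is not None else int(n_samples * frac_error)
    count = min(count, n_samples)
    # A local generator keeps the selection reproducible without
    # touching the global random state.
    rng = random.Random(random_state)
    return sorted(rng.sample(range(n_samples), count))


by_fraction = pick_error_indices(200, frac_error=0.1)
by_count = pick_error_indices(200, frac_error=0.1, n_errors=5)
```

Calling the helper twice with the same random_state returns the same indices, which is the property the seed parameter is meant to provide.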

cleanup_temp_dir() -> None

Remove the temporary directory containing synthetic duplicate files.

get_errors() -> tuple[set[tuple[int, int]], list[str]]

Return ground truth duplicate pairs. This matches the interface from the image domain.

Returns:
  • Set of (original_idx, duplicate_idx) tuples

  • List of error type names
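Because duplicates are appended to the original dataset, each ground-truth pair plausibly links an original index to an index past the original length. The sketch below illustrates that layout and an order-insensitive lookup. Both helper names are hypothetical, and the exact index layout is an assumption.

```python
from __future__ import annotations


def build_duplicate_pairs(
    n_original: int, source_indices: list[int]
) -> set[tuple[int, int]]:
    """Pair each source index with the position of its appended copy."""
    return {(src, n_original + k) for k, src in enumerate(source_indices)}


def is_known_duplicate(i: int, j: int, pairs: set[tuple[int, int]]) -> bool:
    """Check a candidate pair regardless of the order of its indices."""
    return (i, j) in pairs or (j, i) in pairs


# 100 original samples; copies of samples 3, 17 and 42 are appended.
pairs = build_duplicate_pairs(100, [3, 17, 42])
```

A detector that reports the pair (101, 17) would count as a hit here, since the lookup ignores index order.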

info()

Print dataset information.

class selfclean_audio.datasets.LabelErrorDataset(dataset: Dataset, frac_error: float = 0.1, n_errors: int | None = None, change_for_every_label: bool = False, random_state: int = 42, name: str | None = None)

Dataset that creates label errors by changing labels of selected samples. This follows the same approach as the image domain implementation.

Parameters:
  • dataset – Original clean dataset

  • frac_error – Fraction of samples with label errors

  • n_errors – Exact number of errors (overrides frac_error)

  • change_for_every_label – If True, change labels for each class separately

  • random_state – Random seed for reproducibility

  • name – Dataset name for logging

get_errors() -> list[int]

Return ground truth label error indicators. This matches the interface from the image domain.

Returns:

List of 0/1 indicating whether each sample has a label error
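The relationship between changed labels and the 0/1 indicator list can be sketched like this. `flip_labels` is a hypothetical helper, and drawing the replacement uniformly from the other classes is an assumption, not a description of the class's actual behavior.

```python
from __future__ import annotations

import random


def flip_labels(
    labels: list[int],
    error_indices: list[int],
    n_classes: int,
    random_state: int = 42,
) -> tuple[list[int], list[int]]:
    """Return (noisy_labels, indicators) for the chosen error indices."""
    rng = random.Random(random_state)
    noisy = list(labels)
    indicators = [0] * len(labels)
    for idx in error_indices:
        # Draw a replacement among all classes except the true one,
        # so a flagged sample always ends up mislabeled.
        candidates = [c for c in range(n_classes) if c != labels[idx]]
        noisy[idx] = rng.choice(candidates)
        indicators[idx] = 1
    return noisy, indicators


clean = [0, 1, 2, 1, 0]
noisy, indicators = flip_labels(clean, error_indices=[1, 3], n_classes=3)
```

The indicator list has one entry per sample, with 1 marking exactly the positions whose label was changed.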

info()

Print dataset information.

class selfclean_audio.datasets.OffTopicDataset(dataset: Dataset, contamination_dataset: Dataset | None = None, frac_error: float = 0.1, n_errors: int | None = None, contamination_strategy: Literal['external', 'noise', 'corrupted', 'combined'] = 'combined', noise_level: float = 0.5, random_state: int = 42, name: str | None = None)

Dataset that creates off-topic samples by:

  1. Adding samples from an unrelated dataset

  2. Adding pure noise samples

  3. Adding heavily corrupted versions of original samples

This follows the same approach as the image domain implementation.

Parameters:
  • dataset – Original clean dataset

  • contamination_dataset – External dataset for contamination (e.g., MUSAN for ESC50)

  • frac_error – Fraction of samples to contaminate

  • n_errors – Exact number of contaminated samples (overrides frac_error)

  • contamination_strategy – How to create off-topic samples

  • noise_level – Level of noise/corruption (0-1)

  • random_state – Random seed for reproducibility

  • name – Dataset name for logging

get_errors() -> list[int]

Return ground truth off-topic indicators. This matches the interface from the image domain.

Returns:

List of 0/1 indicating whether each sample is off-topic
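One way to use a per-sample 0/1 list like this is to score a ranking of suspicious samples. The sketch below computes precision at k. The function name and the choice of metric are illustrative assumptions, not necessarily what SelfClean reports.

```python
from __future__ import annotations


def precision_at_k(ranking: list[int], indicators: list[int], k: int) -> float:
    """Fraction of the k top-ranked sample indices that are true errors."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(indicators[i] for i in top) / len(top)


# Ground truth for six samples; samples 1 and 4 are off-topic.
indicators = [0, 1, 0, 0, 1, 0]
# Sample indices ordered from most to least suspicious by some detector.
ranking = [4, 2, 1, 0, 3, 5]

p_at_1 = precision_at_k(ranking, indicators, 1)
p_at_2 = precision_at_k(ranking, indicators, 2)
p_at_3 = precision_at_k(ranking, indicators, 3)
```

Here the top-ranked sample is a true error, the second is not, and the third is, so the precision moves from 1 to 1/2 to 2/3.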

info()

Print dataset information.

class selfclean_audio.datasets.GTZANKnownIssuesDataset(root: str | Path, issue_type: str = 'duplicates', gt_duplicates_file: str | Path | None = None, gt_prep_file: str | Path | None = None, convert_mono: bool = True, sample_rate: int = 16000, target_duration_sec: float | None = 30.0, extensions: tuple[str, ...] = ('.wav', '.mp3', '.flac'))

GTZAN dataset wrapper with built-in access to known data quality issues.

  • Exposes audio samples from a local GTZAN folder (genres/<class>/*.wav)

  • Parses ground truth CSVs with known issues from external_code

  • Provides get_errors() compatible with SelfClean evaluation:
    • For issue_type “duplicates”: returns (set of (idx_i, idx_j) pairs, [..labels..])

    • For issue_type “label_errors”: returns a list[int] of 0/1 per sample

get_errors()

Return evaluation ground truth in the format expected by SelfClean.

  • If evaluating duplicates: returns (set[(i, j)], [..labels..])

  • If evaluating label errors: returns list[int] of length len(self)
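Since the return shape depends on the issue type, calling code may need to branch on it. The sketch below shows one way to do that. `summarize_errors` and the placeholder label string are hypothetical and only illustrate the two shapes described above.

```python
from __future__ import annotations


def summarize_errors(errors) -> dict:
    """Summarize ground truth in either of the two documented shapes."""
    if isinstance(errors, tuple):
        # Duplicates: (set of index pairs, list of labels).
        pairs, _labels = errors
        return {"kind": "duplicates", "n_errors": len(pairs)}
    # Label errors: one 0/1 flag per sample.
    return {"kind": "per_sample", "n_errors": sum(errors)}


# Placeholder values standing in for what get_errors() might return.
duplicate_gt = ({(0, 5), (2, 7)}, ["example_label"])
label_error_gt = [0, 1, 0, 0, 1]

dup_summary = summarize_errors(duplicate_gt)
label_summary = summarize_errors(label_error_gt)
```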

class selfclean_audio.datasets.CSEMMembranePumps(root: str | Path, convert_mono: bool = True, sample_rate: int = 16000, target_duration_sec: float | None = None, index_file: str | Path = 'index.csv', files_dir: str | Path = 'files')

CSEM Membrane Pump Audio Dataset loader.

Expects the following structure under root (see data/CSEM/README.md):

root/
    files/
        {guid}.wav
        ...
    index.csv   # columns: id, filename, label

The dataset returns tuples of (waveform, absolute_path, label). No noisy_label is available for this dataset; the synthetic/noise wrappers set it when applicable, and downstream code otherwise treats it as 0.
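The directory layout above suggests how an index could be turned into (path, label) entries. The following is a minimal sketch using only the standard library. `read_index`, the example rows, and the label values are hypothetical, and the real loader may parse the file differently.

```python
from __future__ import annotations

import csv
import io
from pathlib import Path


def read_index(
    index_text: str, root: str, files_dir: str = "files"
) -> list[tuple[str, str]]:
    """Map each index row to (audio file path, label)."""
    reader = csv.DictReader(io.StringIO(index_text))
    return [
        (str(Path(root) / files_dir / row["filename"]), row["label"])
        for row in reader
    ]


# Made-up index content with the documented columns: id, filename, label.
example_index = "id,filename,label\n1,aaa.wav,ok\n2,bbb.wav,faulty\n"
rows = read_index(example_index, root="/data/CSEM")
```

The index text is passed as a string here so the sketch runs without any files on disk; reading it from root/index.csv would be the natural variation.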

get_errors()

Return dummy ground truth for CSEM dataset (no ground truth available).

Returns empty lists to bypass scoring requirements while allowing ranking generation to proceed.

selfclean_audio.datasets.extract_sample(sample_tuple: tuple) -> Tuple[Tensor, str | None, int, Tensor]

Normalize dataset samples to a common shape.

Accept tuples returned by various dataset implementations and return a tuple of (waveform, path, label, noisy_label) where:

  • waveform: torch.Tensor, shape (C, T) or (T,)

  • path: Optional[str] (file path or None if unavailable)

  • label: int (class index)

  • noisy_label: torch.Tensor scalar long, 0 for clean, 1 for noisy

This helper avoids repeated ad-hoc tuple length checks across the codebase.
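The kind of tuple-length handling this helper centralizes can be sketched as below. This is not the actual `extract_sample`: `normalize_sample` is a hypothetical stand-in, the accepted tuple lengths are assumptions, and noisy_label is kept as a plain int rather than a torch.Tensor so the sketch runs without PyTorch.

```python
from __future__ import annotations


def normalize_sample(sample: tuple) -> tuple:
    """Return (waveform, path, label, noisy_label) for tuples of length 2 to 4."""
    if len(sample) == 4:
        waveform, path, label, noisy_label = sample
    elif len(sample) == 3:
        # e.g. (waveform, absolute_path, label); treated as clean.
        waveform, path, label = sample
        noisy_label = 0
    elif len(sample) == 2:
        # Assumed shape (waveform, label) with no path available.
        waveform, label = sample
        path, noisy_label = None, 0
    else:
        raise ValueError(f"Unsupported sample length: {len(sample)}")
    return waveform, path, int(label), int(noisy_label)


waveform = [0.0, 0.1, -0.1]  # stand-in for an audio tensor
from_pair = normalize_sample((waveform, 3))
from_triple = normalize_sample((waveform, "/tmp/a.wav", 3))
from_quad = normalize_sample((waveform, "/tmp/a.wav", 3, 1))
```

Downstream code can then unpack four values in every case instead of checking the tuple length itself.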