`selfclean_audio.datasets`

Submodules

`base`	Base audio dataset class providing common functionality for audio loading and preprocessing.
`csem`
`duplicate_dataset`
`esc50`
`folder`
`gtzan`
`label_error_dataset`
`noisy`
`off_topic_dataset`
`utils`

class selfclean_audio.datasets.BaseAudioDataset(root: str | None = None, convert_mono: bool = True, sample_rate: int = 44100, target_duration_sec: float | None = None)

Base class for audio datasets with common preprocessing functionality.

Provides standardized audio loading, mono conversion, resampling, and duration handling.

Initialize base audio dataset.

Parameters:

root – Root directory path for the dataset
convert_mono – Convert stereo audio to mono if True
sample_rate – Target sample rate for audio (will resample if needed)
target_duration_sec – Target duration in seconds (will pad/trim if specified)
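The mono conversion and pad/trim steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `to_mono` and `fit_duration` are hypothetical, channels are averaged, and padding is done with trailing zeros. The actual class may behave differently (for example, it presumably operates on tensors and also resamples, which is not shown here).

```python
def to_mono(channels):
    """Average equal-length channels into a single channel (assumed strategy)."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def fit_duration(samples, sample_rate, target_duration_sec=None):
    """Trim or zero-pad samples to sample_rate * target_duration_sec values."""
    if target_duration_sec is None:
        # No target duration requested: leave the length unchanged.
        return list(samples)
    target_len = int(round(sample_rate * target_duration_sec))
    if len(samples) >= target_len:
        # Too long: keep the first target_len samples (assumed trimming rule).
        return list(samples[:target_len])
    # Too short: append zeros at the end (assumed padding rule).
    return list(samples) + [0.0] * (target_len - len(samples))


stereo = [[1.0, 3.0], [3.0, 5.0]]
mono = to_mono(stereo)
padded = fit_duration(mono, sample_rate=4, target_duration_sec=1.0)
```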

class selfclean_audio.datasets.DuplicateDataset(dataset: Dataset, frac_error: float = 0.1, n_errors: int | None = None, duplicate_strategy: Literal['exact', 'noisy', 'cropped', 'mixed', 'combined'] = 'exact', noise_level: float = 0.05, crop_ratio_range: tuple[float, float] = (0.1, 0.25), random_state: int = 42, name: str | None = None, save_to_temp: bool = True, temp_dir: str | None = None)

Dataset that creates near-duplicates by appending modified versions to the original dataset. This follows the same approach as the image domain implementation.

Parameters:

dataset – Original clean dataset
frac_error – Fraction of samples to duplicate
n_errors – Exact number of duplicates (overrides frac_error)
duplicate_strategy – Type of duplicate to create
noise_level – Noise level for noisy duplicates
crop_ratio_range – Range of crop ratios for cropped duplicates
random_state – Random seed for reproducibility
name – Dataset name for logging

cleanup_temp_dir() → None

Remove the temporary directory containing synthetic duplicate files.

get_errors() → tuple[set[tuple[int, int]], list[str]]

Return ground truth duplicate pairs. This matches the interface from the image domain.

Returns:

Set of (original_idx, duplicate_idx) tuples
List of error type names

info()

Print dataset information.
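To make the shape of the ground truth pairs concrete, here is a minimal sketch of how duplicates appended after the original samples could be indexed. The function `plan_duplicates` is hypothetical and not part of the package; it assumes duplicates receive the indices that follow the original dataset and that `n_errors`, when given, takes precedence over `frac_error` as documented above. The list of error type names is not modelled.

```python
import random


def plan_duplicates(n_samples, frac_error=0.1, n_errors=None, random_state=42):
    """Return (pairs, new_length) for duplicates appended after the originals."""
    # n_errors overrides frac_error when it is provided.
    if n_errors is not None:
        n_dup = n_errors
    else:
        n_dup = int(round(n_samples * frac_error))
    n_dup = min(n_dup, n_samples)

    rng = random.Random(random_state)
    originals = rng.sample(range(n_samples), n_dup)

    # The k-th duplicate is appended at index n_samples + k (assumed layout).
    pairs = {(orig, n_samples + k) for k, orig in enumerate(originals)}
    return pairs, n_samples + n_dup


pairs, new_length = plan_duplicates(20, frac_error=0.25)
```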

class selfclean_audio.datasets.LabelErrorDataset(dataset: Dataset, frac_error: float = 0.1, n_errors: int | None = None, change_for_every_label: bool = False, random_state: int = 42, name: str | None = None)

Dataset that creates label errors by changing labels of selected samples. This follows the same approach as the image domain implementation.

Parameters:

dataset – Original clean dataset
frac_error – Fraction of samples with label errors
n_errors – Exact number of errors (overrides frac_error)
change_for_every_label – If True, change labels for each class separately
random_state – Random seed for reproducibility
name – Dataset name for logging

get_errors() → list[int]

Return ground truth label error indicators. This matches the interface from the image domain.

Returns:

List of 0/1 indicating whether each sample has a label error

info()

Print dataset information.
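The relationship between the changed labels and the 0/1 indicator list can be illustrated with a small stand-alone sketch. The helper `inject_label_errors` is hypothetical; it assumes each selected sample receives a different class chosen uniformly at random, and it does not model `change_for_every_label`.

```python
import random


def inject_label_errors(labels, frac_error=0.1, n_errors=None, random_state=42):
    """Return (noisy_labels, flags) where flags[i] == 1 marks a changed label."""
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes to change a label")

    # n_errors overrides frac_error when it is provided.
    if n_errors is not None:
        n_changed = n_errors
    else:
        n_changed = int(round(len(labels) * frac_error))
    n_changed = min(n_changed, len(labels))

    rng = random.Random(random_state)
    chosen = rng.sample(range(len(labels)), n_changed)

    noisy = list(labels)
    flags = [0] * len(labels)
    for i in chosen:
        # Pick any class other than the true one (assumed flipping rule).
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
        flags[i] = 1
    return noisy, flags


clean = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
noisy, flags = inject_label_errors(clean, n_errors=3)
```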

class selfclean_audio.datasets.OffTopicDataset(dataset: Dataset, contamination_dataset: Dataset | None = None, frac_error: float = 0.1, n_errors: int | None = None, contamination_strategy: Literal['external', 'noise', 'corrupted', 'combined'] = 'combined', noise_level: float = 0.5, random_state: int = 42, name: str | None = None)

Dataset that creates off-topic samples by:

1. Adding samples from an unrelated dataset
2. Adding pure noise samples
3. Adding heavily corrupted versions of original samples

This follows the same approach as the image domain implementation.

Parameters:

dataset – Original clean dataset
contamination_dataset – External dataset for contamination (e.g., MUSAN for ESC50)
frac_error – Fraction of samples to contaminate
n_errors – Exact number of contaminated samples (overrides frac_error)
contamination_strategy – How to create off-topic samples
noise_level – Level of noise/corruption (0-1)
random_state – Random seed for reproducibility
name – Dataset name for logging

get_errors() → list[int]

Return ground truth off-topic indicators. This matches the interface from the image domain.

Returns:

List of 0/1 indicating whether each sample is off-topic

info()

Print dataset information.
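As an illustration of the pure noise strategy and its indicator list, the following sketch appends uniform noise samples to a list of clean waveforms. Everything here is an assumption made for the example: the helper name `add_noise_samples`, the use of uniform noise scaled by `noise_level`, and the placement of the off-topic samples at the end of the dataset.

```python
import random


def add_noise_samples(waveforms, frac_error=0.1, n_errors=None,
                      noise_level=0.5, random_state=42):
    """Append pure-noise waveforms and return (all_waveforms, flags)."""
    # n_errors overrides frac_error when it is provided.
    if n_errors is not None:
        n_off = n_errors
    else:
        n_off = int(round(len(waveforms) * frac_error))

    rng = random.Random(random_state)
    length = len(waveforms[0]) if waveforms else 0

    # Uniform noise in [-noise_level, noise_level] (assumed noise model).
    noise = [
        [rng.uniform(-noise_level, noise_level) for _ in range(length)]
        for _ in range(n_off)
    ]
    flags = [0] * len(waveforms) + [1] * n_off
    return list(waveforms) + noise, flags


clean = [[0.0] * 8 for _ in range(10)]
mixed, flags = add_noise_samples(clean, frac_error=0.2)
```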

class selfclean_audio.datasets.GTZANKnownIssuesDataset(root: str | Path, issue_type: str = 'duplicates', gt_duplicates_file: str | Path | None = None, gt_prep_file: str | Path | None = None, convert_mono: bool = True, sample_rate: int = 16000, target_duration_sec: float | None = 30.0, extensions: tuple[str, ...] = ('.wav', '.mp3', '.flac'))

GTZAN dataset wrapper with built-in access to known data quality issues.

- Exposes audio samples from a local GTZAN folder (genres/<class>/*.wav)
- Parses ground truth CSVs with known issues from external_code
- Provides get_errors() compatible with SelfClean evaluation:
  - For ISSUE_TYPE “duplicates”: returns (set of (idx_i, idx_j) pairs, [..labels..])
  - For ISSUE_TYPE “label_errors”: returns a list[int] of 0/1 per sample

get_errors()

Return evaluation ground truth in the format expected by SelfClean.

If evaluating duplicates: returns (set[(i, j)], [..labels..])
If evaluating label errors: returns list[int] of length len(self)
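Because get_errors() returns one of two shapes depending on the issue type, calling code has to tell them apart. A minimal sketch of such a dispatch follows; `summarize_errors` is a hypothetical helper that relies only on the two return formats described above.

```python
def summarize_errors(errors):
    """Summarize either ground truth format returned by get_errors()."""
    if isinstance(errors, tuple):
        # Duplicates format: (set of (i, j) index pairs, list of labels).
        pairs, _labels = errors
        return {"kind": "duplicates", "count": len(pairs)}
    # Label error format: one 0/1 flag per sample.
    return {"kind": "label_errors", "count": sum(errors)}


dup_summary = summarize_errors(({(0, 5), (2, 7)}, ["a", "b"]))
label_summary = summarize_errors([0, 1, 0, 1, 1])
```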

class selfclean_audio.datasets.CSEMMembranePumps(root: str | Path, convert_mono: bool = True, sample_rate: int = 16000, target_duration_sec: float | None = None, index_file: str | Path = 'index.csv', files_dir: str | Path = 'files')

CSEM Membrane Pump Audio Dataset loader.

Expects the following structure under root (see data/CSEM/README.md):

root/
    files/
        {guid}.wav
        ...
    index.csv   # columns: id, filename, label

The dataset returns tuples of (waveform, absolute_path, label). The noisy_label value is not known for this dataset; the synthetic/noise wrappers set it when applicable, and downstream code otherwise treats it as 0.
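The directory layout above can be read with the standard library alone. The sketch below is not the loader's actual code: `read_index` is a hypothetical helper that resolves each filename against the files directory and keeps the label as the raw string from the CSV. The file name `a1.wav` and the label `normal` are made-up values used only to exercise it.

```python
import csv
import tempfile
from pathlib import Path


def read_index(root, index_file="index.csv", files_dir="files"):
    """Return a list of (absolute_path, label) pairs from the index file."""
    root = Path(root)
    entries = []
    with open(root / index_file, newline="") as handle:
        # Expected columns: id, filename, label.
        for row in csv.DictReader(handle):
            path = (root / files_dir / row["filename"]).resolve()
            entries.append((str(path), row["label"]))
    return entries


# Build a tiny throwaway dataset to show the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "files").mkdir()
    (demo_root / "files" / "a1.wav").touch()
    (demo_root / "index.csv").write_text("id,filename,label\n1,a1.wav,normal\n")
    entries = read_index(demo_root)
```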

get_errors()

Return dummy ground truth for CSEM dataset (no ground truth available).

Returns empty lists to bypass scoring requirements while allowing ranking generation to proceed.

selfclean_audio.datasets.extract_sample(sample_tuple: tuple) → Tuple[Tensor, str | None, int, Tensor]

Normalize dataset samples to a common shape.

Accept tuples returned by various dataset implementations and return a tuple of (waveform, path, label, noisy_label) where:
- waveform: torch.Tensor, shape (C, T) or (T,)
- path: Optional[str] (file path or None if unavailable)
- label: int (class index)
- noisy_label: torch.Tensor scalar long, 0 for clean, 1 for noisy

This helper avoids repeated ad-hoc tuple length checks across the codebase.
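The idea of normalizing tuples of different lengths can be sketched without any dependencies. The function below is a simplified stand-in, not the library implementation: `normalize_sample` is a hypothetical name, the accepted tuple layouts (2, 3, or 4 elements) are assumptions, and noisy_label is returned as a plain int rather than a tensor.

```python
def normalize_sample(sample):
    """Return (waveform, path, label, noisy_label) for 2-, 3- or 4-tuples."""
    if len(sample) == 4:
        waveform, path, label, noisy_label = sample
    elif len(sample) == 3:
        # Assumed layout: (waveform, path, label); treat the sample as clean.
        waveform, path, label = sample
        noisy_label = 0
    elif len(sample) == 2:
        # Assumed layout: (waveform, label); no file path is available.
        waveform, label = sample
        path, noisy_label = None, 0
    else:
        raise ValueError(f"unsupported sample length: {len(sample)}")
    return waveform, path, int(label), int(noisy_label)


two = normalize_sample(([0.0, 0.1], 3))
three = normalize_sample(([0.0, 0.1], "/data/a.wav", 2))
four = normalize_sample(([0.0, 0.1], "/data/a.wav", 2, 1))
```

Centralizing the length check in one function means each caller can unpack four values unconditionally.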
