selfclean_audio.datasets#
Submodules
Base audio dataset class providing common functionality for audio loading and preprocessing.
- class selfclean_audio.datasets.BaseAudioDataset(root: str | None = None, convert_mono: bool = True, sample_rate: int = 44100, target_duration_sec: float | None = None)[source]#
Base class for audio datasets with common preprocessing functionality.
Provides standardized audio loading, mono conversion, resampling, and duration handling.
Initialize base audio dataset.
- Parameters:
root – Root directory path for the dataset
convert_mono – Convert stereo audio to mono if True
sample_rate – Target sample rate for audio (will resample if needed)
target_duration_sec – Target duration in seconds (will pad/trim if specified)
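The mono conversion and duration handling described above can be sketched as follows. This is a minimal illustration using NumPy arrays as a stand-in for torch tensors; the `preprocess` function is hypothetical and not part of the library API (resampling is omitted for brevity):

```python
import numpy as np

def preprocess(waveform, convert_mono=True, target_duration_sec=None,
               sample_rate=44100):
    """Hypothetical sketch of mono conversion and pad/trim to a target duration.

    waveform: array of shape (channels, samples).
    """
    if convert_mono and waveform.ndim == 2 and waveform.shape[0] > 1:
        # Collapse stereo to mono by averaging channels.
        waveform = waveform.mean(axis=0, keepdims=True)
    if target_duration_sec is not None:
        target_len = int(target_duration_sec * sample_rate)
        cur_len = waveform.shape[-1]
        if cur_len > target_len:
            waveform = waveform[..., :target_len]            # trim the tail
        elif cur_len < target_len:
            pad = target_len - cur_len
            waveform = np.pad(waveform, ((0, 0), (0, pad)))  # zero-pad the tail
    return waveform

stereo = np.ones((2, 1000))
out = preprocess(stereo, target_duration_sec=0.05, sample_rate=44100)
print(out.shape)  # (1, 2205)
```

With `target_duration_sec=0.05` at 44100 Hz the target length is 2205 samples, so the 1000-sample clip is zero-padded.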
- class selfclean_audio.datasets.DuplicateDataset(dataset: Dataset, frac_error: float = 0.1, n_errors: int | None = None, duplicate_strategy: Literal['exact', 'noisy', 'cropped', 'mixed', 'combined'] = 'exact', noise_level: float = 0.05, crop_ratio_range: tuple[float, float] = (0.1, 0.25), random_state: int = 42, name: str | None = None, save_to_temp: bool = True, temp_dir: str | None = None)[source]#
Dataset that creates near-duplicates by appending modified versions to the original dataset. This follows the same approach as the image domain implementation.
- Parameters:
dataset – Original clean dataset
frac_error – Fraction of samples to duplicate
n_errors – Exact number of duplicates (overrides frac_error)
duplicate_strategy – Type of duplicate to create
noise_level – Noise level for noisy duplicates
crop_ratio_range – Range of crop ratios for cropped duplicates
random_state – Random seed for reproducibility
name – Dataset name for logging
- cleanup_temp_dir() None[source]#
Remove the temporary directory containing synthetic duplicate files.
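The `'noisy'` and `'cropped'` duplicate strategies can be sketched like this. The helper functions are hypothetical illustrations (NumPy stands in for torch tensors); the library's exact noise scaling and crop placement may differ:

```python
import numpy as np

rng = np.random.default_rng(42)  # mirrors random_state=42

def make_noisy_duplicate(waveform, noise_level=0.05, rng=rng):
    # Add Gaussian noise scaled by the signal's RMS amplitude.
    rms = np.sqrt(np.mean(waveform ** 2))
    return waveform + rng.normal(0.0, noise_level * rms, size=waveform.shape)

def make_cropped_duplicate(waveform, crop_ratio_range=(0.1, 0.25), rng=rng):
    # Drop a random fraction of samples from the end of the clip.
    ratio = rng.uniform(*crop_ratio_range)
    keep = int(waveform.shape[-1] * (1.0 - ratio))
    return waveform[..., :keep]

x = np.sin(np.linspace(0, 8 * np.pi, 16000))
noisy = make_noisy_duplicate(x)        # same length, slightly perturbed
cropped = make_cropped_duplicate(x)    # 75-90% of the original length
```

With `crop_ratio_range=(0.1, 0.25)` the cropped copy keeps between 75% and 90% of the samples, which is enough to make it a near- rather than exact duplicate.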
- class selfclean_audio.datasets.LabelErrorDataset(dataset: Dataset, frac_error: float = 0.1, n_errors: int | None = None, change_for_every_label: bool = False, random_state: int = 42, name: str | None = None)[source]#
Dataset that creates label errors by changing labels of selected samples. This follows the same approach as the image domain implementation.
- Parameters:
dataset – Original clean dataset
frac_error – Fraction of samples with label errors
n_errors – Exact number of errors (overrides frac_error)
change_for_every_label – If True, change labels for each class separately
random_state – Random seed for reproducibility
name – Dataset name for logging
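The core label-flipping logic can be sketched as follows; `flip_labels` is a hypothetical stand-alone illustration of the `frac_error`/`n_errors`/`random_state` semantics, not the library's implementation:

```python
import numpy as np

def flip_labels(labels, frac_error=0.1, n_errors=None, random_state=42):
    """Flip the labels of randomly chosen samples to a different class."""
    rng = np.random.default_rng(random_state)
    labels = np.asarray(labels).copy()
    classes = np.unique(labels)
    # n_errors, when given, overrides the fraction-based count.
    n = n_errors if n_errors is not None else int(round(frac_error * len(labels)))
    idx = rng.choice(len(labels), size=n, replace=False)
    for i in idx:
        # Draw a replacement label that differs from the original one.
        wrong = classes[classes != labels[i]]
        labels[i] = rng.choice(wrong)
    return labels, set(idx.tolist())

labels = [0] * 50 + [1] * 50
noisy, flipped = flip_labels(labels, frac_error=0.1)
print(len(flipped))  # 10
```

The returned index set plays the role of the ground truth that a detector is later evaluated against.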
- class selfclean_audio.datasets.OffTopicDataset(dataset: Dataset, contamination_dataset: Dataset | None = None, frac_error: float = 0.1, n_errors: int | None = None, contamination_strategy: Literal['external', 'noise', 'corrupted', 'combined'] = 'combined', noise_level: float = 0.5, random_state: int = 42, name: str | None = None)[source]#
Dataset that creates off-topic samples by:
1. Adding samples from an unrelated dataset
2. Adding pure noise samples
3. Adding heavily corrupted versions of original samples
This follows the same approach as the image domain implementation.
- Parameters:
dataset – Original clean dataset
contamination_dataset – External dataset for contamination (e.g., MUSAN for ESC50)
frac_error – Fraction of samples to contaminate
n_errors – Exact number of contaminated samples (overrides frac_error)
contamination_strategy – How to create off-topic samples
noise_level – Level of noise/corruption (0-1)
random_state – Random seed for reproducibility
name – Dataset name for logging
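Two of the contamination modes (pure noise and heavy corruption) can be sketched like this. `make_offtopic` is a hypothetical illustration using NumPy in place of torch tensors; the library's noise distribution may differ:

```python
import numpy as np

def make_offtopic(waveform, strategy="noise", noise_level=0.5, rng=None):
    """Sketch of two contamination modes: pure noise and heavy corruption."""
    rng = rng or np.random.default_rng(42)
    if strategy == "noise":
        # Replace the clip entirely with white noise of the same length.
        return rng.uniform(-1.0, 1.0, size=waveform.shape)
    if strategy == "corrupted":
        # Blend the original with strong noise; noise_level in [0, 1]
        # controls how much of the original content survives.
        noise = rng.uniform(-1.0, 1.0, size=waveform.shape)
        return (1.0 - noise_level) * waveform + noise_level * noise
    raise ValueError(f"unknown strategy: {strategy}")

x = np.zeros(8000)
pure = make_offtopic(x, "noise")
mixed = make_offtopic(x, "corrupted", noise_level=0.5)
```

The `'external'` mode (drawing clips from e.g. MUSAN) needs no transformation at all: the contaminating samples are simply appended as-is.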
- class selfclean_audio.datasets.GTZANKnownIssuesDataset(root: str | Path, issue_type: str = 'duplicates', gt_duplicates_file: str | Path | None = None, gt_prep_file: str | Path | None = None, convert_mono: bool = True, sample_rate: int = 16000, target_duration_sec: float | None = 30.0, extensions: tuple[str, ...] = ('.wav', '.mp3', '.flac'))[source]#
GTZAN dataset wrapper with built-in access to known data quality issues.
Exposes audio samples from a local GTZAN folder (genres/<class>/*.wav).
Parses ground-truth CSVs with known issues from external_code.
Provides get_errors() compatible with SelfClean evaluation:
- For issue_type “duplicates”: returns (set of (idx_i, idx_j) pairs, [..labels..])
- For issue_type “label_errors”: returns a list[int] of 0/1 per sample
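Given the pair-set ground truth returned for the “duplicates” issue type, a detector's predicted pairs can be scored with ordinary set operations. This is a hypothetical evaluation sketch, not a library function:

```python
def duplicate_pair_metrics(predicted_pairs, gt_pairs):
    """Precision/recall over unordered index pairs like those from get_errors()."""
    norm = lambda pairs: {tuple(sorted(p)) for p in pairs}  # (2, 1) == (1, 2)
    pred, gt = norm(predicted_pairs), norm(gt_pairs)
    tp = len(pred & gt)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall

p, r = duplicate_pair_metrics({(1, 2), (3, 4)}, {(2, 1), (5, 6)})
print(p, r)  # 0.5 0.5
```

Normalizing each pair to sorted order matters because (i, j) and (j, i) denote the same duplicate relation.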
- class selfclean_audio.datasets.CSEMMembranePumps(root: str | Path, convert_mono: bool = True, sample_rate: int = 16000, target_duration_sec: float | None = None, index_file: str | Path = 'index.csv', files_dir: str | Path = 'files')[source]#
CSEM Membrane Pump Audio Dataset loader.
Expects the following structure under root (see data/CSEM/README.md):

root/
  files/
    {guid}.wav
    ...
  index.csv   # columns: id, filename, label
The dataset returns tuples of (waveform, absolute_path, label).
noisy_label is not known in advance; it is set by the synthetic/noise wrappers when applicable and otherwise treated as 0 by downstream code.
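Reading the documented index.csv columns (id, filename, label) can be sketched with the stdlib csv module. The file contents and label values below are hypothetical examples, as is the `read_index` helper:

```python
import csv
import io

# Hypothetical index.csv contents (columns documented above: id, filename, label).
index_csv = """id,filename,label
a1,files/a1.wav,ok
b2,files/b2.wav,faulty
"""

def read_index(fileobj):
    """Map sample id -> (filename, label) from an index.csv-style file."""
    return {row["id"]: (row["filename"], row["label"])
            for row in csv.DictReader(fileobj)}

index = read_index(io.StringIO(index_csv))
print(index["a1"])  # ('files/a1.wav', 'ok')
```

In the real dataset the same mapping would be built from root/index.csv, with filenames resolved against the files/ directory.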
- selfclean_audio.datasets.extract_sample(sample_tuple: tuple) Tuple[Tensor, str | None, int, Tensor][source]#
Normalize dataset samples to a common shape.
Accept tuples returned by various dataset implementations and return a tuple of (waveform, path, label, noisy_label) where:
- waveform: torch.Tensor, shape (C, T) or (T,)
- path: Optional[str] (file path or None if unavailable)
- label: int (class index)
- noisy_label: torch.Tensor scalar long, 0 for clean, 1 for noisy
This helper avoids repeated ad-hoc tuple length checks across the codebase.
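One plausible normalization strategy is sketched below: classify trailing tuple elements by type (strings as paths, integers as labels) and default `noisy_label` to 0 (clean). This is a hypothetical illustration; the library's actual dispatch rules may differ:

```python
def extract_sample_sketch(sample_tuple):
    """Normalize variable-length dataset tuples to
    (waveform, path, label, noisy_label)."""
    waveform, rest = sample_tuple[0], list(sample_tuple[1:])
    # Treat a string element as the file path (may be absent).
    path = next((x for x in rest if isinstance(x, str)), None)
    # Integer elements are interpreted as label, then noisy_label.
    ints = [x for x in rest if isinstance(x, int)]
    label = ints[0] if ints else 0
    noisy_label = ints[1] if len(ints) > 1 else 0  # default: clean
    return waveform, path, label, noisy_label

print(extract_sample_sketch(([0.0], "/a.wav", 3)))  # ([0.0], '/a.wav', 3, 0)
print(extract_sample_sketch(([0.0], 2)))            # ([0.0], None, 2, 0)
```

Centralizing this logic is what lets the rest of the codebase drop its ad-hoc tuple length checks.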