selfclean_audio.datasets.off_topic_dataset
Members
- class selfclean_audio.datasets.off_topic_dataset.OffTopicDataset(dataset: Dataset, contamination_dataset: Dataset | None = None, frac_error: float = 0.1, n_errors: int | None = None, contamination_strategy: Literal['external', 'noise', 'corrupted', 'combined'] = 'combined', noise_level: float = 0.5, random_state: int = 42, name: str | None = None)
Dataset that creates off-topic samples by:
1. Adding samples from an unrelated dataset
2. Adding pure noise samples
3. Adding heavily corrupted versions of original samples
This follows the same approach as the image domain implementation.
- Parameters:
dataset – Original clean dataset
contamination_dataset – External dataset for contamination (e.g., MUSAN for ESC50)
frac_error – Fraction of samples to contaminate (default 0.1)
n_errors – Exact number of contaminated samples (overrides frac_error when set)
contamination_strategy – How to create off-topic samples: 'external', 'noise', 'corrupted', or 'combined' (default)
noise_level – Level of noise/corruption, from 0 to 1 (default 0.5)
random_state – Random seed for reproducibility
name – Dataset name for logging
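
The interaction between these parameters can be illustrated with a small standalone sketch. It does not reproduce the OffTopicDataset code: the helper names `choose_contaminated_indices` and `make_off_topic` are hypothetical, waveforms are plain lists of floats, and two behaviours are assumptions made for illustration only (that 'combined' picks one of the other strategies per sample, and that noise_level acts as a linear mixing weight between signal and Gaussian noise). Only the documented rule that n_errors overrides frac_error is taken from the parameter list above.

```python
import random


def choose_contaminated_indices(n_samples, frac_error=0.1, n_errors=None, random_state=42):
    """Pick which sample indices become off-topic (hypothetical helper)."""
    # An explicit count takes precedence over the fraction, as documented.
    n = n_errors if n_errors is not None else round(frac_error * n_samples)
    n = max(0, min(n, n_samples))  # clamp to the valid range
    rng = random.Random(random_state)  # seeded, so the choice is reproducible
    return sorted(rng.sample(range(n_samples), n))


def make_off_topic(waveform, strategy, rng, noise_level=0.5, external=None):
    """Return an off-topic replacement for one waveform (hypothetical helper)."""
    if strategy == "combined":
        # Assumption: 'combined' chooses one of the other strategies per sample.
        options = ["noise", "corrupted"]
        if external is not None:
            options.append("external")
        strategy = rng.choice(options)
    if strategy == "external":
        if external is None:
            raise ValueError("the 'external' strategy needs a contamination sample")
        return list(external)
    noise = [rng.gauss(0.0, 1.0) for _ in waveform]
    if strategy == "noise":
        return noise
    if strategy == "corrupted":
        # Assumption: noise_level is a linear mixing weight in [0, 1].
        return [(1.0 - noise_level) * x + noise_level * n for x, n in zip(waveform, noise)]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    rng = random.Random(42)
    clean = [[0.0] * 8 for _ in range(50)]
    for i in choose_contaminated_indices(len(clean), frac_error=0.1):
        clean[i] = make_off_topic(clean[i], "combined", rng)
```

With 50 samples and frac_error=0.1, the sketch marks 5 samples as off-topic; passing n_errors=3 would mark exactly 3 regardless of frac_error.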