selfclean_audio.datasets.off_topic_dataset

Members

OffTopicDataset

Dataset that creates off-topic samples from an unrelated dataset, pure noise, or heavily corrupted originals.

class selfclean_audio.datasets.off_topic_dataset.OffTopicDataset(dataset: Dataset, contamination_dataset: Dataset | None = None, frac_error: float = 0.1, n_errors: int | None = None, contamination_strategy: Literal['external', 'noise', 'corrupted', 'combined'] = 'combined', noise_level: float = 0.5, random_state: int = 42, name: str | None = None)

Dataset that creates off-topic samples by:

1. Adding samples from an unrelated dataset
2. Adding pure noise samples
3. Adding heavily corrupted versions of original samples

This follows the same approach as the image domain implementation.
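The three contamination routes listed above can be illustrated with a minimal, standard-library sketch. It is not the class's actual implementation: `make_off_topic` is a hypothetical helper, waveforms are plain lists of floats, the linear blend used for "corrupted", the unscaled Gaussian noise used for "noise", and the random pick used for "combined" are all assumptions.

```python
import random


def make_off_topic(waveform, strategy, noise_level=0.5, rng=None, external_pool=None):
    """Hypothetical helper: build one off-topic sample from a waveform."""
    rng = rng or random.Random(42)

    if strategy == "combined":
        # Assumption: 'combined' picks one of the available strategies at random.
        options = ["noise", "corrupted"]
        if external_pool:
            options.append("external")
        strategy = rng.choice(options)

    if strategy == "external":
        if not external_pool:
            raise ValueError("'external' needs a non-empty contamination pool")
        # Take a sample from the unrelated dataset.
        return list(rng.choice(external_pool))

    # Gaussian noise with the same length as the input waveform.
    noise = [rng.gauss(0.0, 1.0) for _ in waveform]

    if strategy == "noise":
        return noise
    if strategy == "corrupted":
        # Assumption: linear blend of the original and noise, weighted by noise_level.
        return [(1.0 - noise_level) * x + noise_level * n for x, n in zip(waveform, noise)]

    raise ValueError(f"unknown strategy: {strategy!r}")


clean = [0.0, 0.5, -0.5, 0.25]
pure_noise = make_off_topic(clean, "noise")
corrupted = make_off_topic(clean, "corrupted", noise_level=0.0)
external = make_off_topic(clean, "external", external_pool=[[1.0, 1.0], [2.0, 2.0]])
```

With `noise_level=0.0` the blend in this sketch leaves the waveform unchanged, which is a convenient sanity check for the weighting.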

Parameters:
  • dataset – Original clean dataset

  • contamination_dataset – External dataset for contamination (e.g., MUSAN for ESC50)

  • frac_error – Fraction of samples to contaminate

  • n_errors – Exact number of contaminated samples (overrides frac_error)

  • contamination_strategy – How to create off-topic samples: one of 'external', 'noise', 'corrupted', or 'combined' (default)

  • noise_level – Level of noise/corruption (0-1)

  • random_state – Random seed for reproducibility

  • name – Dataset name for logging
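How `frac_error`, `n_errors`, and `random_state` interact can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `resolve_n_errors` and `pick_contaminated_indices` are hypothetical helpers, rounding to the nearest integer and clamping to the dataset size are assumptions, and whether the class contaminates samples in place or appends new ones is not stated here.

```python
import random


def resolve_n_errors(n_samples, frac_error=0.1, n_errors=None):
    """Hypothetical helper: how many samples to contaminate."""
    if n_errors is not None:
        # An explicit count overrides the fraction (clamping is an assumption).
        return min(n_errors, n_samples)
    # Assumption: the fraction is rounded to the nearest integer.
    return int(round(frac_error * n_samples))


def pick_contaminated_indices(n_samples, n_contaminated, random_state=42):
    """Hypothetical helper: seeded choice of which indices are off-topic."""
    rng = random.Random(random_state)
    return sorted(rng.sample(range(n_samples), n_contaminated))


n = resolve_n_errors(200, frac_error=0.1)              # fraction only
m = resolve_n_errors(200, frac_error=0.1, n_errors=5)  # explicit count wins
first = pick_contaminated_indices(200, n)
second = pick_contaminated_indices(200, n)             # same seed, same indices
```

Fixing the seed makes the contaminated subset repeatable across runs, which is what allows results on the same synthetic contamination to be compared.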

get_errors() → list[int]

Return ground truth off-topic indicators. This matches the interface from the image domain.

Returns:

List of 0/1 flags, one per sample, where 1 marks an off-topic sample
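A 0/1 list of this shape is convenient for scoring a detector. The sketch below uses a hand-written list as a stand-in for the value returned by `get_errors()`; the ranking and the `precision_at_k` helper are hypothetical and not part of the package.

```python
def precision_at_k(ranked_indices, errors, k):
    """Share of true off-topic samples among the k highest-ranked indices."""
    top = ranked_indices[:k]
    return sum(errors[i] for i in top) / k


# Stand-in for dataset.get_errors(): 1 marks an off-topic sample.
errors = [0, 1, 0, 0, 1, 0]

# Hypothetical detector output, most suspicious sample first.
ranking = [1, 4, 0, 2, 3, 5]

# Recover the indices of the off-topic samples from the flags.
off_topic_indices = [i for i, flag in enumerate(errors) if flag == 1]

p_at_2 = precision_at_k(ranking, errors, 2)
p_at_4 = precision_at_k(ranking, errors, 4)
```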

info()

Print dataset information.