selfclean_audio.datasets.duplicate_dataset

Members

DuplicateDataset

Dataset that creates near-duplicates by appending modified versions to the original dataset.

class selfclean_audio.datasets.duplicate_dataset.DuplicateDataset(dataset: Dataset, frac_error: float = 0.1, n_errors: int | None = None, duplicate_strategy: Literal['exact', 'noisy', 'cropped', 'mixed', 'combined'] = 'exact', noise_level: float = 0.05, crop_ratio_range: tuple[float, float] = (0.1, 0.25), random_state: int = 42, name: str | None = None, save_to_temp: bool = True, temp_dir: str | None = None)

Dataset that creates near-duplicates by appending modified versions to the original dataset. This follows the same approach as the image domain implementation.

Parameters:
  • dataset – Original clean dataset

  • frac_error – Fraction of samples to duplicate

  • n_errors – Exact number of duplicates (overrides frac_error)

  • duplicate_strategy – Type of duplicate to create: 'exact', 'noisy', 'cropped', 'mixed', or 'combined'

  • noise_level – Noise level for noisy duplicates

  • crop_ratio_range – Range of crop ratios for cropped duplicates

  • random_state – Random seed for reproducibility

  • name – Dataset name for logging
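
The following is a minimal sketch of the append-style duplication described above, not the library's implementation. It assumes waveforms are plain lists of floats, that noise is additive Gaussian, and that cropping trims the tail of the signal. The function name make_duplicates and all internals are hypothetical. The 'mixed' and 'combined' strategies are omitted because their behavior is not described here.

```python
import random
from typing import List, Optional, Set, Tuple


def make_duplicates(
    samples: List[List[float]],
    frac_error: float = 0.1,
    n_errors: Optional[int] = None,
    strategy: str = "exact",
    noise_level: float = 0.05,
    crop_ratio_range: Tuple[float, float] = (0.1, 0.25),
    random_state: int = 42,
) -> Tuple[List[List[float]], Set[Tuple[int, int]]]:
    """Hypothetical helper: append modified copies of randomly chosen samples."""
    rng = random.Random(random_state)  # seeded for reproducibility
    n_original = len(samples)
    # An explicit count takes precedence over the fraction.
    n_dup = n_errors if n_errors is not None else int(frac_error * n_original)
    n_dup = min(n_dup, n_original)
    chosen = rng.sample(range(n_original), n_dup)

    augmented = [list(s) for s in samples]  # originals keep their indices
    pairs: Set[Tuple[int, int]] = set()
    for original_idx in chosen:
        wave = samples[original_idx]
        if strategy == "exact":
            dup = list(wave)
        elif strategy == "noisy":
            # Assumption: additive Gaussian noise scaled by noise_level.
            dup = [x + rng.gauss(0.0, noise_level) for x in wave]
        elif strategy == "cropped":
            # Assumption: drop a random fraction from the end of the signal.
            ratio = rng.uniform(*crop_ratio_range)
            n_drop = int(ratio * len(wave))
            dup = list(wave[: len(wave) - n_drop])
        else:
            raise ValueError(f"strategy not covered by this sketch: {strategy}")
        duplicate_idx = len(augmented)  # duplicates are appended at the end
        augmented.append(dup)
        pairs.add((original_idx, duplicate_idx))
    return augmented, pairs


clean = [[0.1 * i] * 100 for i in range(20)]
augmented, pairs = make_duplicates(clean, frac_error=0.1, strategy="cropped")
_, fixed_pairs = make_duplicates(clean, frac_error=0.5, n_errors=3)
print(len(clean), len(augmented), sorted(pairs))
```

Because duplicates are appended rather than inserted, every original sample keeps its index, and each duplicate index is at least the length of the clean dataset.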

cleanup_temp_dir() → None

Remove the temporary directory containing synthetic duplicate files.

get_errors() → tuple[set[tuple[int, int]], list[str]]

Return the ground-truth duplicate pairs. The interface matches that of the image-domain implementation.

Returns:
  • Set of (original_idx, duplicate_idx) tuples

  • List of error type names
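
As a sketch of how the returned pairs might be consumed, the snippet below scores a set of predicted duplicate pairs against ground truth of the documented shape. The helper score_pairs, the example indices, and the value in error_types are hypothetical and only illustrate the tuple layout. Treating pairs as unordered is an assumption of this example.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


def score_pairs(predicted: Iterable[Pair], truth: Iterable[Pair]) -> Tuple[float, float]:
    """Hypothetical helper: precision and recall of predicted duplicate pairs."""

    def normalise(pairs: Iterable[Pair]) -> Set[Pair]:
        # Assumption: (a, b) and (b, a) describe the same duplicate pair.
        return {(min(a, b), max(a, b)) for a, b in pairs}

    pred, gold = normalise(predicted), normalise(truth)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Stand-in for the (pairs, error_types) tuple documented above.
truth_pairs = {(3, 20), (7, 21)}
error_types = ["exact"]  # placeholder; the actual names are not documented here

predicted = {(20, 3), (5, 9)}  # one correct pair (reversed order), one false positive
precision, recall = score_pairs(predicted, truth_pairs)
print(precision, recall)
```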

info()

Print dataset information.