selfclean_audio.datasets.duplicate_dataset
Members
Dataset that creates near-duplicates by appending modified versions to the original dataset.
- class selfclean_audio.datasets.duplicate_dataset.DuplicateDataset(dataset: Dataset, frac_error: float = 0.1, n_errors: int | None = None, duplicate_strategy: Literal['exact', 'noisy', 'cropped', 'mixed', 'combined'] = 'exact', noise_level: float = 0.05, crop_ratio_range: tuple[float, float] = (0.1, 0.25), random_state: int = 42, name: str | None = None, save_to_temp: bool = True, temp_dir: str | None = None)
Dataset that creates near-duplicates by appending modified versions to the original dataset. This follows the same approach as the image domain implementation.
- Parameters:
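dataset – Original clean dataset to which duplicates are appended
frac_error – Fraction of samples to duplicate (default 0.1)
n_errors – Exact number of duplicates to create; when set, overrides frac_error (default None)
duplicate_strategy – Type of duplicate to create: one of 'exact', 'noisy', 'cropped', 'mixed', or 'combined' (default 'exact')
noise_level – Noise level for noisy duplicates (default 0.05)
crop_ratio_range – (min, max) range of crop ratios for cropped duplicates (default (0.1, 0.25))
random_state – Random seed for reproducibility (default 42)
name – Dataset name for logging (default None)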
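The interplay of these parameters can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: `resolve_n_duplicates`, `make_duplicate`, and `append_duplicates` are hypothetical helpers, waveforms are plain lists of floats, and the rounding rule, the Gaussian noise model, and cropping from the start of the signal are all assumptions. The 'mixed' and 'combined' strategies are not modeled because their semantics are not described here.

```python
import random
from typing import List, Optional, Tuple


def resolve_n_duplicates(
    n_samples: int, frac_error: float = 0.1, n_errors: Optional[int] = None
) -> int:
    """Hypothetical helper: n_errors, when given, overrides frac_error."""
    if n_errors is not None:
        return min(n_errors, n_samples)  # clamping is an assumption
    return int(frac_error * n_samples)  # truncation is an assumption


def make_duplicate(
    waveform: List[float],
    strategy: str,
    rng: random.Random,
    noise_level: float = 0.05,
    crop_ratio_range: Tuple[float, float] = (0.1, 0.25),
) -> List[float]:
    """Hypothetical helper: build one duplicate of a waveform."""
    if strategy == "exact":
        return list(waveform)
    if strategy == "noisy":
        # Assumed noise model: additive zero-mean Gaussian noise.
        return [x + rng.gauss(0.0, noise_level) for x in waveform]
    if strategy == "cropped":
        # Assumed cropping: drop a random fraction from the start.
        ratio = rng.uniform(*crop_ratio_range)
        return waveform[int(len(waveform) * ratio):]
    raise ValueError(f"strategy not modeled in this sketch: {strategy!r}")


def append_duplicates(
    samples: List[List[float]],
    frac_error: float = 0.1,
    n_errors: Optional[int] = None,
    strategy: str = "exact",
    random_state: int = 42,
) -> Tuple[List[List[float]], List[int]]:
    """Append duplicates of randomly chosen samples to the original list."""
    rng = random.Random(random_state)  # fixed seed gives reproducible picks
    n_dup = resolve_n_duplicates(len(samples), frac_error, n_errors)
    chosen = rng.sample(range(len(samples)), n_dup)
    duplicates = [make_duplicate(samples[i], strategy, rng) for i in chosen]
    return samples + duplicates, chosen
```

In this sketch the originals keep their indices and the duplicates sit at the end, so the pair `(chosen[k], len(samples) + k)` identifies each original and its copy.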
- cleanup_temp_dir() → None
Remove the temporary directory containing synthetic duplicate files.
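The general pattern behind such a cleanup method can be sketched with the standard library alone. `TempDuplicateStore` is a hypothetical stand-in, not the `DuplicateDataset` class, and the way it creates and names its directory is an assumption; only `tempfile.mkdtemp` and `shutil.rmtree` are real standard-library calls.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class TempDuplicateStore:
    """Hypothetical stand-in that keeps generated files in a temp directory."""

    def __init__(self, temp_dir: Optional[str] = None) -> None:
        # Create a fresh directory, under temp_dir if one is given,
        # otherwise under the system default temporary location.
        self.temp_dir = Path(tempfile.mkdtemp(prefix="duplicates_", dir=temp_dir))

    def write(self, name: str, payload: bytes) -> Path:
        """Write one synthetic file into the temporary directory."""
        path = self.temp_dir / name
        path.write_bytes(payload)
        return path

    def cleanup_temp_dir(self) -> None:
        """Remove the directory and its contents; safe to call repeatedly."""
        shutil.rmtree(self.temp_dir, ignore_errors=True)
```

Passing `ignore_errors=True` makes a second call a no-op in this sketch, which is convenient when cleanup runs both explicitly and from a teardown hook.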