selfclean_audio.selfclean_audio
Members
PretrainingSSL | Enum of supported self-supervised learning pretraining models.
SelfCleanAudio | Main class to clean audio datasets using pretrained SSL models and a distance-based cleaner.
create_memmap | Create a memory-mapped numpy array for storing embeddings.
create_memmap_path | Ensure memmap directory exists or create a temporary directory.
embed_dataset | Compute embeddings for all samples in a dataloader using a model.
extract_temporal_stats_batch | Extracts statistical features from a [N, T, D] tensor of audio embeddings.
- class selfclean_audio.selfclean_audio.PretrainingSSL(value)
Enum of supported self-supervised learning pretraining models.
- class selfclean_audio.selfclean_audio.SelfCleanAudio(distance_function_path: str = 'sklearn.metrics.pairwise.', distance_function_name: str = 'cosine_similarity', chunk_size: int = 100, precision_type_distance: type = <class 'numpy.float32'>, memmap: bool = True, memmap_path: str | pathlib.Path | None = None, plot_distribution: bool = False, plot_top_N: int | None = None, output_path: str | None = None, figsize: tuple = (10, 8), pretraining_ssl: selfclean_audio.selfclean_audio.PretrainingSSL = PretrainingSSL.BEATS, model_path: str | None = None, off_topic_method: str = 'lad', off_topic_params: dict | None = None, near_duplicate_method: str = 'embedding_distance', near_duplicate_params: dict | None = None, label_error_method: str = 'intra_extra_distance', label_error_params: dict | None = None, issues_to_detect: list | None = None, random_seed: int = 42, device: torch.device | str = 'cuda', lora_enable: bool = False, lora_r: int = 8, lora_alpha: int = 16, lora_dropout: float = 0.05, adapt_epochs: int = 0, adapt_lr: float = 0.0001, adapt_weight_decay: float = 0.0, adapt_temperature: float = 0.2, adapt_projection_dim: int = 256, adapt_max_steps: int | None = None, adapt_objective: str = 'infonce', vicreg_sim_coeff: float = 25.0, vicreg_var_coeff: float = 25.0, vicreg_cov_coeff: float = 1.0, adapt_sample_rate: int | None = None, adapt_strong_aug: bool = True, adapt_time_shift_max: float = 0.1, adapt_add_noise_snr_db: float = 15.0, adapt_tempo_min: float = 0.9, adapt_tempo_max: float = 1.1, adapt_pitch_semitones: float = 2.0, adapt_reverb_prob: float = 0.3, adapt_eq_prob: float = 0.4, adapt_time_mask_prob: float = 0.5, adapt_time_mask_max_ratio: float = 0.2, gradient_accumulation_steps: int = 1, **kwargs)
Main class to clean audio datasets using pretrained SSL models and a distance-based cleaner.
Initialize SelfCleanAudio with model and cleaning parameters.
- Parameters:
distance_function_path (str) – Module path for distance function.
distance_function_name (str) – Distance function name.
chunk_size (int) – Size of chunks to process.
precision_type_distance (type) – Precision for distance calculation.
memmap (bool) – Use memory-mapped arrays for embeddings.
memmap_path (Path|str|None) – Path for memmap storage.
plot_distribution (bool) – Whether to plot distance distribution.
plot_top_N (int|None) – Top N to plot.
output_path (str|None) – Path for outputs.
figsize (tuple) – Figure size for plots.
pretraining_ssl (PretrainingSSL) – SSL pretraining model enum.
model_path (str|None) – Path to pretrained SSL model. If None, falls back to the environment variable SELFCLEAN_AUDIO_MODEL_PATH.
off_topic_method (str) – Off-topic detection method (“lad”, “quantile”, “isolation_forest”, “cleanlab”).
off_topic_params (dict|None) – Parameters for the off-topic detection method.
near_duplicate_method (str) – Near duplicate detection method (“embedding_distance”, “cleanlab”, “dejavu”).
near_duplicate_params (dict|None) – Parameters for the near duplicate detection method.
label_error_method (str) – Label error detection method (“intra_extra_distance”, “cleanlab”).
label_error_params (dict|None) – Parameters for the label error detection method.
random_seed (int) – Random seed for reproducibility.
device (torch.device|str) – Device for model inference.
**kwargs – Additional arguments.
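The model_path fallback described above can be sketched as follows. This is a minimal illustration, not the library's code: resolve_model_path is a hypothetical helper, the two paths are placeholders, and what the library does when neither the argument nor the environment variable is set is not stated here.

```python
from __future__ import annotations

import os

# Environment variable name taken from the model_path parameter description.
ENV_VAR = "SELFCLEAN_AUDIO_MODEL_PATH"


def resolve_model_path(model_path: str | None = None) -> str | None:
    """Hypothetical helper: prefer the explicit argument, else the env variable."""
    if model_path is not None:
        return model_path
    # Returns None when the variable is unset; the library's behaviour
    # in that case is an open question for this sketch.
    return os.environ.get(ENV_VAR)


# Placeholder paths, used only for illustration.
os.environ[ENV_VAR] = "/tmp/example_checkpoint.pt"
explicit = resolve_model_path("/models/custom.pt")  # explicit argument wins
fallback = resolve_model_path()                      # falls back to the env variable
```

Setting the variable once per shell or job lets the model_path argument be omitted when constructing SelfCleanAudio, provided the fallback works as documented.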
- selfclean_audio.selfclean_audio.create_memmap(memmap_path: Path, memmap_file_name: str, len_dataset: int, *dims)
Create a memory-mapped numpy array for storing embeddings.
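As a rough sketch of what a function with this signature might do, the example below allocates a disk-backed array of shape (len_dataset, *dims) with numpy.memmap. The float32 dtype, the "w+" mode, and the name create_memmap_sketch are assumptions made for illustration; the library's actual choices are not documented here.

```python
import tempfile
from pathlib import Path

import numpy as np


def create_memmap_sketch(
    memmap_path: Path, memmap_file_name: str, len_dataset: int, *dims: int
) -> np.memmap:
    """Hypothetical stand-in: allocate a disk-backed array, one row per sample."""
    memmap_file = Path(memmap_path) / memmap_file_name
    # mode="w+" creates (or overwrites) the backing file, zero-filled.
    # dtype=float32 is an assumption for this sketch.
    return np.memmap(
        memmap_file, dtype=np.float32, mode="w+", shape=(len_dataset, *dims)
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    # 4 samples with 8-dimensional embeddings.
    embeddings = create_memmap_sketch(Path(tmp_dir), "embeddings.dat", 4, 8)
    embeddings[0] = 1.0   # write one embedding row
    embeddings.flush()    # push pending writes to the backing file
    shape = embeddings.shape
    first_row_sum = float(embeddings[0].sum())
    del embeddings        # release the file before the directory is removed
```

Because the array lives on disk, embeddings for datasets larger than available RAM can be written row by row and read back lazily.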
- selfclean_audio.selfclean_audio.create_memmap_path(memmap_path: str | Path | None) → Path
Ensure memmap directory exists or create a temporary directory.
- Parameters:
memmap_path (str|Path|None) – Desired memmap directory or None.
- Returns:
Path to memmap directory.
- Return type:
Path
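The documented behaviour (use the given directory, creating it if needed, or fall back to a temporary directory) could look roughly like the sketch below. create_memmap_path_sketch and the "memmap_" prefix are illustrative assumptions, not the library's implementation.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def create_memmap_path_sketch(memmap_path: str | Path | None) -> Path:
    """Hypothetical stand-in: return an existing directory for memmap files."""
    if memmap_path is None:
        # No directory requested: create a fresh temporary one.
        return Path(tempfile.mkdtemp(prefix="memmap_"))
    path = Path(memmap_path)
    # Create the directory and any missing parents; no error if it already exists.
    path.mkdir(parents=True, exist_ok=True)
    return path


temp_dir = create_memmap_path_sketch(None)              # temporary directory
nested = create_memmap_path_sketch(temp_dir / "a" / "b")  # created on demand
```

A directory made with tempfile.mkdtemp is not removed automatically, so in this sketch the caller is responsible for cleaning it up.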
- selfclean_audio.selfclean_audio.embed_dataset(dataloader: DataLoader, model: Module, normalize: bool = False, memmap: bool = True, memmap_path: str | Path | None = None, tqdm_desc: str | None = None, device: device | str = 'cpu', workdir: str = './outputs', save_plots: bool = False) → tuple[ndarray | memmap, ndarray, Tensor, Tensor]
Compute embeddings for all samples in a dataloader using a model.
- Parameters:
dataloader (DataLoader) – Dataset loader.
model (nn.Module) – Pretrained model for embeddings.
normalize (bool) – Normalize embeddings if True.
memmap (bool) – Use memory-mapped storage.
memmap_path (Path|str|None) – Path for memory map.
tqdm_desc (str|None) – Description for progress bar.
device (torch.device|str) – Device for computation.
- Returns:
(embedding array, array of file paths, tensor of labels, tensor of noisy_labels)
- Return type:
tuple[ndarray | memmap, ndarray, Tensor, Tensor]
- selfclean_audio.selfclean_audio.extract_temporal_stats_batch(embeddings: Tensor)
Extracts statistical features from a [N, T, D] tensor of audio embeddings.
- Parameters:
embeddings (torch.Tensor) – shape [N, T, D]
- Returns:
shape [N, 8], each row is the feature vector for one sample.
- Return type: