selfclean_audio.selfclean_audio#

Members

PretrainingSSL

Enum of supported self-supervised learning pretraining models.

SelfCleanAudio

Main class to clean audio datasets using pretrained SSL models and distance-based cleaner.

create_memmap

Create a memory-mapped numpy array for storing embeddings.

create_memmap_path

Ensure memmap directory exists or create a temporary directory.

embed_dataset

Compute embeddings for all samples in a dataloader using a model.

extract_temporal_stats_batch

Extract statistical features from an [N, T, D] tensor of audio embeddings.

class selfclean_audio.selfclean_audio.PretrainingSSL(value)[source]#

Enum of supported self-supervised learning pretraining models.

class selfclean_audio.selfclean_audio.SelfCleanAudio(distance_function_path: str = 'sklearn.metrics.pairwise.', distance_function_name: str = 'cosine_similarity', chunk_size: int = 100, precision_type_distance: type = <class 'numpy.float32'>, memmap: bool = True, memmap_path: str | ~pathlib.Path | None = None, plot_distribution: bool = False, plot_top_N: int | None = None, output_path: str | None = None, figsize: tuple = (10, 8), pretraining_ssl: ~selfclean_audio.selfclean_audio.PretrainingSSL = PretrainingSSL.BEATS, model_path: str | None = None, off_topic_method: str = 'lad', off_topic_params: dict | None = None, near_duplicate_method: str = 'embedding_distance', near_duplicate_params: dict | None = None, label_error_method: str = 'intra_extra_distance', label_error_params: dict | None = None, issues_to_detect: list | None = None, random_seed: int = 42, device: ~torch.device | str = 'cuda', lora_enable: bool = False, lora_r: int = 8, lora_alpha: int = 16, lora_dropout: float = 0.05, adapt_epochs: int = 0, adapt_lr: float = 0.0001, adapt_weight_decay: float = 0.0, adapt_temperature: float = 0.2, adapt_projection_dim: int = 256, adapt_max_steps: int | None = None, adapt_objective: str = 'infonce', vicreg_sim_coeff: float = 25.0, vicreg_var_coeff: float = 25.0, vicreg_cov_coeff: float = 1.0, adapt_sample_rate: int | None = None, adapt_strong_aug: bool = True, adapt_time_shift_max: float = 0.1, adapt_add_noise_snr_db: float = 15.0, adapt_tempo_min: float = 0.9, adapt_tempo_max: float = 1.1, adapt_pitch_semitones: float = 2.0, adapt_reverb_prob: float = 0.3, adapt_eq_prob: float = 0.4, adapt_time_mask_prob: float = 0.5, adapt_time_mask_max_ratio: float = 0.2, gradient_accumulation_steps: int = 1, **kwargs)[source]#

Main class to clean audio datasets using pretrained SSL models and distance-based cleaner.

Initialize SelfCleanAudio with model and cleaning parameters.

Parameters:
  • distance_function_path (str) – Module path for distance function.

  • distance_function_name (str) – Distance function name.

  • chunk_size (int) – Size of chunks to process.

  • precision_type_distance (type) – Precision for distance calculation.

  • memmap (bool) – Use memory-mapped arrays for embeddings.

  • memmap_path (Path|str|None) – Path for memmap storage.

  • plot_distribution (bool) – Whether to plot distance distribution.

  • plot_top_N (int|None) – Top N to plot.

  • output_path (str|None) – Path for outputs.

  • figsize (tuple) – Figure size for plots.

  • pretraining_ssl (PretrainingSSL) – SSL pretraining model enum.

  • model_path (str|None) – Path to pretrained SSL model. If None, will try environment variable SELFCLEAN_AUDIO_MODEL_PATH.

  • off_topic_method (str) – Off-topic detection method (“lad”, “quantile”, “isolation_forest”, “cleanlab”).

  • off_topic_params (dict|None) – Parameters for the off-topic detection method.

  • near_duplicate_method (str) – Near duplicate detection method (“embedding_distance”, “cleanlab”, “dejavu”).

  • near_duplicate_params (dict|None) – Parameters for the near duplicate detection method.

  • label_error_method (str) – Label error detection method (“intra_extra_distance”, “cleanlab”).

  • label_error_params (dict|None) – Parameters for the label error detection method.

  • issues_to_detect (list|None) – Issues to detect.

  • random_seed (int) – Random seed for reproducibility.

  • device (torch.device|str) – Device for model inference.

  • **kwargs – Additional arguments.

run_on_dataloader(dataloader: DataLoader, issues_to_detect: list | None = None, apply_l2_norm: bool = False)[source]#

Detect issues in a dataset by running the cleaner on a dataloader.

Parameters:
  • dataloader (DataLoader) – PyTorch DataLoader with audio data.

  • issues_to_detect (list[IssueTypes]|None) – Issues to detect.

  • apply_l2_norm (bool) – Whether to L2 normalize embeddings.

Returns:

Predicted issues mask or results.

Return type:

np.ndarray
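
A minimal end-to-end sketch. `my_audio_dataset` is a placeholder for a user-supplied torch `Dataset`, and the keyword values merely spell out documented defaults; only `SelfCleanAudio`, `PretrainingSSL`, and `run_on_dataloader` are documented API:

```python
from torch.utils.data import DataLoader

from selfclean_audio.selfclean_audio import PretrainingSSL, SelfCleanAudio

# Configure the cleaner; these keyword values restate the documented defaults.
cleaner = SelfCleanAudio(
    pretraining_ssl=PretrainingSSL.BEATS,
    off_topic_method="lad",
    near_duplicate_method="embedding_distance",
    label_error_method="intra_extra_distance",
    device="cpu",  # the documented default is "cuda"
)

# my_audio_dataset is a placeholder for a user-provided torch Dataset.
dataloader = DataLoader(my_audio_dataset, batch_size=16, shuffle=False)
results = cleaner.run_on_dataloader(dataloader, apply_l2_norm=True)
```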

selfclean_audio.selfclean_audio.create_memmap(memmap_path: Path, memmap_file_name: str, len_dataset: int, *dims)[source]#

Create a memory-mapped numpy array for storing embeddings.

Parameters:
  • memmap_path (Path) – Directory to store memmap file.

  • memmap_file_name (str) – Filename for memmap.

  • len_dataset (int) – Number of samples.

  • *dims – Dimensions of each embedding.

Returns:

Memory-mapped numpy array.

Return type:

np.memmap
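
A plausible NumPy equivalent of what this helper does, for orientation; the dtype and open mode are assumptions, and `create_memmap_sketch` is an illustrative stand-in, not the library function:

```python
import tempfile
from pathlib import Path

import numpy as np

def create_memmap_sketch(memmap_path: Path, memmap_file_name: str,
                         len_dataset: int, *dims,
                         dtype=np.float32) -> np.memmap:
    """Allocate a writable on-disk array of shape (len_dataset, *dims)."""
    memmap_path.mkdir(parents=True, exist_ok=True)
    file_path = memmap_path / memmap_file_name
    return np.memmap(file_path, dtype=dtype, mode="w+",
                     shape=(len_dataset, *dims))

# e.g. room for 100 samples of 768-dim embeddings
tmp = Path(tempfile.mkdtemp())
emb = create_memmap_sketch(tmp, "embeddings.dat", 100, 768)
emb[0] = 1.0  # writes go straight to the backing file
```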

selfclean_audio.selfclean_audio.create_memmap_path(memmap_path: str | Path | None) Path[source]#

Ensure memmap directory exists or create a temporary directory.

Parameters:

memmap_path (str|Path|None) – Desired memmap directory or None.

Returns:

Path to memmap directory.

Return type:

Path
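
The documented behaviour (use the given directory, or fall back to a temporary one) can be sketched as follows; `create_memmap_path_sketch` is an illustrative stand-in, not the library function:

```python
import tempfile
from pathlib import Path

def create_memmap_path_sketch(memmap_path) -> Path:
    """Return an existing directory for memmap files.

    If memmap_path is None, fall back to a fresh temporary directory.
    """
    if memmap_path is None:
        return Path(tempfile.mkdtemp(prefix="selfclean_memmap_"))
    path = Path(memmap_path)
    path.mkdir(parents=True, exist_ok=True)  # create it if missing
    return path

base = Path(tempfile.mkdtemp())
explicit = create_memmap_path_sketch(base / "emb_cache")  # created on demand
fallback = create_memmap_path_sketch(None)                # temp dir
```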

selfclean_audio.selfclean_audio.embed_dataset(dataloader: DataLoader, model: Module, normalize: bool = False, memmap: bool = True, memmap_path: str | Path | None = None, tqdm_desc: str | None = None, device: device | str = 'cpu', workdir: str = './outputs', save_plots: bool = False) tuple[ndarray | memmap, ndarray, Tensor, Tensor][source]#

Compute embeddings for all samples in a dataloader using a model.

Parameters:
  • dataloader (DataLoader) – Dataset loader.

  • model (nn.Module) – Pretrained model for embeddings.

  • normalize (bool) – Normalize embeddings if True.

  • memmap (bool) – Use memory-mapped storage.

  • memmap_path (Path|str|None) – Path for memory map.

  • tqdm_desc (str|None) – Description for progress bar.

  • device (torch.device|str) – Device for computation.

  • workdir (str) – Working directory for outputs.

  • save_plots (bool) – Whether to save plots.

Returns:

(embedding array, array of file paths, tensor of labels, tensor of noisy_labels)

Return type:

tuple
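
The core loop can be sketched framework-agnostically: preallocate the output array, run the model batch by batch, and optionally L2-normalize. Here the "model" is a stand-in callable on NumPy arrays; the real function takes a torch `DataLoader` and `nn.Module` and also returns label tensors:

```python
import numpy as np

def embed_dataset_sketch(batches, model, embed_dim, normalize=False):
    """Embed an iterable of (batch, paths) pairs into one [N, D] array."""
    n_total = sum(len(batch) for batch, _ in batches)
    embeddings = np.empty((n_total, embed_dim), dtype=np.float32)
    paths = []
    offset = 0
    for batch, batch_paths in batches:
        emb = model(batch)                       # [B, D] per batch
        if normalize:                            # unit-length rows
            emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        embeddings[offset:offset + len(batch)] = emb
        paths.extend(batch_paths)
        offset += len(batch)
    return embeddings, np.array(paths)

# Toy "model": project raw audio frames to a 4-dim embedding.
rng = np.random.default_rng(0)
proj = rng.standard_normal((16000, 4)).astype(np.float32)
model = lambda batch: batch @ proj

batches = [(rng.standard_normal((8, 16000)).astype(np.float32),
            [f"clip_{i}.wav" for i in range(8)]) for _ in range(3)]
emb, paths = embed_dataset_sketch(batches, model, embed_dim=4, normalize=True)
```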

selfclean_audio.selfclean_audio.extract_temporal_stats_batch(embeddings: Tensor)[source]#

Extract statistical features from an [N, T, D] tensor of audio embeddings.

Parameters:

embeddings (torch.Tensor) – shape [N, T, D]

Returns:

shape [N, 8], each row is the feature vector for one sample.

Return type:

torch.Tensor
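
One way to reduce [N, T, D] embeddings to an [N, 8] feature matrix: pool over the feature axis, then take summary statistics over time. The exact eight features the library uses are an assumption here, and the sketch is in NumPy for clarity; the real function operates on a `torch.Tensor`:

```python
import numpy as np

def temporal_stats_sketch(embeddings: np.ndarray) -> np.ndarray:
    """[N, T, D] -> [N, 8]: summary statistics of the per-frame magnitude curve."""
    frame = np.linalg.norm(embeddings, axis=2)   # [N, T] per-frame magnitude
    delta = np.diff(frame, axis=1)               # [N, T-1] frame-to-frame change
    stats = [
        frame.mean(axis=1), frame.std(axis=1),
        frame.min(axis=1), frame.max(axis=1),
        np.median(frame, axis=1),
        delta.mean(axis=1), delta.std(axis=1),
        np.abs(delta).max(axis=1),
    ]
    return np.stack(stats, axis=1).astype(np.float32)  # [N, 8]

x = np.random.default_rng(1).standard_normal((5, 20, 64))
feats = temporal_stats_sketch(x)
```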