selfclean_audio.selfclean_audio#

Members

PretrainingSSL

Enum of supported self-supervised learning pretraining models.

SelfCleanAudio

Main class to clean audio datasets using pretrained SSL models and distance-based cleaner.

create_memmap

Create a memory-mapped numpy array for storing embeddings.

create_memmap_path

Ensure memmap directory exists or create a temporary directory.

embed_dataset

Compute embeddings for all samples in a dataloader using a model.

extract_temporal_stats_batch

Extract statistical features from an [N, T, D] tensor of audio embeddings.

class selfclean_audio.selfclean_audio.PretrainingSSL(value)[source]#

Enum of supported self-supervised learning pretraining models.

class selfclean_audio.selfclean_audio.SelfCleanAudio(distance_function_path: str = 'sklearn.metrics.pairwise.', distance_function_name: str = 'cosine_similarity', chunk_size: int = 100, precision_type_distance: type = <class 'numpy.float32'>, memmap: bool = True, memmap_path: str | ~pathlib.Path | None = None, plot_distribution: bool = False, plot_top_N: int | None = None, output_path: str | None = None, figsize: tuple = (10, 8), pretraining_ssl: ~selfclean_audio.selfclean_audio.PretrainingSSL = PretrainingSSL.BEATS, model_path: str | None = None, off_topic_method: str = 'lad', off_topic_params: dict | None = None, near_duplicate_method: str = 'embedding_distance', near_duplicate_params: dict | None = None, label_error_method: str = 'intra_extra_distance', label_error_params: dict | None = None, issues_to_detect: list | None = None, random_seed: int = 42, device: ~torch.device | str = 'cuda', lora_enable: bool = False, lora_r: int = 8, lora_alpha: int = 16, lora_dropout: float = 0.05, adapt_epochs: int = 0, adapt_lr: float = 0.0001, adapt_weight_decay: float = 0.0, adapt_temperature: float = 0.2, adapt_projection_dim: int = 256, adapt_max_steps: int | None = None, adapt_objective: str = 'infonce', vicreg_sim_coeff: float = 25.0, vicreg_var_coeff: float = 25.0, vicreg_cov_coeff: float = 1.0, adapt_sample_rate: int | None = None, adapt_strong_aug: bool = True, adapt_time_shift_max: float = 0.1, adapt_add_noise_snr_db: float = 15.0, adapt_tempo_min: float = 0.9, adapt_tempo_max: float = 1.1, adapt_pitch_semitones: float = 2.0, adapt_reverb_prob: float = 0.3, adapt_eq_prob: float = 0.4, adapt_time_mask_prob: float = 0.5, adapt_time_mask_max_ratio: float = 0.2, gradient_accumulation_steps: int = 1, **kwargs)[source]#

Main class to clean audio datasets using pretrained SSL models and distance-based cleaner.

Initialize SelfCleanAudio with model and cleaning parameters.

Parameters:
  • distance_function_path (str) – Module path for distance function.

  • distance_function_name (str) – Distance function name.

  • chunk_size (int) – Size of chunks to process.

  • precision_type_distance (type) – Precision for distance calculation.

  • memmap (bool) – Use memory-mapped arrays for embeddings.

  • memmap_path (Path|str|None) – Path for memmap storage.

  • plot_distribution (bool) – Whether to plot distance distribution.

  • plot_top_N (int|None) – Top N to plot.

  • output_path (str|None) – Path for outputs.

  • figsize (tuple) – Figure size for plots.

  • pretraining_ssl (PretrainingSSL) – SSL pretraining model enum.

  • model_path (str|None) – Path to pretrained SSL model. If None, will try environment variable SELFCLEAN_AUDIO_MODEL_PATH.

  • off_topic_method (str) – Off-topic detection method (“lad”, “quantile”, “isolation_forest”, “cleanlab”).

  • off_topic_params (dict|None) – Parameters for the off-topic detection method.

  • near_duplicate_method (str) – Near duplicate detection method (“embedding_distance”, “cleanlab”, “dejavu”).

  • near_duplicate_params (dict|None) – Parameters for the near duplicate detection method.

  • label_error_method (str) – Label error detection method (“intra_extra_distance”, “cleanlab”).

  • label_error_params (dict|None) – Parameters for the label error detection method.

  • issues_to_detect (list|None) – Issues to detect.

  • random_seed (int) – Random seed for reproducibility.

  • device (torch.device|str) – Device for model inference.

  • **kwargs – Additional arguments.

run_on_dataloader(dataloader: DataLoader, issues_to_detect: list | None = None, apply_l2_norm: bool = False)[source]#

Detect issues in a dataset by running the cleaner on a dataloader.

Parameters:
  • dataloader (DataLoader) – PyTorch DataLoader with audio data.

  • issues_to_detect (list[IssueTypes]|None) – Issues to detect.

  • apply_l2_norm (bool) – Whether to L2 normalize embeddings.

Returns:

Predicted issues mask or results.

Return type:

np.ndarray
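
A minimal end-to-end sketch. `my_audio_dataset` is a placeholder for a user-supplied torch `Dataset`, and the keyword values merely spell out documented defaults; only `SelfCleanAudio`, `PretrainingSSL`, and `run_on_dataloader` are documented API:

```python
from torch.utils.data import DataLoader

from selfclean_audio.selfclean_audio import PretrainingSSL, SelfCleanAudio

# Configure the cleaner; these keyword values restate the documented defaults.
cleaner = SelfCleanAudio(
    pretraining_ssl=PretrainingSSL.BEATS,
    off_topic_method="lad",
    near_duplicate_method="embedding_distance",
    label_error_method="intra_extra_distance",
    device="cpu",  # the documented default is "cuda"
)

# my_audio_dataset is a placeholder for a user-provided torch Dataset.
dataloader = DataLoader(my_audio_dataset, batch_size=16, shuffle=False)
results = cleaner.run_on_dataloader(dataloader, apply_l2_norm=True)
```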

selfclean_audio.selfclean_audio.create_memmap(memmap_path: Path, memmap_file_name: str, len_dataset: int, *dims)[source]#

Create a memory-mapped numpy array for storing embeddings.

Parameters:
  • memmap_path (Path) – Directory to store memmap file.

  • memmap_file_name (str) – Filename for memmap.

  • len_dataset (int) – Number of samples.

  • *dims – Dimensions of each embedding.

Returns:

Memory-mapped numpy array.

Return type:

np.memmap
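
A plausible NumPy equivalent of what this helper does, for orientation; the dtype and open mode are assumptions, and `create_memmap_sketch` is an illustrative stand-in, not the library function:

```python
import tempfile
from pathlib import Path

import numpy as np

def create_memmap_sketch(memmap_path: Path, memmap_file_name: str,
                         len_dataset: int, *dims,
                         dtype=np.float32) -> np.memmap:
    """Allocate a writable on-disk array of shape (len_dataset, *dims)."""
    memmap_path.mkdir(parents=True, exist_ok=True)
    file_path = memmap_path / memmap_file_name
    return np.memmap(file_path, dtype=dtype, mode="w+",
                     shape=(len_dataset, *dims))

# e.g. room for 100 samples of 768-dim embeddings
tmp = Path(tempfile.mkdtemp())
emb = create_memmap_sketch(tmp, "embeddings.dat", 100, 768)
emb[0] = 1.0  # writes go straight to the backing file
```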

selfclean_audio.selfclean_audio.create_memmap_path(memmap_path: str | Path | None) Path[source]#

Ensure memmap directory exists or create a temporary directory.

Parameters:

memmap_path (str|Path|None) – Desired memmap directory or None.

Returns:

Path to memmap directory.

Return type:

Path
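
The documented behaviour (use the given directory, or fall back to a temporary one) can be sketched as follows; `create_memmap_path_sketch` is an illustrative stand-in, not the library function:

```python
import tempfile
from pathlib import Path

def create_memmap_path_sketch(memmap_path) -> Path:
    """Return an existing directory for memmap files.

    If memmap_path is None, fall back to a fresh temporary directory.
    """
    if memmap_path is None:
        return Path(tempfile.mkdtemp(prefix="selfclean_memmap_"))
    path = Path(memmap_path)
    path.mkdir(parents=True, exist_ok=True)  # create it if missing
    return path

base = Path(tempfile.mkdtemp())
explicit = create_memmap_path_sketch(base / "emb_cache")  # created on demand
fallback = create_memmap_path_sketch(None)                # temp dir
```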

selfclean_audio.selfclean_audio.embed_dataset(dataloader: DataLoader, model: Module, normalize: bool = False, memmap: bool = True, memmap_path: str | Path | None = None, tqdm_desc: str | None = None, device: device | str = 'cpu', workdir: str = './outputs', save_plots: bool = False) tuple[ndarray | memmap, ndarray, Tensor, Tensor][source]#

Compute embeddings for all samples in a dataloader using a model.

Parameters:
  • dataloader (DataLoader) – Dataset loader.

  • model (nn.Module) – Pretrained model for embeddings.

  • normalize (bool) – Normalize embeddings if True.

  • memmap (bool) – Use memory-mapped storage.

  • memmap_path (Path|str|None) – Path for memory map.

  • tqdm_desc (str|None) – Description for progress bar.

  • device (torch.device|str) – Device for computation.

  • workdir (str) – Working directory for outputs.

  • save_plots (bool) – Whether to save plots.

Returns:

(embedding array, array of file paths, tensor of labels, tensor of noisy_labels)

Return type:

tuple
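
The core loop can be sketched framework-agnostically: preallocate the output array, run the model batch by batch, and optionally L2-normalize. Here the "model" is a stand-in callable on NumPy arrays; the real function takes a torch `DataLoader` and `nn.Module` and also returns label tensors:

```python
import numpy as np

def embed_dataset_sketch(batches, model, embed_dim, normalize=False):
    """Embed an iterable of (batch, paths) pairs into one [N, D] array."""
    n_total = sum(len(batch) for batch, _ in batches)
    embeddings = np.empty((n_total, embed_dim), dtype=np.float32)
    paths = []
    offset = 0
    for batch, batch_paths in batches:
        emb = model(batch)                       # [B, D] per batch
        if normalize:                            # unit-length rows
            emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        embeddings[offset:offset + len(batch)] = emb
        paths.extend(batch_paths)
        offset += len(batch)
    return embeddings, np.array(paths)

# Toy "model": project raw audio frames to a 4-dim embedding.
rng = np.random.default_rng(0)
proj = rng.standard_normal((16000, 4)).astype(np.float32)
model = lambda batch: batch @ proj

batches = [(rng.standard_normal((8, 16000)).astype(np.float32),
            [f"clip_{i}.wav" for i in range(8)]) for _ in range(3)]
emb, paths = embed_dataset_sketch(batches, model, embed_dim=4, normalize=True)
```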

selfclean_audio.selfclean_audio.extract_temporal_stats_batch(embeddings: Tensor)[source]#

Extract statistical features from an [N, T, D] tensor of audio embeddings.

Parameters:

embeddings (torch.Tensor) – shape [N, T, D]

Returns:

shape [N, 8], each row is the feature vector for one sample.

Return type:

torch.Tensor
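
One way to reduce [N, T, D] embeddings to an [N, 8] feature matrix: pool over the feature axis, then take summary statistics over time. The exact eight features the library uses are an assumption here, and the sketch is in NumPy for clarity; the real function operates on a `torch.Tensor`:

```python
import numpy as np

def temporal_stats_sketch(embeddings: np.ndarray) -> np.ndarray:
    """[N, T, D] -> [N, 8]: summary statistics of the per-frame magnitude curve."""
    frame = np.linalg.norm(embeddings, axis=2)   # [N, T] per-frame magnitude
    delta = np.diff(frame, axis=1)               # [N, T-1] frame-to-frame change
    stats = [
        frame.mean(axis=1), frame.std(axis=1),
        frame.min(axis=1), frame.max(axis=1),
        np.median(frame, axis=1),
        delta.mean(axis=1), delta.std(axis=1),
        np.abs(delta).max(axis=1),
    ]
    return np.stack(stats, axis=1).astype(np.float32)  # [N, 8]

x = np.random.default_rng(1).standard_normal((5, 20, 64))
feats = temporal_stats_sketch(x)
```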