Getting Started
This tutorial provides a step-by-step guide on how to use selfclean-audio to detect issues in an example dataset.
1. Installation
First, you need to install the selfclean-audio library. You can do this using pip:
pip install selfclean-audio
2. Prepare your dataset
For this tutorial, we will use a small example dataset of audio files. You can create a directory with a few audio files (e.g., .wav files) and a metadata.csv file with the following format:
filename,label
audio1.wav,cat
audio2.wav,dog
audio3.wav,cat
...
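Before running detection, it can help to confirm that the metadata file matches the files on disk. The following is a minimal sketch, not part of selfclean-audio: `check_metadata` is a hypothetical helper, and it assumes that each `filename` entry is a path relative to the dataset directory.

```python
import csv
from pathlib import Path


def check_metadata(root, meta_path):
    """Return a list of problems found in a filename,label metadata file.

    Assumption: filenames in the CSV are relative to `root`.
    """
    root = Path(root)
    problems = []
    with open(meta_path, newline="") as handle:
        reader = csv.DictReader(handle)
        # The header must contain both columns shown in the example above.
        missing_cols = {"filename", "label"} - set(reader.fieldnames or [])
        if missing_cols:
            return [f"missing column(s): {sorted(missing_cols)}"]
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            name = (row["filename"] or "").strip()
            label = (row["label"] or "").strip()
            if not name:
                problems.append(f"line {line_no}: empty filename")
            elif not (root / name).is_file():
                problems.append(f"line {line_no}: file not found: {name}")
            if not label:
                problems.append(f"line {line_no}: empty label")
    return problems
```

Calling `check_metadata("/path/to/your/dataset", "/path/to/your/metadata.csv")` returns an empty list when every row has a label and points at an existing file.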
3. Run issue detection
Now you can use the selfclean-audio command-line interface (CLI) to detect issues in your dataset. Here's an example command to detect near-duplicates:
python -m selfclean_audio \
--config config/templates/file_template.py \
--output-dir outputs/my_dataset_duplicates \
datamodule.train_dataset.root=/path/to/your/dataset \
datamodule.train_dataset.meta_path=/path/to/your/metadata.csv \
MODEL_TYPE=beats \
ISSUE_TYPE=duplicates \
near_duplicate_method=embedding_distance
Replace /path/to/your/dataset and /path/to/your/metadata.csv with the actual paths to your dataset directory and metadata file.
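If you prefer to launch the run from a Python script, the same command can be assembled as an argument list. This is only a sketch: `build_command` is a hypothetical helper that reuses the flags and overrides from the command above unchanged, and it keeps the relative config path as-is, so the working directory still matters.

```python
import sys


def build_command(root, meta_path, output_dir):
    """Assemble the near-duplicate detection command as an argument list."""
    return [
        # Use the current interpreter instead of whatever `python` is on PATH.
        sys.executable, "-m", "selfclean_audio",
        "--config", "config/templates/file_template.py",
        "--output-dir", str(output_dir),
        f"datamodule.train_dataset.root={root}",
        f"datamodule.train_dataset.meta_path={meta_path}",
        "MODEL_TYPE=beats",
        "ISSUE_TYPE=duplicates",
        "near_duplicate_method=embedding_distance",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if the CLI exits with a non-zero status.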
4. Analyze the results
The results will be saved in the outputs/my_dataset_duplicates directory. You can find the following files:
Score-duplicates.csv: A CSV file with the duplicate scores for each pair of audio files.
config.log: A log file with the configuration used for the run.
You can then use the scores in Score-duplicates.csv to identify and remove the near-duplicates from your dataset.
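One way to shortlist candidate pairs for review is to filter the score file by a threshold. The sketch below is hedged on several points this tutorial does not specify: the column names in Score-duplicates.csv, a sensible threshold, and whether lower or higher scores indicate a likelier duplicate. Inspect the file header first, then pass the actual score column name; `select_pairs` and its parameters are hypothetical.

```python
import csv


def select_pairs(score_path, score_column, threshold, lower_is_duplicate=True):
    """Return rows whose score crosses the threshold, most suspicious first.

    Assumptions: `score_column` names a numeric column in the CSV, and
    `lower_is_duplicate` reflects how the scores should be read.
    """
    with open(score_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if lower_is_duplicate:
        hits = [r for r in rows if float(r[score_column]) <= threshold]
    else:
        hits = [r for r in rows if float(r[score_column]) >= threshold]
    # Sort so that the strongest candidates come first in either direction.
    return sorted(
        hits,
        key=lambda r: float(r[score_column]),
        reverse=not lower_is_duplicate,
    )
```

Listening to the top few returned pairs before deleting anything is a cheap way to check that the chosen threshold and direction match what the scores mean.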