# Getting Started

This tutorial walks through using `selfclean-audio` to detect issues in an example dataset, step by step.

## 1. Installation

First, install the `selfclean-audio` library with pip:

```bash
pip install selfclean-audio
```
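
To confirm that the package is visible to the interpreter you plan to use, a quick check along these lines may help. This is a minimal sketch: the helper `is_installed` is a hypothetical name, and the module name `selfclean_audio` is taken from the `python -m selfclean_audio` command used later in this tutorial.

```python
from importlib.util import find_spec


def is_installed(module_name="selfclean_audio"):
    """Return True if the current interpreter can locate the module.

    `is_installed` is a hypothetical helper for this tutorial,
    not part of the selfclean-audio library.
    """
    return find_spec(module_name) is not None
```

Calling `is_installed()` in the same environment where you ran `pip install` should return `True`; if it returns `False`, the package was likely installed into a different environment.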

## 2. Prepare your dataset

This tutorial uses a small example dataset: a directory containing a few audio files (e.g., `.wav` files) and a `metadata.csv` file in the following format:

```csv
filename,label
audio1.wav,cat
audio2.wav,dog
audio3.wav,cat
...
```
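
If you do not have a metadata file yet, one way to generate it is sketched below. It uses only the Python standard library and assumes that all `.wav` files sit directly in one directory. The function `write_metadata` and its `label_for` callback are hypothetical names introduced for illustration; how labels are derived depends entirely on your data.

```python
import csv
from pathlib import Path


def write_metadata(audio_dir, meta_path, label_for):
    """Write a `filename,label` CSV covering every .wav file in audio_dir.

    `label_for` is a hypothetical callback that maps a file name to its
    label; replace it with whatever labelling logic fits your dataset.
    Returns the number of audio files written.
    """
    audio_dir = Path(audio_dir)
    # Sort the names so the output is stable across runs.
    wav_files = sorted(p.name for p in audio_dir.glob("*.wav"))
    with open(meta_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "label"])  # header from the format above
        for name in wav_files:
            writer.writerow([name, label_for(name)])
    return len(wav_files)
```

For example, `write_metadata("my_dataset", "my_dataset/metadata.csv", lambda name: "cat")` would label every file as `cat`, which is only useful as a placeholder.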

## 3. Run issue detection

Now you can use the `selfclean-audio` command-line interface (CLI) to detect issues in your dataset. Here's an example command to detect near-duplicates:

```bash
python -m selfclean_audio \
  --config config/templates/file_template.py \
  --output-dir outputs/my_dataset_duplicates \
  datamodule.train_dataset.root=/path/to/your/dataset \
  datamodule.train_dataset.meta_path=/path/to/your/metadata.csv \
  MODEL_TYPE=beats \
  ISSUE_TYPE=duplicates \
  near_duplicate_method=embedding_distance
```

Replace `/path/to/your/dataset` and `/path/to/your/metadata.csv` with the actual paths to your dataset directory and metadata file.
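
Before launching a run, it can be worth checking that every file listed in the metadata actually exists under the dataset directory. The sketch below is one way to do that with the standard library. It assumes the `filename` column from step 2 holds paths relative to the dataset root; `find_missing_files` is a hypothetical helper, not part of the `selfclean-audio` CLI.

```python
import csv
from pathlib import Path


def find_missing_files(dataset_root, meta_path):
    """Return the filenames listed in the metadata CSV that are not
    present under dataset_root.

    Assumes a `filename` column whose entries are relative to dataset_root.
    """
    root = Path(dataset_root)
    with open(meta_path, newline="") as f:
        reader = csv.DictReader(f)
        return [
            row["filename"]
            for row in reader
            if not (root / row["filename"]).is_file()
        ]
```

An empty list means the metadata and the directory agree; any names returned point to missing or misnamed files.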

## 4. Analyze the results

The results are saved in the `outputs/my_dataset_duplicates` directory, which contains the following files:

-   `Score-duplicates.csv`: A CSV file with the duplicate scores for each pair of audio files.
-   `config.log`: A log file with the configuration used for the run.

You can then use the scores in `Score-duplicates.csv` to identify and remove the near-duplicates from your dataset.
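
As a starting point for that review, the sketch below ranks the rows of the score file and returns the strongest candidates. The column layout of `Score-duplicates.csv` is not documented in this tutorial, so the score column name is passed in as a parameter, and whether low or high scores indicate likely duplicates is left as a switch. Both are assumptions to verify against the header and contents of your own file. `top_candidates` is a hypothetical helper.

```python
import csv


def top_candidates(score_path, score_column, k=10, lowest_first=True):
    """Return the k rows with the most extreme values in score_column.

    score_column: name of the score column (assumption: inspect the CSV
        header to find the real name).
    lowest_first: set to False if higher scores mean "more likely duplicate"
        in your output (assumption: check a few known pairs to decide).
    """
    with open(score_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Scores are read as strings, so convert to float for sorting.
    rows.sort(key=lambda r: float(r[score_column]), reverse=not lowest_first)
    return rows[:k]
```

Listening to the top-ranked pairs before deleting anything is a sensible precaution, since a score alone does not confirm that two files are duplicates.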