# Music genre classifier

As an example of using models that work with audio, we will build a music genre classifier. For this we will use the GTZAN dataset, a dataset of 1000 audio samples labeled with their music genre.

## Installing the libraries
To run this notebook we need to install the following libraries:

In [38]:
%pip install transformers datasets librosa soundfile torch accelerate evaluate scikit-learn youtube-dl

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Loading the dataset

In [3]:
from datasets import load_dataset

gtzan = load_dataset("marsyas/gtzan", "all", trust_remote_code=True)
gtzan

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 999
    })
})

As we can see, the loaded dataset contains 999 audio samples labeled with their music genre.

The audio clips are sampled at 22050 Hz. To process them with the model we need to resample them to the rate the model expects (usually 16 kHz). We do this with the `Audio` class of the datasets library.

In [4]:
from datasets import Audio

gtzan = gtzan.cast_column("audio", Audio(sampling_rate=16000))
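To make the effect of the cast concrete, the sketch below computes how many array elements a clip occupies at each sampling rate. The 30-second duration is an assumption borrowed from the clip length used later in this notebook, and `num_samples` is a hypothetical helper, not part of the datasets library.

```python
# Illustrative arithmetic only: how the sampling rate determines array length.
# `num_samples` is a hypothetical helper, not a datasets/librosa function.

def num_samples(duration_s: float, sampling_rate: int) -> int:
    """Number of amplitude values needed to store `duration_s` seconds of audio."""
    return int(duration_s * sampling_rate)


original = num_samples(30.0, 22050)   # sampling rate of the GTZAN files
resampled = num_samples(30.0, 16000)  # sampling rate expected by the model

print(original, resampled)
```

Resampling therefore shortens each array without changing the duration of the clip.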

## Creating the `test` dataset

To evaluate the model we need a test dataset. We will split the dataset into two parts, one to train the model and another to evaluate it.

In [5]:
gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)
gtzan

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 899
    })
    test: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 100
    })
})

Once the dataset is split in two, the test dataset contains 100 audio samples.
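The split sizes can be reproduced by hand. The following is a minimal sketch of a seeded shuffle-and-split, assuming the test size is rounded up; it is not the datasets library's implementation, and the indices it selects will differ from those chosen by `train_test_split`. The helper names are hypothetical.

```python
import math
import random


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test portion up (assumption)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


def shuffled_split(n_samples: int, test_size: float, seed: int):
    """Shuffle the row indices with a fixed seed, then cut them in two."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train, _ = split_sizes(n_samples, test_size)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = shuffled_split(999, 0.1, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, so the accuracy figures can be compared between runs.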

Next we display one sample from the training split.

In [6]:
gtzan['train'][0]

{'file': '/home/jupyter-carlesgm/.cache/huggingface/datasets/downloads/extracted/7351848f2b55271153f05ec967056b01e90a80dc93690ceb485b797291327357/genres/pop/pop.00098.wav',
 'audio': {'path': '/home/jupyter-carlesgm/.cache/huggingface/datasets/downloads/extracted/7351848f2b55271153f05ec967056b01e90a80dc93690ceb485b797291327357/genres/pop/pop.00098.wav',
  'array': array([ 0.0873509 ,  0.20183384,  0.4790867 , ..., -0.18743178,
         -0.23294401, -0.13517427]),
  'sampling_rate': 16000},
 'genre': 7}

Each sample of the dataset contains the following fields:
- `file`: The path to the audio file.
- `audio`: A dictionary with the file `path`, the decoded `array` and the `sampling_rate`. Each element of `array` is the amplitude of the waveform at one instant. Because the sampling rate is 16000 Hz, the array holds 16000 elements per second of audio.
- `genre`: The music genre as an integer. The `int2str()` method of the `genre` feature returns the genre in readable form.

In [9]:
int2str = gtzan["train"].features["genre"].int2str
int2str(gtzan['train'][0]['genre'])

'pop'
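Conceptually, the label feature behaves like a lookup in an ordered list of names. The sketch below illustrates that idea with plain Python. The alphabetical ordering of the ten genres is an assumption, chosen because it is consistent with index 7 mapping to `'pop'` above, and the function names are hypothetical.

```python
# Assumed genre order (alphabetical); consistent with 7 -> 'pop' in this notebook.
GENRES = [
    "blues", "classical", "country", "disco", "hiphop",
    "jazz", "metal", "pop", "reggae", "rock",
]


def int2str_sketch(index: int) -> str:
    """Map an integer class id to its genre name."""
    return GENRES[index]


def str2int_sketch(name: str) -> int:
    """Map a genre name back to its integer class id."""
    return GENRES.index(name)


print(int2str_sketch(7), str2int_sketch("pop"))
```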

## Testing the untrained model

Before we start training, we will test the untrained model to see how it behaves. We will use `distilhubert`, a pre-trained audio model that is lightweight to fine-tune.

To use the model we will rely on the `pipeline` function of the transformers library.

In [11]:
from transformers import pipeline
import torch

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

classifier = pipeline(
    "audio-classification", model="ntu-spml/distilhubert",
    batch_size=16,
    device=device
)

Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda


In [12]:
classifier(gtzan['train'][0]['audio'])

[{'score': 0.5036259889602661, 'label': 'LABEL_0'},
 {'score': 0.4963740110397339, 'label': 'LABEL_1'}]

Let us compute the accuracy of the untrained model, using the test dataset.

First we compute the model predictions for every sample in the test dataset.

Then we display the predictions for the first test sample.

In [13]:
predictions = [classifier(sample['audio']) for sample in gtzan['test']]
predictions[0]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[{'score': 0.5022980570793152, 'label': 'LABEL_0'},
 {'score': 0.49770188331604004, 'label': 'LABEL_1'}]

Once we have the model predictions, we compare them with the true labels to compute the accuracy of the model.

The accuracy is shown below.

In [14]:
from sklearn.metrics import accuracy_score

y_true = [f"LABEL_{sample['genre']}" for sample in gtzan['test']]
y_pred = [prediction[0]['label'] for prediction in predictions]

accuracy_score(y_true, y_pred)

0.07
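The low score can be understood by computing accuracy by hand: it is the fraction of positions where the predicted and true labels coincide, and a head that only emits `LABEL_0` or `LABEL_1` can never match samples from the other genres. The toy lists below are made up for illustration and are not taken from the notebook's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy data: one sample per genre, and a model that always answers LABEL_0.
toy_true = [f"LABEL_{i}" for i in range(10)]
toy_pred = ["LABEL_0"] * 10

print(accuracy(toy_true, toy_pred))
```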


## Preprocessing for training

As we can see, the untrained model reaches an accuracy of only 7%, which is very low. This is because the model has not been trained on the GTZAN dataset: its classification head is newly initialized, as the warning above notes, and it only outputs two labels (`LABEL_0` and `LABEL_1`) while the dataset has ten genres.

To train the model we will use the `Trainer` class of the transformers library. This class lets us train models in a simple and efficient way.

Whereas other models need a `Tokenizer`, in this case we will use a `feature_extractor`. This class processes the audio samples and converts them into a format the model can process.

Next we create the `feature_extractor` that we will use to train the model.

In [15]:
from transformers import AutoFeatureExtractor

model_id = "ntu-spml/distilhubert"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)

Next we process the audio samples, converting them into a format the model can process. In our case we truncate the audio samples to 30 seconds using the `max_length` and `truncation` options of the `feature_extractor`, and we drop the columns we do not need through the `remove_columns` argument of `map`.

In [16]:
max_duration = 30.0


def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs
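As a rough mental model of what the feature extractor does with `truncation=True` and `do_normalize=True`, the sketch below cuts a waveform to a maximum length and rescales it to zero mean and unit variance. It is a simplified approximation in plain Python, not the library's implementation. The function names, the `eps` value and the toy input are assumptions.

```python
import math


def truncate(samples, max_length):
    """Keep at most `max_length` samples from the start of the clip."""
    return samples[:max_length]


def normalize(samples, eps=1e-7):
    """Rescale to zero mean and (approximately) unit variance."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return [(x - mean) / math.sqrt(var + eps) for x in samples]


def preprocess_sketch(samples, sampling_rate, max_duration):
    """Truncate to `max_duration` seconds, then normalize."""
    max_length = int(sampling_rate * max_duration)
    return normalize(truncate(samples, max_length))


# Toy clip: 6 samples at 2 Hz (3 seconds), truncated to 2 seconds.
out = preprocess_sketch([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], sampling_rate=2, max_duration=2.0)
print(len(out))
```

Normalizing each clip keeps quiet and loud recordings on a comparable scale before they reach the model.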

In [17]:
gtzan_encoded = gtzan.map(
    preprocess_function,
    remove_columns=["audio", "file"],
    batched=True,
    batch_size=16,
    num_proc=1,
)
gtzan_encoded

DatasetDict({
    train: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 899
    })
    test: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 100
    })
})

We rename the `genre` column to `label` so that the `Trainer` can identify it as the label column.

In [18]:
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")

Finally, before we start training, we create dictionaries with the correspondence between genre names and their integer values, so that predictions can be reported with readable genre names and we can switch quickly between the two formats.

In [19]:
id2label = {
    str(i): int2str(i)
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}

id2label["7"]

'pop'

## Training the model

Next we create the model that we are going to train.

In [20]:
from transformers import AutoModelForAudioClassification

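# Pass the label mappings so predictions use genre names instead of LABEL_n
model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=len(id2label),
    label2id=label2id,
    id2label=id2label,
)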

Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next we create the `TrainingArguments`, which let us configure how the `Trainer` trains the model.

In [21]:
from transformers import TrainingArguments

model_name = model_id.split("/")[-1]
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 3

training_args = TrainingArguments(
    f"{model_name}-finetuned-gtzan",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    report_to="none",
)
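To see what these settings imply, the sketch below estimates the number of optimizer steps and the learning rate at a given step. It assumes a single device, 899 training samples as in the split above, and the default linear schedule with warmup. The helper names are hypothetical and the formulas are a simplified approximation of what the `Trainer` computes.

```python
import math


def training_steps(n_train, batch_size, grad_accum, epochs):
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = training_steps(n_train=899, batch_size=8, grad_accum=1, epochs=3)
warmup = math.ceil(total * 0.1)  # warmup_ratio=0.1
print(total, warmup, linear_schedule_lr(warmup, 5e-5, warmup, total))
```

Under these assumptions the learning rate peaks at `5e-5` once the warmup ends and then decays towards zero at the last step.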



Next we define the evaluation metric and create the `Trainer`, the class in charge of training the model.

In [22]:
import evaluate
import numpy as np

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
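The same computation can be written without any library, which may help to see what `compute_metrics` does: take the index of the largest logit in each row and compare it with the reference label. The toy logits and labels below are invented for illustration, and the function names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy batch: 4 samples, 3 classes.
toy_logits = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
toy_labels = [1, 0, 2, 1]

print(accuracy_from_logits(toy_logits, toy_labels))
```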

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.7902 | 1.697441 | 0.55 |
| 2 | 1.2559 | 1.262839 | 0.67 |
| 3 | 1.2586 | 1.177871 | 0.72 |


TrainOutput(global_step=339, training_loss=1.5991955141050627, metrics={'train_runtime': 593.5659, 'train_samples_per_second': 4.544, 'train_steps_per_second': 0.571, 'total_flos': 1.8401964823872e+17, 'train_loss': 1.5991955141050627, 'epoch': 3.0})

In [None]:
trainer.evaluate()

{'eval_loss': 1.1778711080551147,
 'eval_accuracy': 0.72,
 'eval_runtime': 18.0335,
 'eval_samples_per_second': 5.545,
 'eval_steps_per_second': 0.721,
 'epoch': 3.0}

## Using the trained model

First we create the pipeline with the model we have trained to classify music genres.

In [25]:
music_classifier = pipeline(
    "audio-classification",
    model=model,
    feature_extractor=feature_extractor,
    batch_size=16,
    device=device
)

Device set to use cuda


As an alternative, we can use a model similar to ours that has already been fine-tuned.

In [None]:
# Select the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

music_classifier = pipeline(
    "audio-classification",
    model="ihanif/distilhubert-music-gtzan-classification",
    batch_size=16,
    device=device
)

Device set to use cuda


## Classifying songs

Once the new pipeline is created, we can use it to classify a song. The pipeline takes the path to the song we want to classify and returns the candidate genres with their scores.

First we need some music to classify. Below are some links to free music that you can download and use in this notebook:

- [Classical music](https://freemusicarchive.org/genre/Classical)
- [Electronic music](https://freemusicarchive.org/genre/Electronic)
- [Pop music](https://freemusicarchive.org/genre/Pop)
- [Rock music](https://freemusicarchive.org/genre/Rock)
- [Jazz music](https://freemusicarchive.org/genre/Jazz)

We also provide direct links to some `samples` that you can download with the requests library:


In [35]:
import requests

musica = [
    "https://archive.org/download/the-offspring-albums/1994-SMASH/08%20-%20Self%20Esteem.mp3",
    
    "https://archive.org/download/daft-punk-instant-crush-feat.-julian-casablancas/Daft%20Punk%20-%20Instant%20Crush%20%28Feat.%20Julian%20Casablancas%29.mp3",

    "https://archive.org/download/IBR_1515/01.%20Debbie%20Does%20Dallas%20theme%20%28cover%29.mp3",

    "https://archive.org/download/100-hits-rock-jukebox-2016/100%20Hits%20-%20Rock%20Jukebox%20%5BDisc%201%5D%20%282016%29/01.%20Don%27t%20Stop%20Believin%27.mp3",
        
    "https://archive.org/download/geniesduclassique_vol1no12/1-09%20Concerto%20Brandeburghese%20No.3%20-%20Adagio.mp3",

    "https://archive.org/download/don-omar/Don%20Omar/2003%20-%20The%20Last%20Don%20%28European%20Edition%29/02.%20Dale%20Don%20M%C3%A1s%20Duro%20%28Feat.%20Glory%29.mp3",

    "http://cdn-data.motu.com/media/mx4/demo-audio/mp3/track01.mp3"
]

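filenames = []

for url in musica:
    filename = url.split("/")[-1]

    try:
        # Download first so a failed request does not leave an empty file behind
        response = requests.get(url, timeout=60)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error downloading {url}: {e}")
        continue

    with open(filename, "wb") as f:
        f.write(response.content)
    filenames.append(filename)

filenames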

Error downloading https://archive.org/download/the-offspring-albums/1994-SMASH/08%20-%20Self%20Esteem.mp3: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Error downloading https://archive.org/download/daft-punk-instant-crush-feat.-julian-casablancas/Daft%20Punk%20-%20Instant%20Crush%20%28Feat.%20Julian%20Casablancas%29.mp3: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Error downloading https://archive.org/download/IBR_1515/01.%20Debbie%20Does%20Dallas%20theme%20%28cover%29.mp3: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Error downloading https://archive.org/download/100-hits-rock-jukebox-2016/100%20Hits%20-%20Rock%20Jukebox%20%5BDisc%201%5D%20%282016%29/01.%20Don%27t%20Stop%20Believin%27.mp3: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Error downloading https://archive.org/download/geniesduclassique_vol1no12/1-09%20Concerto%20Brandeburghese%20No.3%20-

['track01.mp3']
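Taking the last URL segment as the filename keeps percent-encoded characters such as `%20` in the name. As an optional refinement, the sketch below decodes them with the standard library. The helper `filename_from_url` is hypothetical and is not used elsewhere in this notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL with percent-escapes decoded."""
    path = urlsplit(url).path
    # Decode after extracting the name so an encoded "/" cannot add directories
    return unquote(PurePosixPath(path).name)


print(filename_from_url(
    "https://archive.org/download/the-offspring-albums/1994-SMASH/08%20-%20Self%20Esteem.mp3"
))
print(filename_from_url("http://cdn-data.motu.com/media/mx4/demo-audio/mp3/track01.mp3"))
```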

In [39]:
for filename in filenames:
    print(f"Classification of {filename}: {music_classifier(filename)}")


Classification of track01.mp3: [{'score': 0.9156602621078491, 'label': 'hiphop'}, {'score': 0.06189543008804321, 'label': 'jazz'}, {'score': 0.008564403280615807, 'label': 'reggae'}, {'score': 0.0032467523124068975, 'label': 'metal'}, {'score': 0.0030874384101480246, 'label': 'blues'}, {'score': 0.0026964631397277117, 'label': 'disco'}, {'score': 0.002555758925154805, 'label': 'classical'}, {'score': 0.0016435383586212993, 'label': 'pop'}, {'score': 0.0003317997034173459, 'label': 'country'}, {'score': 0.0003180901112500578, 'label': 'rock'}]
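The pipeline returns a list of dictionaries with a `score` and a `label` each. A small helper can reduce that list to a single genre. The sketch below reuses the three highest scores from the output above; the function `top_prediction` and its `min_score` threshold are hypothetical additions for illustration.

```python
def top_prediction(predictions, min_score=0.5):
    """Return the label with the highest score, or None if it is below `min_score`."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None


# First three entries of the classifier output shown above.
example = [
    {"score": 0.9156602621078491, "label": "hiphop"},
    {"score": 0.06189543008804321, "label": "jazz"},
    {"score": 0.008564403280615807, "label": "reggae"},
]

print(top_prediction(example))
```

A threshold like `min_score` is one possible way to flag songs for which the model is not confident in any genre.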
