Sonotype Model

Sonotype Model#

Can you check the password using keyboard sounds?#

Introduction#

The TrAItor has attemped to break into the model and steal our sensitive information. But, you’ve intercepted an audio recording of the TrAItor typing their password. Can you decipher the password using only the keystroke sounds?

Objective#

Analyze the audio recordings to determine the exact password. Use audio analysis and pattern recognition to extract the password from the key press sounds. Submit the correct password to complete the challenge.

The training code in this notebook is based on the following GitHub gist by Seonjin Kim.

In what follows we train a HuBERT model to classify an augmented version of the dataset we created in the previous step. For your convenience, the data was packed into .parquet files and uploaded as a Hugging Face dataset to the Hugging Face Hub: 🤗christopher/sonotype.

HuBERT is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.

HuBERT was fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using Wav2Vec2CTCTokenizer.

CRUCIBLE_API_KEY = ""
CHALLENGE = "sonotype"
CRUCIBLE_URL = "https://crucible.dreadnode.io"
CHALLENGE_URL = "https://sonotype.crucible.dreadnode.io"
ARTIFACT_FILES = ['recordings.tar.gz']

from datasets import load_dataset, Audio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification, TrainingArguments, Trainer, EarlyStoppingCallback
import evaluate
import numpy as np
import torch

dset = load_dataset("christopher/sonotype", data_dir="data/augmented-keystrokes", split="train")
dset = dset.train_test_split(test_size=0.13, seed=42)

dset

DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 6065
    })
    test: Dataset({
        features: ['audio', 'label'],
        num_rows: 907
    })
})

dset["train"][0]

{'audio': {'path': 'm_155.wav',
  'array': array([-0.00132515, -0.00076744, -0.00011702, ..., -0.00078037,
         -0.00104127, -0.0006526 ]),
  'sampling_rate': 22050},
 'label': 25}

model_id = "ntu-spml/distilhubert"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs
sampling_rate = feature_extractor.sampling_rate
max_duration = 1.0

dset = dset.cast_column("audio", Audio(sampling_rate=sampling_rate))
dset_encoded = dset.map(
    preprocess_function,
    remove_columns=["audio"],
    batched=True,
    batch_size=100,
    num_proc=1,
)

id2label_fn = dset["train"].features["label"].int2str

id2label = {
    str(i): id2label_fn(i)
    for i in range(len(dset_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)

id2label["17"]

'enter'

Model Training#

model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    ignore_mismatched_sizes=True,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

model_name = model_id.split("/")[-1]
batch_size = 16
gradient_accumulation_steps = 1
num_train_epochs = 30

training_args = TrainingArguments(
    f"{model_name}",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.2,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=dset_encoded["train"],
    eval_dataset=dset_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=4)]
)

trainer.train()

Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/mnt/1da05489-3812-4f15-a6e5-c8d3c57df39e/infosec/env/lib/python3.10/site-packages/transformers/training_args.py:1559: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/tmp/ipykernel_3311231/3391942307.py:30: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(

[ 4560/11400 07:05 < 10:38, 10.70 it/s, Epoch 12/30]

Epoch	Training Loss	Validation Loss	Accuracy
1	3.697800	3.699639	0.056229
2	3.181400	3.131329	0.181918
3	2.568000	2.476310	0.456450
4	1.834700	1.768614	0.717751
5	1.161500	0.884553	0.940463
6	0.225400	0.259641	0.974642
7	0.030000	0.033597	0.996692
8	0.007900	0.010157	1.000000
9	0.003400	0.005673	1.000000
10	0.002200	0.004759	1.000000
11	0.001700	0.003772	1.000000
12	0.001200	0.003159	1.000000

TrainOutput(global_step=4560, training_loss=1.1897149357617947, metrics={'train_runtime': 425.9118, 'train_samples_per_second': 427.201, 'train_steps_per_second': 26.766, 'total_flos': 7.5137157902752e+16, 'train_loss': 1.1897149357617947, 'epoch': 12.0})

Prediction#

We now combine the logits of all password characters across the 5 tries at our disposal:

model = AutoModelForAudioClassification.from_pretrained(trainer.state.best_model_checkpoint)
dset = load_dataset("christopher/sonotype", data_dir="data/keystrokes", split="train")
logits = []
for data_dir in ["data/password1",
                 "data/password2",
                 "data/password3",
                 "data/password4",
                 "data/password5"
                ]:
    password = load_dataset("christopher/sonotype", data_dir=data_dir, split="train")
    password = password.cast_column("audio", Audio(sampling_rate=sampling_rate))
    password_encoded = password.map(
    preprocess_function,
    remove_columns=["audio"],
    batched=True,
    batch_size=100,
    num_proc=1,
)
    l = []
    for row in password_encoded:
        input_values = torch.tensor(row["input_values"]).unsqueeze(0)
        att = torch.tensor(row["attention_mask"]).unsqueeze(0)
        l.append(model(input_values, att).logits.detach())
    logits.append(torch.stack(l).squeeze())
password_tokens = torch.stack(logits).sum(axis=0)
password_tokens = torch.topk(password_tokens, 1).indices.numpy()
password_decoded = ""
for token in password_tokens:
    password_decoded += id2label[str(token[0])].replace("space", " ").replace("semic", ";").replace("enter", "").replace("period", ".")
password_decoded

'h4k3r k3ystr0k3s'

The password appears to be a leetspeak version of “hacker keystrokes”. Fitting.

import requests


def query(input_data):
    response = requests.post(
        f"{CHALLENGE_URL}/score",
        headers={"X-API-Key": CRUCIBLE_API_KEY},
        json={"data": input_data},
    )
    return response.json()

query("h4k3r k3ystr0k3s")

{'error': 'Incorrect. Try again.'}

Surprisingly, and much to our dismay during the competition itself, the submission password seems to erroneously contain an additional character that was not in the audio. It turns out that the endpoint is expecting h4ck3r k3ystr0k3s instead of the decoded h4k3r k3ystr0k3s. Submitting the correct password does the trick:

query("h4ck3r k3ystr0k3s")["flag"][:25]

'gAAAAABnSh7gYlC9nKrTE3RiH'