Hugging Face - Tutorial Summaries
About
So Hugging Face has some amazing tutorials, and I'm going to try to summarize what I learned.
Hugging Face - Dataset Tutorial
There's a great tutorial for Hugging Face Datasets...
I've summarized it below:
Python and PyTorch Warning
It's important that all these installations are done in the same Python environment. PyTorch and others don't seem to play well with later versions of Python, so we are using Python 3.9.18.
$ python --version
$ brew update
$ brew install pyenv
$ pyenv install 3.9.18
$ python --version

... and you may have to add something to your .zshrc / .bashrc file (ask ChatGPT, or see the sketch below).
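For reference, these are the standard shell-setup lines from the pyenv README (your pyenv version's README may differ slightly), plus `pyenv global` so that `python` actually resolves to 3.9.18:

# Add to ~/.zshrc or ~/.bashrc (standard pyenv setup; adjust for your shell):
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

Then, in a new shell:

$ pyenv global 3.9.18
$ python --version   # Should now report Python 3.9.18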
Now it's time for tricky PyTorch:

$ pip install torch torchvision torchaudio
$ python -c "import torch; print(torch.__version__)"

NOTE: For my MacBook Pro 2023, the pip command (`pip3 install torch torchvision torchaudio`) didn't work initially, so I had to install Anaconda (which was also weird), then install Python 3.9.18 (also tricky), then run:
$ conda create -n pytorch-env python=3.9.18
$ conda activate pytorch-env
$ conda install pytorch::pytorch torchvision torchaudio -c pytorch

Test with:
$ python -c "import torch; print(torch.__version__)"Install and Test 'Datasets'
Install and Test 'Datasets'

$ pip install datasets
$ python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"
$ pip install transformers
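The transformers install above isn't tested; a quick sanity check in the same spirit:

$ python -c "import transformers; print(transformers.__version__)"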
Sample Python Script to Load a Dataset

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
print(dataset)  # Prints: "DatasetDict({train: ... , validation: ..., test: ...})"
training_dataset = dataset['train']  # Or `load_dataset("rotten_tomatoes", split="train")`
print(training_dataset)  # Prints: "Dataset({features: ['text', 'label'], num_rows: 8530})"
print(training_dataset[0])  # Prints: "{'text': 'the rock is destined ...', 'label': 1}"
print(training_dataset[-1]['text'])  # Prints: "things got weird (last entry) ..."

Know Your Dataset
There are two types of dataset objects: a regular Dataset and an IterableDataset. Use the latter when the dataset is too big to fit in memory; it streams examples as you iterate instead of downloading everything up front. It looks like this:
from datasets import load_dataset

iterable_dataset = load_dataset("food101", split="train", streaming=True)
for example in iterable_dataset:
    print(example)
    break
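A small extra (not in the tutorial): an IterableDataset also supports take(), which is handy for peeking at a few streamed examples without a manual break:

from datasets import load_dataset

iterable_dataset = load_dataset("food101", split="train", streaming=True)
for example in iterable_dataset.take(3):  # Stream just the first 3 examples.
    print(example["label"])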
Preprocess > Tokenizing Text

In addition to loading datasets, 🤗 Datasets' other main goal is to offer a diverse set of preprocessing functions to get a dataset into an appropriate format for training with your machine learning framework. Models cannot process raw text, so you'll need to convert the text into numbers. Tokenization provides a way to do this by dividing text into individual words called tokens. Tokens are finally converted to numbers.
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("rotten_tomatoes", split="train")

# Demo the tokenizer on the first row of text in the dataset:
print(tokenizer(dataset[0]["text"]), '\n\n')
# Prints: "{'input_ids': [101, 199, ...], 'token_type_ids': [0, 0, ...], 'attention_mask': [1, 1, ...]}"

# Map the whole dataset:
def tokenization(example):
    # Function to help tokenize the whole dataset.
    return tokenizer(example["text"])

dataset_tokenized = dataset.map(tokenization, batched=True)
print(dataset_tokenized, '\n\n')
# Prints: "Dataset({ features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], num_rows: 8530})"

dataset_tokenized.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
print(dataset_tokenized.format['type'])  # Prints: "torch"
print(dataset_tokenized[-1])
# Prints: "{'label': tensor(0), 'input_ids': tensor([ 101, 2477, ..."
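One thing the snippet above doesn't cover: the tokenized rows have different lengths, so batching them into a PyTorch DataLoader needs padding. A minimal sketch (my addition, not from the tutorial) using transformers' DataCollatorWithPadding, which pads each batch to its longest sequence:

from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("rotten_tomatoes", split="train")

# Tokenize with truncation; leave padding to the collator.
dataset_tokenized = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)
dataset_tokenized = dataset_tokenized.remove_columns(["text"])  # Keep only tokenizer outputs + label.
dataset_tokenized.set_format("torch")

collator = DataCollatorWithPadding(tokenizer=tokenizer)  # Pads each batch dynamically.
dataloader = DataLoader(dataset_tokenized, batch_size=8, shuffle=True, collate_fn=collator)

batch = next(iter(dataloader))
print(batch["input_ids"].shape)  # e.g. torch.Size([8, <longest sequence in this batch>])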
Evaluate Predictions

from datasets import list_metrics
from datasets import load_metric

# See what metrics are available:
metrics_list = list_metrics()
print(len(metrics_list))  # Prints: "184".
print(metrics_list)  # Prints: "['accuracy', 'bertscore', 'bleu', 'bleurt', ...]".

# Pick our metric: load the metric associated with the MRPC dataset from the GLUE benchmark.
metric = load_metric('glue', 'mrpc')
print(metric.inputs_description)
# Prints: "Compute GLUE evaluation metric associated to each GLUE dataset. Args: predictions: ... references: ...
#          Returns: "accuracy", "f1", "pearson", "spearmanr", "matthews_correlation" ".

# Once you have loaded a metric, you are ready to use it to evaluate a model's predictions.
# Provide the model predictions and references to compute():
# model_predictions = model(model_inputs)
# final_score = metric.compute(predictions=model_predictions, references=gold_references)
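Heads-up: list_metrics() / load_metric() are deprecated in newer versions of 🤗 Datasets; metrics have moved to the separate evaluate library (pip install evaluate). A rough equivalent sketch, with dummy values just to show the call shape:

import evaluate

metric = evaluate.load("glue", "mrpc")  # Same MRPC metric from the GLUE benchmark.
results = metric.compute(predictions=[0, 1, 1], references=[0, 1, 0])
print(results)  # Prints a dict with 'accuracy' and 'f1'.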
Create Dataset

Note: If working with image or audio files, you need to create a directory of files plus a metadata.csv that specifies which files to load (see the sketch below).
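A sketch of what that looks like for images (paths and column names here are made up): metadata.csv must include a file_name column, and any extra columns become dataset features. The imagefolder builder picks it all up (audiofolder is the audio equivalent):

# my_images/
#   metadata.csv   <- e.g. columns: file_name, caption
#   0001.jpg
#   0002.jpg
from datasets import load_dataset

ds = load_dataset("imagefolder", data_dir="my_images", split="train")
print(ds[0])  # Prints something like: "{'image': <PIL.Image ...>, 'caption': '...'}"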
There are a few methods to create a dataset... `from_generator()`, `from_dict()`:
- Using from_generator():
from datasets import Dataset

def gen():
    yield {"pokemon": "bulbasaur", "type": "grass"}
    yield {"pokemon": "squirtle", "type": "water"}

ds = Dataset.from_generator(gen)
print(ds[0])  # Prints: "{'pokemon': 'bulbasaur', 'type': 'grass'}"
- Using from_dict():
from datasets import Dataset

ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
print(ds[0])  # Prints: "{'pokemon': 'bulbasaur', 'type': 'grass'}"

... and there is also a loading script option. See: Create a dataset loading script.
You can share/upload a dataset Manually (drag and drop on web UI) or Programmatically. Either way you'll need to start by creating a HuggingFace account.
Manually:
- In HuggingFace click on your profile > New Dataset (and name a new repo).
- Click the Files and versions tab to add a file. Select Add file; many text data extensions are supported (.csv, .json, .jsonl, .txt) plus some audio and image formats (.mp3 and .jpg). Drag and drop your dataset files and commit.
- Click Dataset card, fill out the README.md with the template, and commit.
Programmatically:
$ pip install huggingface_hub
$ huggingface-cli login
- The second command will prompt you to go to huggingface.co/settings/tokens and generate a write token for yourself.
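Once logged in, pushing a dataset from Python is one call (the repo id below is just a placeholder):

from datasets import Dataset

ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
ds.push_to_hub("your-username/pokemon-types")  # Placeholder repo id; creates the dataset repo on the Hub.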
Links
- HuggingFace.co - Official website
- Hugging Face Wikipedia - Wikipedia entry.