Hugging Face - Tutorial Summaries


About

This page is a child of: Hugging Face


So Hugging Face has some amazing tutorials, and I'm going to try to summarize what I learned.


Hugging Face - Dataset Tutorial

There's a great tutorial for Hugging Face Datasets...

I've summarized it below:

Python and PyTorch Warning

It's important that all these installations are done in the same Python environment. PyTorch and others don't seem to play well with later versions of Python, so we are using Python 3.9.18.

$ python --version
$ brew update
$ brew install pyenv
$ pyenv install 3.9.18
$ pyenv global 3.9.18   # Make 3.9.18 the default version pyenv exposes.
$ python --version      # Should now report 3.9.18.

... and you may have to add the pyenv init lines to your .zshrc / .bashrc file so your shell picks up the pyenv-installed Python (ask ChatGPT if stuck).
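
For reference, the lines pyenv asks you to add to .zshrc / .bashrc usually look something like this (the exact lines depend on your pyenv version and shell, so treat this as a sketch):

export PYENV_ROOT="$HOME/.pyenv"
[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

Now it's time for tricky PyTorch: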

$ pip install torch torchvision torchaudio
$ python -c "import torch; print(torch.__version__)"

NOTE: For my MacBook Pro (2023), the pip command (`pip3 install torch torchvision torchaudio`) didn't work initially, so I had to install Anaconda (which was also a bit weird), then install Python 3.9.18 (also tricky) and run:

$ conda create -n pytorch-env python=3.9.18
$ conda activate pytorch-env
$ conda install pytorch::pytorch torchvision torchaudio -c pytorch

Test with:

$ python -c "import torch; print(torch.__version__)"

Install and Test 'Datasets'

$ pip install datasets
$ python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"

$ pip install transformers

Sample Python Script to Load a Dataset

from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")
print(dataset)              # Prints: "DatasetDict({train: ... , validation: ..., test: ...})"

training_dataset = dataset['train'] # Or `load_dataset("rotten_tomatoes", split="train")`
print(training_dataset)     # Prints: "Dataset({features: ['text', 'label'], num_rows: 8530})"
print(training_dataset[0])  # Prints: "{'text': 'the rock is destined ...', 'label': 1}"
print(training_dataset[-1]['text'])  # Prints: "things got weird (last entry) ..."

Know Your Dataset

There are two types of dataset objects: a regular Dataset and an IterableDataset. Use the latter when the dataset is too big to fit in memory (with streaming=True, examples are streamed as you iterate instead of being downloaded up front). It will look like this:

from datasets import load_dataset

iterable_dataset = load_dataset("food101", split="train", streaming=True)
for example in iterable_dataset:
    print(example)
    break

Preprocess > Tokenizing Text

In addition to loading datasets, 🤗 datasets' other main goal is to offer a diverse set of preprocessing functions to get a dataset into an appropriate format for training with your machine learning framework. Models cannot process raw text, so you’ll need to convert the text into numbers. Tokenization provides a way to do this by dividing text into individual words called tokens. Tokens are finally converted to numbers.

from transformers import AutoTokenizer
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("rotten_tomatoes", split="train")
# Demo tokenizer on the first row of text in the dataset:
print(tokenizer(dataset[0]["text"]), '\n\n')  # Prints: "{'input_ids': [101, 199, ...], 'token_type_ids': [0, 0, ...], 'attention_mask': [1, 1, ...]}"
# Map whole dataset:
def tokenization(example):  # Function to help tokenize whole dataset.
    return tokenizer(example["text"])
dataset_tokenized = dataset.map(tokenization, batched=True)
print(dataset_tokenized, '\n\n')  # Prints: "Dataset({ features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], num_rows: 8530})"
dataset_tokenized.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
print(dataset_tokenized.format['type'])  # Prints: "torch"
print(dataset_tokenized[-1])  # Prints: "{'label': tensor(0), 'input_ids': tensor([  101,  2477, ..."
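
Since set_format(type="torch") returns PyTorch tensors, the dataset can be fed straight to a PyTorch DataLoader. A rough sketch (not from the tutorial - the batch size of 8 is an arbitrary example, and DataCollatorWithPadding is used so variable-length rows can be padded and stacked into one batch):

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

collator = DataCollatorWithPadding(tokenizer=tokenizer)  # Pads each batch to its longest example.
dataloader = DataLoader(dataset_tokenized, batch_size=8, collate_fn=collator)
batch = next(iter(dataloader))
print(batch["input_ids"].shape)  # e.g. "torch.Size([8, <longest sequence in the batch>])"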

Evaluate Predictions

from datasets import list_metrics
from datasets import load_metric  # NOTE: newer versions of 🤗 Datasets have moved these into the separate `evaluate` library.
metrics_list = list_metrics()

# See what metrics are available:
print(len(metrics_list))  # Prints: "184".
print(metrics_list)       # Prints: "['accuracy', 'bertscore', 'bleu', 'bleurt', ...]".

# Pick our metric:
metric = load_metric('glue', 'mrpc')  # Load the metric associated with the MRPC dataset from the GLUE benchmark.
print(metric.inputs_description)  # Prints: "Compute GLUE evaluation metric associated to each GLUE dataset. Args: predictions: ... references: ... Returns: "accuracy", "f1", "pearson", "spearmanr", "matthews_correlation" ".

# Once you have loaded a metric, you are ready to use it to evaluate a model's predictions. Provide the model predictions and references to compute():
#   model_predictions = model(model_inputs)
#   final_score = metric.compute(predictions=model_predictions, references=gold_references)
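
# A toy, runnable sketch with made-up values (not real model output), just to
# show the shape of the compute() call:
dummy_predictions = [0, 1, 1, 0]
dummy_references = [0, 1, 0, 0]
print(metric.compute(predictions=dummy_predictions, references=dummy_references))
# Prints something like: "{'accuracy': 0.75, 'f1': 0.666...}"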

Create Dataset

Note: If working with image or audio files, you need to put them in a directory together with a metadata.csv that specifies the files to load (see the sketch below).
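
As a rough sketch (the folder name, file names, and labels here are made up), an image folder with a metadata.csv can look like this:

my_image_dataset/
    metadata.csv      # Must have a "file_name" column, e.g. rows like "img_0001.jpg,cat" under a "file_name,label" header.
    img_0001.jpg
    img_0002.jpg

... and can then be loaded with the generic "imagefolder" builder:

from datasets import load_dataset
ds = load_dataset("imagefolder", data_dir="my_image_dataset", split="train")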

There are a few methods to create a dataset... `from_generator()` , `from_dict()`:

  1. Using from_generator():
from datasets import Dataset
def gen():
    yield {"pokemon": "bulbasaur", "type": "grass"}
    yield {"pokemon": "squirtle", "type": "water"}
ds = Dataset.from_generator(gen)
print(ds[0])  # Prints: "{'pokemon': 'bulbasaur', 'type': 'grass'}"
  2. Using from_dict():
from datasets import Dataset
ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
print(ds[0])  # Prints: "{'pokemon': 'bulbasaur', 'type': 'grass'}"

... and there is also a loading script option. See: Create a dataset loading script.

Share a Dataset to the Hub

You can share/upload a dataset manually (drag and drop in the web UI) or programmatically. Either way you'll need to start by creating a Hugging Face account.

Manually:

  • In HuggingFace click on your profile > New Dataset (and name a new repo).
  • Click the Files and versions tab to add a file. Select Add file - many text data extensions are supported (.csv, .json, .jsonl, .txt) plus some audio and image formats (.mp3 and .jpg). Drag and drop your dataset files and commit.
  • Click Dataset card and fill out the README.md with the template, then commit.

Programmatically:

$ pip install huggingface_hub
$ huggingface-cli login
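
After logging in, a dataset can be pushed from Python with push_to_hub(). A minimal sketch (the repo id "your-username/your-dataset-name" is a placeholder - use your own):

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
dataset.push_to_hub("your-username/your-dataset-name")  # Creates/updates the dataset repo on the Hub.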


Links