PyTorch implementation of the BERT model

Google's BERT has a strong official TensorFlow implementation, and now a PyTorch version is here! With a single run of the conversion script, you can get a PyTorch model that achieves results similar to, or better than, the original.

Last week, Google open-sourced the official TensorFlow code and pretrained models for BERT, its strongest NLP model, which attracted a lot of attention.

Now PyTorch users get their turn: a team called Hugging Face recently released an op-for-op PyTorch reimplementation of Google's official TensorFlow library for the BERT model:

https://github.com/huggingface/pytorch-pretrained-BERT

This implementation can load any pretrained TensorFlow checkpoint for BERT (specifically Google's official pretrained models) and provides a conversion script.

BERT-base and BERT-large have 110M and 340M parameters, respectively, and it is difficult to fine-tune them on a single GPU at the recommended batch size and still get good performance. To help with fine-tuning, this repo also provides several techniques that can be activated in the fine-tuning scripts: gradient accumulation, multi-GPU training, and distributed training.

The results are as follows:

On the sequence-level MRPC classification task, this implementation reproduces the original implementation's 84%-88% accuracy using the smaller BERT-base model.

On the token-level SQuAD task, this implementation reproduces the original implementation's 88.52 F1 result using the smaller BERT-base model.

The authors say they are working on reproducing the results on other tasks as well as on larger BERT models.

PyTorch implementation of the BERT model

This repository contains an op-for-op PyTorch reimplementation of the official TensorFlow repository for Google's BERT model. Google's official repository is published alongside the BERT paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.

This implementation can load any pretrained TensorFlow checkpoint (specifically Google's pretrained models) for BERT and provides a conversion script (see below).

Additionally, we will be adding model code for the multilingual and Chinese versions later this week.

Script: Load any TensorFlow checkpoint

Using the convert_tf_checkpoint_to_pytorch.py script, you can convert any TensorFlow checkpoint of BERT (in particular, the official pretrained models released by Google) into a PyTorch save file.

This script takes as input the TensorFlow checkpoint (the three files starting with bert_model.ckpt) and the associated configuration file (bert_config.json), creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint into the PyTorch model, and saves the resulting model in a standard PyTorch save file that can be loaded with torch.load() (see examples in extract_features.py, run_classifier.py and run_squad.py).

Just run this conversion script once to get a PyTorch model. Then, you can ignore the TensorFlow checkpoint (the three files starting with bert_model.ckpt), but be sure to keep the configuration file (bert_config.json) and vocabulary file (vocab.txt), as these files are also required by the PyTorch model.

To run this specific conversion script, you need to have TensorFlow and PyTorch installed. The rest of the library only requires PyTorch.

Here is an example of the conversion process for a pretrained BERT-Base Uncased model:

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

python convert_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin
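
Once the command above has run, a quick way to sanity-check the output is to load it back with torch.load(). This is a minimal sketch assuming the dump path from the example; the exact format of the saved object (a plain state dict or a full pickled model) depends on the conversion script, so the code handles either case:

import torch

# Load the converted checkpoint on CPU. Depending on the script version this is
# either a plain state dict of tensors or a pickled model object.
obj = torch.load("/path/to/bert/uncased_L-12_H-768_A-12/pytorch_model.bin",
                 map_location="cpu")
print(type(obj))

# If it is a state dict, list a few parameter names and shapes as a sanity check.
if isinstance(obj, dict):
    for name, tensor in list(obj.items())[:5]:
        print(name, tuple(tensor.shape))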

You can download Google's pretrained models here:

https://github.com/google-research/bert#pre-trained-models

PyTorch model for BERT

In this library, we provide three PyTorch models, which you can find in modeling.py:

BertModel - Basic BERT Transformer model

BertForSequenceClassification - BERT model with sequence classification head on top

BertForQuestionAnswering - BERT model with a token-level classification head on top

Below are some details of each type of model.

1. BertModel

BertModel is the basic BERT Transformer model, consisting of a layer of summed token, position, and sequence embeddings, followed by a series of identical self-attention blocks (12 blocks for BERT-base, 24 blocks for BERT-large).

The inputs and outputs are the same as those of the TensorFlow model.

Specifically, the input to the model is:

input_ids: a torch.LongTensor of shape [batch_size, sequence_length] containing the token indices in the vocabulary

token_type_ids: an optional torch.LongTensor of shape [batch_size, sequence_length] with token type indices selected in [0, 1]. Type 0 corresponds to sentence A and type 1 corresponds to sentence B.

attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It is a mask to use when some sequences in a batch are shorter than the maximum length (1 for real tokens, 0 for padding).

The output of the model is a tuple consisting of:

all_encoder_layers: a list of torch.FloatTensors of size [batch_size, sequence_length, hidden_size], containing the full sequence of hidden states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large)

pooled_output: a torch.FloatTensor of size [batch_size, hidden_size], which is the output of a classifier pretrained on top of the hidden state associated with the first token of the input ([CLS]) for the Next-Sentence prediction task (see the BERT paper).

An example of how to use this type of model is provided by the extract_features.py script, which can be used to extract the hidden states of the model for a given input.
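
To make these inputs and outputs concrete, here is a minimal usage sketch for a toy batch. It assumes the classes in modeling.py can be imported directly and that BertConfig exposes a from_json_file helper; check modeling.py for the exact constructor and forward signatures:

import torch
from modeling import BertConfig, BertModel  # classes provided in this repo's modeling.py

# Build the model from the configuration shipped with the pretrained checkpoint.
# (In a real setup you would also load the converted pytorch_model.bin weights.)
config = BertConfig.from_json_file("bert_config.json")  # assumed helper, see modeling.py
model = BertModel(config)
model.eval()

# Toy batch: 2 sequences of length 8. Real token indices come from vocab.txt via
# the tokenizer; random indices are enough to check shapes.
input_ids = torch.randint(0, 1000, (2, 8), dtype=torch.long)
token_type_ids = torch.zeros(2, 8, dtype=torch.long)   # every token belongs to sentence A
attention_mask = torch.ones(2, 8, dtype=torch.long)    # no padding in this toy batch

with torch.no_grad():
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, attention_mask)

print(len(all_encoder_layers))       # 12 for BERT-base, 24 for BERT-large
print(all_encoder_layers[-1].shape)  # [2, 8, hidden_size]
print(pooled_output.shape)           # [2, hidden_size]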

2. BertForSequenceClassification

BertForSequenceClassification is a fine-tuning model that consists of BertModel with a sequence-level classifier on top.

The sequence-level classifier is a linear layer that takes as input the last hidden state associated with the first token ([CLS]) of the input sequence (see Figures 3a and 3b in the BERT paper).

An example of how to use such a model is provided by the run_classifier.py script, which can be used to fine-tune a single-sequence (or sequence-pair) classifier with BERT, for example on the MRPC task.
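
To make the shape of this classifier concrete, here is a hedged sketch of a sequence-level head built on top of BertModel's pooled output. The module name, dropout value and loss are illustrative assumptions, not the repo's exact code:

import torch
import torch.nn as nn

class ToySequenceClassificationHead(nn.Module):
    """Linear classifier over the pooled [CLS] representation (illustrative only)."""
    def __init__(self, hidden_size=768, num_labels=2, dropout_prob=0.1):
        super(ToySequenceClassificationHead, self).__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output, labels=None):
        # pooled_output: [batch_size, hidden_size] from BertModel
        logits = self.classifier(self.dropout(pooled_output))
        if labels is None:
            return logits
        # Cross-entropy loss for fine-tuning on a labeled task such as MRPC.
        return nn.CrossEntropyLoss()(logits, labels)

In the repo, the head is wired directly into the model class so that the classifier and BERT are fine-tuned end to end.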

3. BertForQuestionAnswering

BertForQuestionAnswering is a fine-tuning model consisting of BertModel with a token-level classifier on top of the full sequence of last hidden states.

The token-level classifier takes the full sequence of last hidden states as input and computes scores for each token, for example a start score and an end score for the answer span in SQuAD (see Figures 3c and 3d of the BERT paper).

An example of how to use such a model is provided by the run_squad.py script, which can be used to fine-tune a token classifier using BERT, such as for the SQuAD task.
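
As a sketch of what such a token-level head computes, the snippet below projects each token's final hidden state onto two scores, a start score and an end score for the answer span, which is the general shape of a SQuAD-style head. It is an illustration, not the repo's exact module:

import torch
import torch.nn as nn

class ToySpanClassificationHead(nn.Module):
    """Per-token start/end scores over the last hidden states (illustrative only)."""
    def __init__(self, hidden_size=768):
        super(ToySpanClassificationHead, self).__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)  # one score each for start and end

    def forward(self, sequence_output):
        # sequence_output: [batch_size, sequence_length, hidden_size] from BertModel
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        # Each of shape [batch_size, sequence_length]; the predicted answer span is
        # taken from the best-scoring start/end positions.
        return start_logits.squeeze(-1), end_logits.squeeze(-1)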

Installation, requirements, tests

This code was tested on Python 3.5+. The prerequisites are:

PyTorch (>= 0.4.1)

tqdm

Install dependencies:

pip install -r ./requirements.txt

The tests folder contains a series of tests that can be run with pytest (install pytest if needed: pip install pytest).

You can run tests with:

python -m pytest -sv tests/

Batch training: gradient accumulation, multi-GPU, distributed training

BERT-base and BERT-large have 110M and 340M parameters, respectively, and it is difficult to fine-tune them on a single GPU for good performance at the recommended batch size (32 in most cases).

To help fine-tune these models, we introduce four techniques that can be activated in the fine-tuning scripts run_classifier.py and run_squad.py: CPU-based optimization, gradient accumulation, multi-GPU training, and distributed training.

For more details on how to use these techniques, you can read this article on PyTorch batch training techniques:

https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
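
As a quick illustration of the gradient-accumulation idea covered in that article, here is a generic, self-contained PyTorch sketch with a toy linear model standing in for BERT; the repo's scripts expose the same idea through the --gradient_accumulation_steps flag:

import torch
import torch.nn as nn

# Toy setup: a linear model and random data stand in for BERT and a real data loader.
model = nn.Linear(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
data_loader = [(torch.randn(8, 16), torch.randint(0, 2, (8,), dtype=torch.long))
               for _ in range(8)]

accumulation_steps = 4  # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(data_loader):
    loss = criterion(model(inputs), labels)
    # Scale the loss so the accumulated gradient matches one large batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # weight update only every accumulation_steps mini-batches
        optimizer.zero_grad()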

Fine-tuning BERT: running the examples

We show the same examples as the original implementation: fine-tuning a sequence-level classifier on the MRPC classification corpus and fine-tuning a token-level classifier on the question answering dataset SQuAD.

Before running these examples, download the GLUE data and unpack it to some directory $GLUE_DIR. Also download the BERT-Base checkpoint, unpack it to some directory $BERT_BASE_DIR, and convert it to PyTorch as described in the previous section.

This example fine-tunes BERT-Base on the Microsoft Research Paraphrase Corpus (MRPC) and takes less than 10 minutes to run on a single K-80.

export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_PYTORCH_DIR/pytorch_model.bin \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/

Evaluating with the hyperparameters of the original implementation yields results between 84% and 88%.

The second example is to fine-tune BERT-Base on the SQuAD question answering task.

export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_PYTORCH_DIR/pytorch_model.bin \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ../debug_squad/

Training with the previous hyperparameters yields the following results:

{"f1": 88.52381567990474, "exact_match": 81.22043519394512}

Fine-tuning BERT-large on GPUs

The options listed above allow for easy fine-tuning of BERT-large on GPUs instead of using TPUs like the original implementation.

For example, fine-tuning a BERT-large model for the SQuAD task can be done in 18 hours with four K-80s on a single server. Our results are similar to (actually slightly better than) those of the TensorFlow implementation:

{"exact_match": 84.56953642384106, "f1": 91.04028647786927}

To get these results, we used a combination of the following:

Multi-GPU training (automatically activated on multi-GPU servers),

Gradient accumulation

CPU-based optimization: the optimization step is performed on the CPU, with Adam's averages stored in RAM (see the sketch below).
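
The sketch below illustrates the general idea behind that last technique: keep an optimizer-side copy of the parameters, so the Adam state lives in CPU RAM, run the forward/backward pass on the GPU, and shuttle gradients and updated weights between the two. It is a simplified illustration of the idea, not the repo's --optimize_on_cpu implementation:

import torch
import torch.nn as nn

# Toy stand-ins for the model and a batch; a real setup would use BERT-large.
# Requires a CUDA device.
model = nn.Linear(16, 2).cuda()
criterion = nn.CrossEntropyLoss()

# CPU-side copy of the parameters; Adam's moment estimates then live in CPU RAM.
cpu_params = [p.detach().cpu().clone().requires_grad_(True) for p in model.parameters()]
optimizer = torch.optim.Adam(cpu_params, lr=3e-5)

inputs = torch.randn(8, 16).cuda()
labels = torch.randint(0, 2, (8,), dtype=torch.long).cuda()

# Forward/backward run on the GPU as usual.
model.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()

# Copy gradients to the CPU copy, run the Adam step there, then copy weights back.
for cpu_p, gpu_p in zip(cpu_params, model.parameters()):
    cpu_p.grad = gpu_p.grad.detach().cpu()
optimizer.step()
with torch.no_grad():
    for cpu_p, gpu_p in zip(cpu_params, model.parameters()):
        gpu_p.copy_(cpu_p)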

Here is the full list of hyperparameters we used in this run:

python ./run_squad.py \
  --vocab_file $BERT_LARGE_DIR/vocab.txt \
  --bert_config_file $BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint $BERT_LARGE_DIR/pytorch_model.bin \
  --do_lower_case \
  --do_train \
  --do_predict \
  --train_file $SQUAD_TRAIN \
  --predict_file $SQUAD_EVAL \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR/bert_large_bsz_24 \
  --train_batch_size 24 \
  --gradient_accumulation_steps 2 \
  --optimize_on_cpu
