Adapters are becoming increasingly important in machine learning for NLP. For instance, they enable us to efficiently train and share new task-specific models. Adapters are small layers that are stitched into pre-trained transformer-based models. During training, only the parameters of the adapter layers are finetuned, while the parameters of the pre-trained model remain frozen. As a result, it is sufficient to store only the adapter layers instead of storing fully finetuned models separately for each task. Furthermore, the lower number of parameters requires less memory and makes it easier to share the trained adapters. Adapters also enable new possibilities in transfer learning: as adapters are encapsulated between frozen layers, they can be regarded as modular units which can be composed in a number of different ways (for more details and examples, check out this blog post). Bapna et al. (2019) have shown that adapters are useful for sequence-to-sequence tasks. On a neural machine translation task, they achieved results with adapters similar to those of a fully finetuned model. The modularity of adapters in zero-shot machine translation has recently been demonstrated by Philip et al. (2020).
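
To make the idea concrete, a bottleneck adapter can be sketched roughly as follows. This is an illustrative PyTorch module, not the exact AdapterHub implementation; the reduction factor is an example value.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Illustrative sketch of a bottleneck adapter: a down-projection,
    # a non-linearity, an up-projection, and a residual connection.
    def __init__(self, hidden_size, reduction_factor=16):
        super().__init__()
        bottleneck_size = hidden_size // reduction_factor
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.non_linearity = nn.ReLU()
        self.up = nn.Linear(bottleneck_size, hidden_size)

    def forward(self, hidden_states):
        # Only these few parameters are trained; the surrounding
        # transformer weights remain frozen.
        return hidden_states + self.up(self.non_linearity(self.down(hidden_states)))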

The AdapterHub framework makes adapters easy to use. Up until now, the framework has included adapters for the models BERT, RoBERTa, XLM-RoBERTa, and DistilBERT. The new version 2.0 of the framework also provides adapters for the language generation models BART and GPT-2, allowing researchers and engineers to use adapters for sequence-to-sequence tasks.

Results of BART and GPT-2 with adapters

Before we dive into generation tasks, we will take a look at the performance on the GLUE benchmark. We compare the scores of a fully finetuned model with the scores of adapter-based models, using either the adapter configuration of Pfeiffer et al. (2020a) or that of Houlsby et al. (2019). The GPT-2 and BART models achieve the following scores:

GPT-2 | Full | Pfeiffer | Houlsby
RTE | 65.0 | 67.1 | 67.5
MRPC | 83.8 | 83.5 | 80.4
STS-B | 86.7 | 85.3 | 85.4
CoLA | 33.6 | 43.0 | 41.2
SST-2 | 90.0 | 90.5 | 90.9
QNLI | 87.6 | 88.2 | 88.5
MNLI | 82.2 | 81.6 | 81.7
QQP | 88.5 | 87.1 | 87.7

The fully finetuned GPT-2 model is trained for 4 epochs with a learning rate of 1e-4. The adapters are trained for 10 epochs with a learning rate of 1e-4.

BART | Full | Pfeiffer | Houlsby
RTE | 71.12 | 69.7 | 69.1
MRPC | 87.5 | 86.8 | 88.2
STS-B | 89.0 | 88.1 | 88.3
CoLA | 46.6 | 46.1 | 45.6
SST-2 | 92.7 | 93.7 | 93.6
QNLI | 91.6 | 92.2 | 93.6
MNLI | 85.7 | 85.9 | 85.9
QQP | 89.3 | 88.4 | 88.6

The fully finetuned BART model is trained for 3 epochs with a learning rate of 4e-5. The adapters are trained with early stopping for a maximum of 15 epochs with a learning rate of 1e-4.

The results of the adapters are comparable to those of the fully finetuned model. On some tasks, such as SST-2, the adapters achieve a higher score than the fully finetuned model for both GPT-2 and BART. This matches the results reported for other models with adapters. In general, we can use adapters instead of fully finetuning the model without a deterioration in downstream task performance.
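
Switching between the two adapter architectures only requires choosing a different configuration when the adapter is added. As a minimal sketch (the checkpoint, task name, and label count below are example values, not the exact training setup used for the results above), this could look as follows:

from transformers import AutoModelForSequenceClassification

# Example: add a task adapter for RTE, selecting the adapter architecture
# via the config argument.
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-base", num_labels=2)
model.add_adapter("rte", config="pfeiffer")  # or config="houlsby"
# Freeze the pre-trained weights and activate the adapter for training.
model.train_adapter("rte")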

Now we will take a look at the scores for sequence-to-sequence tasks. We train a GPT-2 model on the task proposed by Chen et al. (2020). This task requires the model to learn to generate sentences that are entailed by the input. For example, given a table containing the release dates for an album, the model is provided with a template and has the objective to fill in the blanks.

Template: [ENT] was released in 6 [ENT] in [ENT].

Gold sentence: Black Ice was released in 6 Countries in 2008.

It is not sufficient for the model to simply copy a number from the table; it needs to count all the countries in which the album was released in 2008. We trained the GPT-2 small model with its standard vocabulary using maximum likelihood estimation. The results are given in the following table:

Model | BLEU-1 | BLEU-2 | BLEU-3 | Adv-Acc
GPT-2 | 48.8 | 27.1 | 12.6 | 62.3
GPT-2 + Pfeiffer | 46.3 | 24.8 | 11.2 | 60.1
GPT-2 + Houlsby | 45.5 | 23.9 | 10.5 | 59.7

We observe that the models with adapters achieve results competitive with full model finetuning. However, adapters have several advantages over full finetuning: they have shorter training times, require less memory to store, and can easily be shared.

To test the BART model on sequence-to-sequence tasks, we evaluated it on the CNN/Daily Mail dataset (Hermann et al., 2015; See et al., 2017) and the Extreme Summarization (XSum) dataset (Narayan et al., 2018). Both tasks have the objective of summarizing newspaper articles. The main difference is that XSum requires the model to output short, single-sentence summaries. The results of the fully finetuned BART model and the adapter-based models are as follows:

Model | R1 | R2 | RL
CNN/Daily Mail | 44.16 | 21.28 | 40.90
CNN/Daily Mail + Pfeiffer | 43.40 | 20.86 | 30.66

Model | R1 | R2 | RL
XSum | 45.14 | 22.27 | 37.26
XSum + Pfeiffer | 43.56 | 20.56 | 35.56
XSum + Houlsby | 44.03 | 20.90 | 36.01

Similar to the GPT-2 model, the BART model achieves the highest scores when it is fully finetuned. The models with adapters achieve slightly lower scores, further indicating that adapters might in general score slightly lower on sequence-to-sequence tasks. However, as previously stated, they have several other advantages.
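
As an illustration of how a trained summarization adapter could be used at inference time, consider the following sketch; the adapter path, article text, and generation parameters are placeholders rather than the exact setup used for the results above.

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
# Load a separately trained summarization adapter and activate it.
adapter_name = model.load_adapter("path/to/summarization-adapter")
model.set_active_adapters(adapter_name)

article = "Some long news article ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))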

Version 2.0 of the AdapterHub framework opens up new possibilities, such as experimenting with summarization and text generation tasks. Adapters for BART and GPT-2 enable us to tackle a wide variety of text generation tasks.

Hands-on example: Train an adapter to write poems

To illustrate how we can use adapters for text generation, we provide a hands-on example of training an adapter within GPT-2 on a poem dataset by Sheng et al. (2020) and letting the model create novel poems. The dataset contains poems from Project Gutenberg. The full code is available in the corresponding Colab notebook. If you have read the previous blog post, this might look very familiar. First, we need to add our adapter, which is easily done with just a few lines of code:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
# Add a new adapter
model.add_adapter("poem")
# Activate the adapter for training
model.train_adapter("poem")

We have created the GPT-2 model and added an adapter with add_adapter(), passing only the name of the adapter, "poem". After adding the new adapter, we call train_adapter() with the name of our adapter. This does two things: first, it freezes all parameters of the pre-trained model so that only the parameters of the adapter are updated during training; second, it activates the adapter so that it is used in the forward pass. Next, we can train our model the same way we would without an adapter.
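
For illustration, a plain PyTorch training loop could look roughly like the sketch below; the train_dataloader of tokenized poems, the batch contents, and the number of epochs are assumptions rather than the exact setup from the notebook.

import torch

# Assumption: `train_dataloader` yields batches of tokenized poems,
# each a dict containing an "input_ids" tensor.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

model.train()
for epoch in range(10):
    for batch in train_dataloader:
        # For causal language modeling, the labels are the input ids themselves.
        outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In the end, we can save our trained adapter as follows.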

model.save_adapter("path/to/adapter", "poem")

We call save_adapter() and provide the path to the directory where the adapter should be saved, as well as the name of the adapter we want to save. Now that we have our trained adapter, we want to generate some poems and see what it has learned. First, we create a model with a language modeling head and load our trained adapter. Then we activate the loaded adapter.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.load_adapter("path/to/adapter")
model.set_active_adapters("poem")

With load_adapter(), we can load an adapter from the Hub by passing the name under which it is published, or load a local adapter by providing its path. Then we activate our adapter with set_active_adapters() so that it is used in the forward pass. Finally, we can think of the beginning of a poem and let the model finish it. In this case, the model generates 5 poems for the given beginning, and we can choose the one we like most. We choose to start our poem with "In the night". One of the poems our model generated was:

In the night;
when the stars shine on her head.
the mounds are deep,
and the water's dark,
and the water's cold
and with her hand,
with her lips,
in song and song,
the sound of the birds
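
For reference, the generation step that produces such continuations could be sketched as follows; the maximum length and sampling settings are illustrative choices, not the exact parameters used above.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Encode the beginning of the poem and sample five possible continuations.
input_ids = tokenizer.encode("In the night", return_tensors="pt")
outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    num_return_sequences=5,
)
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))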

This can easily be applied to other datasets. Feel free to train your own adapter and upload it to the Hub, or browse the adapters trained by the community.

Conclusion

The new version 2.0 of the AdapterHub framework supports adapters for GPT-2 and BART. The support for these two models opens up new possibilities for solving sequence-to-sequence tasks with adapters. To check out AdapterHub and its other features, visit us on GitHub.

Acknowledgements

We thank André Fellenberg for the BART illustration.

References

Citation

@misc{sterz_2021, 
  title={Adapters for Generative and Seq2Seq Models in NLP},
  url={https://adapterhub.ml/blog/2021/04/adapters-for-generative-and-seq2seq-models-in-nlp/}, 
  author={Hannah Sterz and Clifton Poth and Andreas R\"uckl\'e and Jonas Pfeiffer}, 
  year={2021}, 
  month={Apr}
}

* equal contribution