Figure 1: Illustration of efficient fine-tuning methods added in v3.1 of adapter-transformers.

Throughout the last few months, the field of parameter-efficient fine-tuning methods for Transformer-based models has seen a wide range of innovations, with new adapter methods being proposed (e.g. He et al., 2021; Liu et al., 2022) and applied to new domains and tasks (e.g. Chen et al., 2022). With the newest release of our adapter-transformers library, version 3.1, we take a further step towards integrating these diverse possibilities of parameter-efficient fine-tuning by supporting multiple new adapter methods and Transformer architectures.

In the following sections, we highlight important new features and methods introduced with the new release. The full changelog can be found here.

You can find adapter-transformers on GitHub or install it via pip:

pip install -U adapter-transformers

New Adapter Methods

With the release of adapter-transformers v3 a few months back, we started the process of integrating new adapter methods. The new release v3.1 adds three new works that were released throughout the last year, namely LoRA (Hu et al., 2021), UniPELT (Mao et al., 2022) and (IA)^3 (Liu et al., 2022).

Previously, we had already integrated bottleneck adapters (Houlsby et al., 2019), Prefix Tuning (Li and Liang, 2021), parallel adapters, Mix-and-Match adapters (He et al., 2021) and Compacters (Mahabadi et al., 2021). For more on these methods, please refer to the blog post for the release of v3. For a more general introduction to working with adapters, please refer to our documentation.

The following table compares the performance of our implementations of LoRA, (IA)^3 and bottleneck adapters, which are described in more detail below, on the GLUE benchmark. We use roberta-base as the base Transformer model and train for 20 epochs with learning rates of 1e-3, 1e-4 and 1e-4 for (IA)^3, LoRA and bottleneck adapters, respectively; a rough sketch of this training setup is shown after the table.

| Task  | Metric               | (IA)^3 | LoRA  | Adapter (Houlsby) |
|-------|----------------------|--------|-------|-------------------|
| CoLA  | Matthews correlation | 59.53  | 58.35 | 59.81             |
| MNLI  | Accuracy             | 85.98  | 87.15 | 86.68             |
| MRPC  | F1                   | 89.50  | 90.63 | 90.53             |
| QNLI  | Accuracy             | 91.75  | 92.82 | 92.70             |
| QQP   | F1                   | 85.96  | 86.57 | 88.41             |
| RTE   | Accuracy             | 73.41  | 72.08 | 77.90             |
| SST-2 | Accuracy             | 93.92  | 94.11 | 94.50             |
| STS-B | Spearman correlation | 89.78  | 89.82 | 90.58             |
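
As a rough sketch of how such a training run can be set up with adapter-transformers, the snippet below trains a LoRA adapter on a single GLUE task. The task, head/adapter name and the train_dataset variable are placeholder assumptions for illustration, not the exact script used for these results:

from transformers import TrainingArguments
from transformers.adapters import AdapterTrainer, AutoAdapterModel, LoRAConfig

# Load roberta-base with adapter support and add a task head (MRPC used as a placeholder task)
model = AutoAdapterModel.from_pretrained("roberta-base")
model.add_classification_head("mrpc", num_labels=2)

# Add a LoRA adapter and freeze all other model weights for training
model.add_adapter("mrpc", config=LoRAConfig())
model.train_adapter("mrpc")

training_args = TrainingArguments(
    output_dir="./mrpc_lora",
    learning_rate=1e-4,       # LoRA learning rate from the setup above
    num_train_epochs=20,
)
trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: a tokenized GLUE train split
)
trainer.train()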

LoRA

Figure 2: Illustration of the Low-Rank Adaptation (LoRA) method within one Transformer layer. Trained components are colored in shades of magenta.

Low-Rank Adaptation (LoRA) is an efficient fine-tuning technique proposed by Hu et al. (2021). LoRA injects trainable low-rank decomposition matrices into the layers of a pre-trained model. For any model layer expressed as a matrix multiplication of the form $h = W_0 x$, it therefore performs a reparameterization, such that:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

Here, $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ are the decomposition matrices and $r$, the low-dimensional rank of the decomposition, is the most important hyperparameter.
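
To illustrate why this decomposition is parameter-efficient, here is a small NumPy sketch with assumed dimensions (a roberta-base-sized layer and r = 8); it only counts parameters and is not part of the library:

import numpy as np

d, k, r = 768, 768, 8       # assumed layer dimensions and LoRA rank

delta_W = np.zeros((d, k))  # a full weight update would train d * k parameters
B = np.zeros((d, r))        # LoRA decomposition matrix B
A = np.zeros((r, k))        # LoRA decomposition matrix A

print(delta_W.size)         # 589824 parameters for the full update
print(B.size + A.size)      # 12288 parameters (~2%) for the low-rank decomposition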

While, in principle, this reparameterization can be applied to any weight matrix in a model, the original paper only adapts the attention weights of the Transformer self-attention sub-layer with LoRA. adapter-transformers additionally allows injecting LoRA into the dense feed-forward layers in the intermediate and output components of a Transformer block. You can configure the locations where LoRA weights should be injected using the attributes in the LoRAConfig class.

Example:

from transformers.adapters import AutoAdapterModel, LoRAConfig

# Load a pre-trained model with adapter support
model = AutoAdapterModel.from_pretrained("roberta-base")

config = LoRAConfig(r=8, alpha=16)
model.add_adapter("lora_adapter", config=config)

In the design of LoRA, Hu et al. (2021) also pay special attention to keeping the inference latency overhead compared to full fine-tuning at a minimum. To accomplish this, the LoRA reparameterization can be merged with the original pre-trained weights of a model for inference. Thus, the adapted weights are directly used in every forward pass without passing activations through an additional module. In adapter-transformers, this can be realized using the built-in merge_adapter() method:

model.merge_adapter("lora_adapter")

To continue training on this LoRA adapter or to deactivate it entirely, the merged weights first have to be reset again:

model.reset_adapter("lora_adapter")
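
To see why merging does not change model outputs, here is a small illustrative NumPy sketch of the reparameterization above (not the library's internal code): adding the scaled product of the decomposition matrices to the pre-trained weights yields the same result as the unmerged forward pass.

import numpy as np

d, k, r, alpha = 768, 768, 8, 16   # dimensions matching the LoRAConfig example above
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))       # frozen pre-trained weights
B = rng.normal(size=(d, r))        # LoRA decomposition matrices
A = rng.normal(size=(r, k))
x = rng.normal(size=k)

h_unmerged = W0 @ x + (alpha / r) * (B @ (A @ x))   # reparameterized forward pass
W_merged = W0 + (alpha / r) * (B @ A)               # weights after merging
h_merged = W_merged @ x

print(np.allclose(h_unmerged, h_merged))            # True: merging preserves outputs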

(IA)^3

Figure 3: Illustration of the (IA)^3 method within one Transformer layer. Trained components are colored in shades of magenta.

Infused Adapter by Inhibiting and Amplifying Inner Activations ((IA)^3) is an efficient fine-tuning method proposed within the T-Few fine-tuning approach by Liu et al. (2022). (IA)^3 introduces trainable vectors $l_W$ into different components of a Transformer model, which perform element-wise rescaling of inner model activations. For any model layer expressed as a matrix multiplication of the form $h = W x$, it therefore performs an element-wise multiplication with $l_W$, such that:

$$h = l_W \odot W x$$

Here, $\odot$ denotes element-wise multiplication, where the entries of $l_W$ are broadcast to the shape of $W x$.
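
As a small illustrative NumPy sketch (assumed dimensions, not the library's internal code), the rescaling and broadcasting look as follows:

import numpy as np

d, k, batch = 768, 768, 4          # assumed dimensions
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))        # frozen pre-trained weight matrix
x = rng.normal(size=(batch, k))    # a batch of input activations
l_w = np.ones(d)                   # trainable (IA)^3 rescaling vector, initialized to ones

h = l_w * (x @ W.T)                # l_w is broadcast over the batch dimension
print(h.shape)                     # (4, 768)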

Example:

from transformers.adapters import IA3Config

config = IA3Config()
model.add_adapter("ia3_adapter", config=config)

The implementation of (IA)^3, as well as the IA3Config class, are derived from the implementation of LoRA, with a few main modifications. First, (IA)^3 uses multiplicative composition of weights instead of additive composition as in LoRA. Second, the added weights are not further decomposed into low-rank matrices. Both of these modifications are controlled via the composition_mode configuration attribute by setting composition_mode="scale". Additionally, as the added weights are already of rank 1, r=1 is set.

Beyond that, both methods share the same configuration attributes that allow you to specify in which Transformer components rescaling vectors will be injected. Following the original implementation, IA3Config adds rescaling vectors to the self-attention weights (selfattn_lora=True) and the final feed-forward layer (output_lora=True). Further, you can modify which matrices of the attention mechanism to rescale by leveraging the attn_matrices attribute. By default, (IA)^3 injects weights into the key ('k') and value ('v') matrices, but not in the query ('q') matrix.
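
Conceptually, the (IA)^3 setup thus roughly corresponds to a LoRA configuration with the attributes described above. The following sketch is purely illustrative and may not match the exact defaults of IA3Config:

from transformers.adapters import LoRAConfig

# Rough approximation of the (IA)^3 setup via LoRAConfig (illustrative; exact
# IA3Config defaults may differ)
config = LoRAConfig(
    selfattn_lora=True,        # rescale self-attention weights ...
    attn_matrices=["k", "v"],  # ... but only the key and value matrices
    output_lora=True,          # also rescale the final feed-forward layer
    r=1,                       # the added weights are rank-1 vectors
    composition_mode="scale",  # multiplicative instead of additive composition
)
model.add_adapter("ia3_like_adapter", config=config)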

Finally, similar to LoRA, (IA)^3 also allows merging the injected parameters with the original weight matrices of the Transformer model. E.g.:

# Merge (IA)^3 adapter
model.merge_adapter("ia3_adapter")

# Reset merged weights
model.reset_adapter("ia3_adapter")

UniPELT

Figure 4: Illustration of the UniPELT method within one Transformer layer. Trained components are colored in shades of magenta.

An approach similar to the work of He et al. (2021) is taken by Mao et al. (2022) in their UniPELT framework. They, too, combine multiple efficient fine-tuning methods, namely LoRA, Prefix Tuning and bottleneck adapters, in a single unified setup. UniPELT additionally introduces a gating mechanism that controls the activation of the different submodules.

Concretely, for each adapted module $m$, UniPELT adds a trainable gating value $\mathcal{G}_m \in (0, 1)$ that is computed via a feed-forward network ($W_{\mathcal{G}_m}$) and sigmoid activation ($\sigma$) from the Transformer layer input states ($x$):

$$\mathcal{G}_m \leftarrow \sigma(W_{\mathcal{G}_m} \cdot x)$$

These gating values are then used to scale the output activations of the injected adapter modules, e.g. for a LoRA layer:

$$h \leftarrow W_0 x + \mathcal{G}_{LoRA} B A x$$
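
As a simplified, conceptual PyTorch-style sketch of this gated LoRA computation (not the library's internal implementation; e.g. how gating values are pooled over tokens may differ):

import torch
import torch.nn as nn

# Simplified sketch of a gated LoRA layer (illustrative only)
d, r = 768, 8
x = torch.randn(4, d)                 # a batch of layer input states

W0 = nn.Linear(d, d, bias=False)      # frozen pre-trained weights W_0
lora_A = nn.Linear(d, r, bias=False)  # LoRA down-projection A
lora_B = nn.Linear(r, d, bias=False)  # LoRA up-projection B
gate_ffn = nn.Linear(d, 1)            # feed-forward network computing the gate

gate = torch.sigmoid(gate_ffn(x))     # gating value G_LoRA in (0, 1), one scalar per input
h = W0(x) + gate * lora_B(lora_A(x))  # gated LoRA output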

In the configuration classes of adapter-transformers, these gating mechanisms can be activated via use_gating=True. The full UniPELT setup can be instantiated using UniPELTConfig [1]:

from transformers.adapters import UniPELTConfig

config = UniPELTConfig()
model.add_adapter("unipelt", config=config)

which is identical to the following ConfigUnion:

from transformers.adapters import ConfigUnion, LoRAConfig, PrefixTuningConfig, PfeifferConfig

config = ConfigUnion(
    LoRAConfig(r=8, use_gating=True),
    PrefixTuningConfig(prefix_length=10, use_gating=True),
    PfeifferConfig(reduction_factor=16, use_gating=True),
)
model.add_adapter("unipelt", config=config)

Finally, as the gating values for each adapter module might provide interesting insights for analysis, adapter-transformers comes with an integrated mechanism for returning all gating values computed during a model forward pass via the output_adapter_gating_scores parameter:

outputs = model(**inputs, output_adapter_gating_scores=True)
gating_scores = outputs.adapter_gating_scores

Note that this parameter is only available to base model classes and AdapterModel classes. In the example, gating_scores holds a dictionary of the following form:

{
    '<adapter_name>': {
        <layer_id>: {
            '<module_location>': np.array([...]),
            ...
        },
        ...
    },
    ...
}
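
For example, the returned dictionary can be iterated to aggregate gating values per layer and module location (a small illustrative sketch, reusing the "unipelt" adapter name from above):

import numpy as np

# Average gating value per layer and module location for the "unipelt" adapter
for layer_id, modules in gating_scores["unipelt"].items():
    for location, scores in modules.items():
        print(f"layer {layer_id}, {location}: {np.mean(scores):.3f}")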

The following table shows some initial results when running our UniPELT implementation. All adapter setups are trained for 20 epochs with a learning rate of 1e-4. Reported scores are accuracies [2].

| Task  | Model             | UniPELT (ours) | UniPELT (paper) |
|-------|-------------------|----------------|-----------------|
| SST-2 | bert-base-uncased | 92.32          | 91.51           |
| SST-2 | roberta-base      | 94.61          | ---             |
| MNLI  | bert-base-uncased | 84.53          | 83.89           |
| MNLI  | roberta-base      | 87.41          | ---             |

Further Updates

New model integrations

Version 3.1 adds adapter support to the DeBERTa and Vision Transformer (ViT) architectures already integrated into HuggingFace Transformers.

The ViT integration is of particular interest as it opens up the application of our adapter implementations to the computer vision domain. While most of the current work on adapter methods for Transformers has happened in the NLP domain, adapters for Transformers in the vision domain have also been investigated recently (He et al., 2022; Chen et al., 2022).
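
Adding adapters to a ViT model works the same way as for the NLP models above; a minimal sketch (the adapter name is a placeholder):

from transformers.adapters import AutoAdapterModel

# Load the pre-trained ViT checkpoint used below with adapter support
model = AutoAdapterModel.from_pretrained("google/vit-base-patch16-224")

# Add and activate a bottleneck adapter for an image classification task
model.add_adapter("image_adapter")
model.train_adapter("image_adapter")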

Below, we show some initial results of our ViT integration, using google/vit-base-patch16-224 as the pre-trained base model:

| Task      | Full FT | Houlsby | Pfeiffer | LoRA  | Prefix Tuning |
|-----------|---------|---------|----------|-------|---------------|
| CIFAR-10  | 98.88   | 98.72   | 99.09    | 98.84 | 98.76         |
| CIFAR-100 | 92.08   | 92.40   | 92.08    | 91.77 | 91.76         |

All scores are accuracies on the evaluation set [2].

adapter_summary() method

The new release adds an adapter_summary() method that provides information on all adapters currently loaded into a base model in tabular form. The method can be used as follows:

from transformers.adapters import ADAPTER_CONFIG_MAP, AutoAdapterModel

model = AutoAdapterModel.from_pretrained("roberta-base")
# Add one adapter per pre-defined configuration in the config map
for name, config in ADAPTER_CONFIG_MAP.items():
    model.add_adapter(name, config=config)
print(model.adapter_summary())

... which produces this output:

================================================================================
Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
pfeiffer                 bottleneck          894,528       0.718       0       1
houlsby                  bottleneck        1,789,056       1.435       0       1
pfeiffer+inv             bottleneck        1,190,592       0.955       0       1
houlsby+inv              bottleneck        2,085,120       1.673       0       1
compacter++              bottleneck           28,576       0.023       0       1
compacter                bottleneck           57,088       0.046       0       1
prefix_tuning            prefix_tuning     9,872,384       7.920       0       1
prefix_tuning_flat       prefix_tuning       552,960       0.444       0       1
parallel                 bottleneck        7,091,712       5.689       0       1
scaled_parallel          bottleneck        7,091,724       5.690       0       1
lora                     lora                294,912       0.237       0       1
ia3                      lora                 55,296       0.044       0       1
mam                      union            22,493,984      18.046       0       1
unipelt                  union            11,083,376       8.892       0       1
--------------------------------------------------------------------------------
Full model                               124,645,632     100.000               1
================================================================================

Transformers upgrade

Version 3.1 of adapter-transformers upgrades the underlying HuggingFace Transformers library from v4.17.0 to v4.21.3, bringing many new features and bug fixes created by HuggingFace.

References

  • Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ArXiv, abs/2106.09685.
  • Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., & Raffel, C. (2022). Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. ArXiv, abs/2205.05638.
  • Mao, Y., Mathias, L., Hou, R., Almahairi, A., Ma, H., Han, J., Yih, W., & Khabsa, M. (2021). UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning. ArXiv, abs/2110.07577.
  • He, X., Li, C., Zhang, P., Yang, J., & Wang, X. (2022). Parameter-efficient Fine-tuning for Vision Transformers. ArXiv, abs/2203.16329.
  • Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., & Luo, P. (2022). AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition. ArXiv, abs/2205.13535.

  1. Note that the implementation of UniPELT in adapter-transformers follows the implementation in the original code, which is slightly different from the description in the paper. See here for more. 

  2. Reported results for adapter-transformers only contain a single run each without hyperparameter tuning.