Exploring BERT's Vocabulary
Deep contextualized word representations have taken word representation to the next level by assigning word vectors to words in context - typically a sentence - instead of assigning a vector to each word type. ELMo and BERT are the most popular and successful examples of these embeddings. The authors of BERT released several versions of BERT pretrained on massive amounts of data, including a multilingual version which supports 104 languages in a single model.
Multilingual BERT Vocabulary
I was admittedly intrigued by the idea of a single model for 104 languages with a large shared vocabulary. The vocabulary contains 119,547 WordPiece units, and the input is tokenized into word pieces (also known as subwords) so that each word piece is an element of the dictionary. Non-word-initial units are prefixed with `##` as a continuation symbol, except for Chinese characters, which are surrounded by spaces before any tokenization takes place. The tokenizer favors longer word pieces, with a de facto character-level model as a fallback, since every character is part of the vocabulary as a possible word piece. An example of such tokenization using Hugging Face’s PyTorch implementation of BERT (sketched below with the `transformers` package and the `bert-base-multilingual-cased` checkpoint) looks like this:
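```python
from transformers import BertTokenizer

# Multilingual cased tokenizer; the original post may have used an earlier
# release of the Hugging Face BERT implementation.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

tokenizer.tokenize('Elvégezhetitek')
# ['El', '##vé', '##ge', '##zhet', '##ite', '##k']
```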
This segmentation is purely frequency-based, and it is very different from what a true morphological segmenter would output (El-végez-het-itek). This example is certainly longer than an average Hungarian word, and the average word is not tokenized this aggressively, but BERT does produce a large number of word pieces for certain languages. I will examine this more closely in this post.
The first 106 symbols are reserved for constants like `PAD` and `UNK` and unused placeholders for application-specific symbols. 36.5% of the vocabulary are non-initial word pieces. The alphabet consists of 9,997 unique characters that are defined both as word-initial (`C`) and continuation (`##C`) symbols, which together make up 19,994 word pieces. The rest are multicharacter word pieces of various lengths.
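These numbers are easy to check against the released vocabulary file. A minimal sketch, assuming `vocab.txt` of the multilingual cased model has been downloaded locally:

```python
from collections import Counter

# Read the word pieces of bert-base-multilingual-cased (local copy assumed).
with open('vocab.txt', encoding='utf-8') as f:
    vocab = [line.rstrip('\n') for line in f]

continuation = [piece for piece in vocab if piece.startswith('##')]
print(len(vocab))                      # 119547
print(len(continuation) / len(vocab))  # roughly 0.365

# Length distribution of the word pieces, continuation prefix stripped.
lengths = Counter(
    len(piece[2:]) if piece.startswith('##') else len(piece) for piece in vocab
)
```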
As the length distribution shows, single-character symbols are the largest group, with a sharp drop after that and a somewhat surprising second peak at 4-character symbols. Most units are very short, with a few exceptions over 20 characters. The list of the 20 longest symbols features many German compounds and other long words frequent in Wikipedia:
Token | Length | Token | Length |
---|---|---|---|
bewerkingsgeschiedenis | 22 | Auseinandersetzungen | 20 |
ուսումնասիրությունների | 22 | தொகுக்கப்பட்டுள்ளது | 19 |
Territorialgeschichte | 21 | delstatshuvudstaden | 19 |
Europameisterschaften | 21 | Bevölkerungsstandes | 19 |
huvudavrinningsområde | 21 | Nationalsozialisten | 19 |
தேர்ந்தெடுக்கின்றனர் | 20 | Weltmeisterschaften | 19 |
Rechtswissenschaften | 20 | delavrinningsområde | 19 |
eenoogkreeftjessoort | 20 | bevolkingsdichtheid | 19 |
Årsmedeltemperaturen | 20 | Nationalsozialismus | 19 |
நிர்வகிக்கப்படுகிறது | 20 | Europameisterschaft | 19 |
The vocabulary was created using the top 100 Wikipedia dumps. The authors oversampled small Wikipedias, so we should see many non-English-looking word pieces in the vocabulary. To test this, I used the predefined ranges of Unicode code points obtained from here. If you are unfamiliar with Unicode, I highly recommend Joel Spolsky’s amazing introduction. I grouped the ranges into overlapping macro categories such as ASCII, Latin (including diacritics), CJK (Chinese, Japanese, Korean), CJK+Kana (CJK plus Hiragana and Katakana), Cyrillic, Korean Hangul, Indian (various alphabets used mainly in India), etc. The exact mapping is available here.
We can match ranges of Unicode characters with Python regular expressions; a sketch for the Hiragana range looks like this:
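```python
import re

# Word pieces made up entirely of Hiragana characters (U+3040 - U+309F);
# the ## continuation prefix is stripped before matching.
HIRAGANA = re.compile(r'^[\u3040-\u309f]+$')

def is_hiragana(piece):
    piece = piece[2:] if piece.startswith('##') else piece
    return bool(HIRAGANA.match(piece))
```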
We can extend this regular expression to allow digits but require at least one character from the Unicode range. The abstract regex looks like this: `[Hiragana or digit]*[Hiragana]+[Hiragana or digit]*`. It accepts one or more Hiragana characters (`[Hiragana]+`), prefixed and/or suffixed by zero or more Hiragana characters or digits (`[Hiragana or digit]*`).
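In code, the extended pattern might look like this (a sketch that only allows the ASCII digits 0-9):

```python
# At least one Hiragana character, with digits allowed before, between and after.
HIRAGANA_OR_DIGIT = re.compile(
    r'^[\u3040-\u309f0-9]*[\u3040-\u309f]+[\u3040-\u309f0-9]*$'
)
```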
I defined a regular expression for each macro category. A word piece belongs to a category if all of its characters fall into the category or are digits; word pieces consisting only of digits are excluded from the non-Latin, non-ASCII categories. It turns out that 78% of the word pieces fall into the Latin category, and most of them are pure ASCII (77% of all pieces). The rest of the categories are much smaller, probably because their respective Wikipedias are much smaller than those of the European languages. This table lists how many word pieces were matched by each category (a sketch of the counting code follows the table):
Script | Word pieces | % of vocabulary |
---|---|---|
Latin | 93495 | 78.21 |
ASCII | 92327 | 77.23 |
CJK+kana | 14932 | 12.49 |
Cyrillic | 13782 | 11.53 |
CJK | 13601 | 11.38 |
Indian | 6545 | 5.47 |
Arabic | 4873 | 4.08 |
Korean | 3273 | 2.74 |
Hebrew | 2482 | 2.08 |
Greek | 1566 | 1.31 |
Kana | 1331 | 1.11 |
Armenian | 1236 | 1.03 |
Georgian | 705 | 0.59 |
Misc | 639 | 0.53 |
Thai | 370 | 0.31 |
Myanmar | 271 | 0.23 |
Tibetan | 40 | 0.03 |
Mongolian | 4 | 0.0 |
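A minimal sketch of the counting, with a hypothetical `RANGES` mapping standing in for the full list of Unicode ranges:

```python
import re
from collections import Counter

# Hypothetical subset of the macro-category-to-Unicode-range mapping.
RANGES = {
    'Cyrillic': '\u0400-\u04ff',
    'Greek': '\u0370-\u03ff',
    'Kana': '\u3040-\u30ff',
}

def category_pattern(ranges):
    # Digits are allowed, but at least one character must come from the ranges,
    # so pure-digit word pieces never match.
    return re.compile(f'^[{ranges}0-9]*[{ranges}]+[{ranges}0-9]*$')

PATTERNS = {category: category_pattern(r) for category, r in RANGES.items()}

def count_categories(vocab):
    # Categories overlap, so a word piece may be counted more than once.
    counts = Counter()
    for piece in vocab:
        piece = piece[2:] if piece.startswith('##') else piece
        for category, pattern in PATTERNS.items():
            if pattern.match(piece):
                counts[category] += 1
    return counts
```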
Tokenizing Universal Dependency Treebanks
Universal Dependencies (UD) is a framework for grammatical annotation with treebanks available in more than 70 languages, 54 overlapping with BERT’s language list. The smallest treebanks are Tagalog (55 sentences) and Yoruba (100 sentences), while the largest ones are Czech (127,507) and Russian (69,683). I tokenized each treebank with `BertTokenizer` and compared the tokenization with the gold standard tokenization. The input to `BertTokenizer` was the full text form of the sentence.
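The comparison can be reproduced roughly as follows. This is a sketch rather than the original script; the CoNLL-U file name and the `sentences` helper are illustrative:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

def sentences(conllu_path):
    """Yield (raw_text, gold_tokens) pairs from a CoNLL-U file."""
    text, tokens = None, []
    with open(conllu_path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if line.startswith('# text = '):
                text = line[len('# text = '):]
            elif line and not line.startswith('#'):
                token_id, form = line.split('\t')[:2]
                if token_id.isdigit():  # skip multiword ranges like 3-4
                    tokens.append(form)
            elif not line and text is not None:
                yield text, tokens
                text, tokens = None, []

for text, gold_tokens in sentences('hu_szeged-ud-test.conllu'):  # illustrative file name
    bert_pieces = tokenizer.tokenize(text)
```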
Let’s define fertility, a term borrowed from statistical machine translation, as the average number of BERT word pieces corresponding to a single real token. The example `['El', '##vé', '##ge', '##zhet', '##ite', '##k']` has a fertility of 6, but we can expect lower values on average. Fertility would be 1 if all tokens were in BERT’s vocabulary. As illustrated in this plot, BERT has the lowest fertility in Galician (1.16) and the highest in Telugu (2.85). It should be noted that UD treebanks differ in tokenization; for example, Japanese tokenizes inflections as separate tokens, while Korean does not, even though their morphology shares many similarities. More aggressive tokenization results in lower fertility values.
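Building on the sketch above, per-treebank fertility is simply the ratio of BERT word pieces to gold tokens:

```python
def treebank_fertility(conllu_path):
    # Total BERT word pieces divided by total gold tokens over the treebank.
    piece_count, token_count = 0, 0
    for text, gold_tokens in sentences(conllu_path):
        piece_count += len(tokenizer.tokenize(text))
        token_count += len(gold_tokens)
    return piece_count / token_count
```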
Examining the proportion of continuation word pieces shows us a different picture. Since Chinese characters are presegmented, Chinese has barely any continuation word pieces (0.2%), with English (13.1%) and Vietnamese (13.5%) following. The highest proportions of continuation word pieces are found in Tamil (67.3%) and Telugu (64.7%).
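This ratio is straightforward to compute from the tokenizer output (again just a sketch):

```python
def continuation_ratio(pieces):
    # Fraction of word pieces carrying the ## continuation prefix.
    return sum(piece.startswith('##') for piece in pieces) / len(pieces)
```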
Similar trends can be found in the sentence length distribution, defined as the number of tokens in a sentence. Here is a comparison for a few cherry-picked languages. The x-axes represent the sentence length in tokens and the y-axes show the proportion of sentences of a given length. Fertility values are listed in parentheses above each plot. The full list in alphabetical order is available here, and sorted by fertility here.
Finally, the prettiest plots show how BERT affects the distribution of token length in the same languages. The bars represent the ratio of N-long BERT word pieces, while the blue curves show the original token length distribution. The y-axes are scaled differently: the bars’ scale is shown on the left and the curves’ scale on the right side of each plot. The full list in alphabetical order is available here, and sorted by fertility here.
Conclusion
I explored BERT’s multilingual vocabulary by itself and through its tokenization of 54 languages that have UD treebanks. I found that the majority of elements in BERT’s vocabulary belong to the European languages, most of them pure ASCII. Examining the output of the BERT tokenizer confirmed that the tokenizer keeps English mostly intact, while it may generate very different token distributions in morphologically rich languages. How closely this tokenization resembles a morphological segmentation remains to be explored.
Code
The script used to extract BERT and UD stats can be found here. This notebook contains all code used to generate the plots. Raw statistics are available as TSV in the same directory.
Acknowledgment
I would like to thank Ofir Press and my Hungarian colleagues for feedback on earlier drafts of this article.