jonasknobloch/mbpe-dyn

Morphologically Biased Byte-Pair Encoding

mbpe-dyn is a research-focused implementation of byte-pair encoding [1]. The training results are compatible with the corresponding implementation in huggingface/tokenizers.

mbpe-dyn extends the byte-pair encoding training algorithm as follows. Subword segmentations, which are a direct result of the trained merge rules, can be aligned to a provided gold segmentation. More specifically, the likelihood of merge rules that don't interfere with the targeted segmentation is increased. This increase can be tuned via a hyperparameter.
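
The idea can be illustrated with a minimal sketch. The weighting formula used by mbpe-dyn is not specified here, so everything below is an assumption made for illustration: the function name `pair_scores`, the parameter `alpha` in [0, 1], and the rule that an occurrence of a pair whose merge would cross a gold boundary is counted with weight `1 - alpha`. With `alpha = 0` the scores reduce to plain byte-pair encoding pair counts.

```python
from collections import Counter


def pair_scores(words, gold, alpha):
    """Score adjacent symbol pairs for the next merge step.

    words: dict mapping a tuple of symbols to the word's corpus frequency.
    gold:  dict mapping the word string to a set of gold boundary offsets
           (character positions inside the word).
    alpha: hypothetical bias strength in [0, 1]; not the project's actual
           parameterization.
    """
    scores = Counter()
    for symbols, freq in words.items():
        boundaries = gold.get("".join(symbols), set())
        offset = 0
        for left, right in zip(symbols, symbols[1:]):
            # Character offset of the seam between `left` and `right`.
            offset += len(left)
            # Merging across a gold boundary would erase it, so that
            # occurrence contributes less to the pair's score.
            crosses = offset in boundaries
            weight = 1.0 - alpha if crosses else 1.0
            scores[(left, right)] += freq * weight
    return scores


if __name__ == "__main__":
    words = {("w", "a", "l", "k", "e", "d"): 10, ("t", "a", "l", "k"): 4}
    gold = {"walked": {4}}  # gold segmentation: walk | ed
    for pair, score in pair_scores(words, gold, alpha=0.5).most_common():
        print(pair, score)
```

In this toy setup the pair `("k", "e")` straddles the gold boundary in "walked", so its score drops as `alpha` grows, while pairs inside a morpheme keep their full count.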

mbpe-dyn was initially created to gain some insights into how well byte-pair encoding approximates morphological boundaries. Literature suggests that byte-pair encoding produces subword boundaries that align poorly with linguistically meaningful reference segmentations [2].

Related Work

Limitations

When a tokenizer with a fertility close to one is used for large language model training, the intermediate subword segmentations produced during tokenization practically do not matter, since they are never passed to the language model. We therefore expect our extension to be more useful in settings where tokenization fertility is clearly above one.

Segmenters

| Segmenter | Description |
| --- | --- |
| static | The static segmenter is intended to be used with a morphological lexicon. Lexicon data can be loaded from .tsv files that match the format of the Morpho Challenge [3] datasets. |
| morfessor | The morfessor segmenter allows segmentation via a Morfessor baseline model. Trained baseline models need to be converted to a binary format using our Protobuf definition. See the morfessor [4, 5] package for details. |
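
As a rough illustration of what a lexicon entry for the static segmenter might look like, the sketch below parses one line of a Morpho Challenge style segmentation file. It assumes a line has the form `word<TAB>surface:label surface:label, alternative analysis`, which is how the Morpho Challenge segmentation gold standards are commonly laid out. The function name `parse_lexicon_line` is hypothetical, and the actual loader in mbpe-dyn may treat labels and alternative analyses differently.

```python
def parse_lexicon_line(line):
    """Parse one lexicon line into (word, segments).

    Assumed format (not confirmed against mbpe-dyn's loader):
        word<TAB>surface:label surface:label, alternative analysis
    Only the first analysis is kept and labels are dropped.
    """
    word, _, analyses = line.rstrip("\n").partition("\t")
    first_analysis = analyses.split(", ")[0]
    # Each item is "surface:label"; keep the surface form only.
    segments = [item.split(":", 1)[0] for item in first_analysis.split()]
    return word, segments


if __name__ == "__main__":
    print(parse_lexicon_line("ablatives\tablative:ablative_A s:+PL\n"))
```

Keeping only the first analysis is a simplification; a lexicon that lists several valid segmentations per word would need a policy for choosing between them.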

Evaluators

| Evaluator | Description |
| --- | --- |
| Fertility | Measures the number of tokens per tokenized word. A fertility of one is ideal, meaning each input word maps to exactly one token. Byte-pair encoding optimizes tokenization fertility by prioritizing frequent subwords. |
| Boundary Precision and Recall | Evaluates word segmentations by matching them against a gold standard, calculating precision and recall. Based on the Morpho Challenge evaluation scripts. A partial port is available in the bpr package. |
| Merge Layer | Instead of only evaluating the final segmentation, this metric examines intermediate segmentations after each merge. It records the number of previous merges at which a morphological boundary was crossed. |
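
The first two metrics can be sketched in a few lines. This is a simplified illustration, not a port of the bpr package: it assumes exactly one gold segmentation per word and micro-averages over all boundaries, whereas the Morpho Challenge scripts also handle alternative analyses. All function names are hypothetical.

```python
def boundaries(segments):
    """Return the set of internal boundary offsets of a segmented word."""
    offsets, position = set(), 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def fertility(tokenized_words):
    """Average number of tokens per word."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


def boundary_precision_recall(predicted, gold):
    """Micro-averaged boundary precision and recall.

    predicted, gold: parallel lists of segmentations (lists of strings)
    for the same words.
    """
    hits = n_predicted = n_gold = 0
    for pred_segments, gold_segments in zip(predicted, gold):
        pred_b = boundaries(pred_segments)
        gold_b = boundaries(gold_segments)
        hits += len(pred_b & gold_b)
        n_predicted += len(pred_b)
        n_gold += len(gold_b)
    # With no boundaries to judge, treat the score as perfect.
    precision = hits / n_predicted if n_predicted else 1.0
    recall = hits / n_gold if n_gold else 1.0
    return precision, recall


if __name__ == "__main__":
    predicted = [["walk", "ed"], ["unh", "appiness"], ["cats"]]
    gold = [["walk", "ed"], ["un", "happi", "ness"], ["cat", "s"]]
    print(fertility(predicted))
    print(boundary_precision_recall(predicted, gold))
```

In the example, only the boundary in "walk|ed" matches the gold standard, so one of the two predicted boundaries is correct and one of the four gold boundaries is found.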

Evaluation Results

| # | Vocabulary | Boundary Precision Recall | Merge Layer | Fertility | Reference Overlap |
| --- | --- | --- | --- | --- | --- |
| 00-en-m000 | 65536 | 0.65, 0.37, 0.47 | 0.78 | 1.07 | 1.00, 1.00 |
| 01-en-m010 | 65536 | 0.66, 0.38, 0.48 | 0.79 | 1.07 | 0.97, 0.92 |
| 02-en-m020 | 65536 | 0.65, 0.39, 0.49 | 0.79 | 1.07 | 0.94, 0.88 |
| 03-en-m030 | 65536 | 0.65, 0.40, 0.49 | 0.80 | 1.07 | 0.92, 0.84 |
| 04-en-m040 | 65536 | 0.65, 0.40, 0.50 | 0.80 | 1.07 | 0.90, 0.81 |
| 05-en-m050 | 65536 | 0.65, 0.41, 0.50 | 0.80 | 1.07 | 0.88, 0.78 |
| 06-en-m060 | 65536 | 0.64, 0.42, 0.51 | 0.81 | 1.07 | 0.85, 0.75 |
| 07-en-m070 | 65536 | 0.64, 0.43, 0.51 | 0.81 | 1.08 | 0.82, 0.72 |
| 08-en-m080 | 65536 | 0.63, 0.43, 0.51 | 0.81 | 1.08 | 0.78, 0.68 |
| 09-en-m090 | 65536 | 0.63, 0.44, 0.52 | 0.82 | 1.09 | 0.74, 0.64 |
| 10-en-m100 | 65536 | 0.62, 0.47, 0.53 | 0.82 | 1.17 | 0.64, 0.56 |

| # | Vocabulary | Boundary Precision Recall | Merge Layer | Fertility | Reference Overlap |
| --- | --- | --- | --- | --- | --- |
| 00-en-m000 | 32768 | 0.54, 0.44, 0.49 | 0.80 | 1.12 | 1.00, 1.00 |
| 01-en-m010 | 32768 | 0.55, 0.45, 0.49 | 0.81 | 1.12 | 0.96, 0.91 |
| 02-en-m020 | 32768 | 0.55, 0.45, 0.50 | 0.81 | 1.12 | 0.94, 0.87 |
| 03-en-m030 | 32768 | 0.56, 0.46, 0.50 | 0.82 | 1.12 | 0.92, 0.84 |
| 04-en-m040 | 32768 | 0.56, 0.46, 0.51 | 0.82 | 1.12 | 0.89, 0.80 |
| 05-en-m050 | 32768 | 0.56, 0.47, 0.51 | 0.82 | 1.12 | 0.87, 0.77 |
| 06-en-m060 | 32768 | 0.56, 0.47, 0.51 | 0.82 | 1.13 | 0.84, 0.74 |
| 07-en-m070 | 32768 | 0.57, 0.48, 0.52 | 0.83 | 1.13 | 0.81, 0.71 |
| 08-en-m080 | 32768 | 0.58, 0.48, 0.52 | 0.83 | 1.14 | 0.78, 0.68 |
| 09-en-m090 | 32768 | 0.57, 0.48, 0.52 | 0.83 | 1.15 | 0.74, 0.64 |
| 10-en-m100 | 32768 | 0.58, 0.49, 0.53 | 0.83 | 1.22 | 0.67, 0.58 |

| # | Vocabulary | Boundary Precision Recall | Merge Layer | Fertility | Reference Overlap |
| --- | --- | --- | --- | --- | --- |
| 00-en-m000 | 16384 | 0.48, 0.48, 0.48 | 0.82 | 1.20 | 1.00, 1.00 |
| 01-en-m010 | 16384 | 0.49, 0.49, 0.49 | 0.83 | 1.20 | 0.96, 0.92 |
| 02-en-m020 | 16384 | 0.49, 0.49, 0.49 | 0.83 | 1.20 | 0.93, 0.87 |
| 03-en-m030 | 16384 | 0.49, 0.50, 0.49 | 0.83 | 1.20 | 0.91, 0.84 |
| 04-en-m040 | 16384 | 0.50, 0.50, 0.50 | 0.83 | 1.20 | 0.88, 0.80 |
| 05-en-m050 | 16384 | 0.50, 0.51, 0.50 | 0.83 | 1.21 | 0.86, 0.77 |
| 06-en-m060 | 16384 | 0.51, 0.51, 0.51 | 0.84 | 1.21 | 0.84, 0.75 |
| 07-en-m070 | 16384 | 0.52, 0.52, 0.52 | 0.84 | 1.22 | 0.81, 0.73 |
| 08-en-m080 | 16384 | 0.52, 0.53, 0.52 | 0.84 | 1.22 | 0.78, 0.69 |
| 09-en-m090 | 16384 | 0.52, 0.53, 0.53 | 0.84 | 1.24 | 0.76, 0.67 |
| 10-en-m100 | 16384 | 0.53, 0.53, 0.53 | 0.84 | 1.28 | 0.71, 0.62 |

Footnotes

  1. Neural Machine Translation of Rare Words with Subword Units

  2. Byte Pair Encoding is Suboptimal for Language Model Pretraining

  3. Morpho Challenge 2005-2010: Evaluations and Results

  4. Unsupervised Discovery of Morphemes

  5. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline
