Morphologically Biased Byte-Pair Encoding

mbpe-dyn is a research-focused implementation of byte-pair encoding¹. The training results are compatible with the corresponding implementation in huggingface/tokenizers.

mbpe-dyn extends the byte-pair encoding training algorithm so that subword segmentations, which are a direct result of the trained merge rules, can be aligned to a provided gold segmentation. More specifically, merge rules that do not interfere with the target segmentation are made more likely to be selected during training. The strength of this bias is controlled by a hyperparameter.
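To make the idea concrete, the following is a minimal sketch of how such a bias could enter pair scoring. It is not taken from the mbpe-dyn source: the types `Pair` and `Word`, the function `ScorePairs`, the parameter `alpha`, and the weighting scheme itself (an occurrence that does not cross a gold boundary counts `1 + alpha` times, a crossing occurrence counts once) are all assumptions made for illustration.

```go
package main

import "fmt"

// Pair is an adjacent pair of symbols that a merge rule would join.
type Pair struct{ Left, Right string }

// Word holds a word as its current symbol sequence, the gold morpheme
// boundaries as byte offsets into the word, and the word's corpus count.
type Word struct {
	Symbols    []string
	Boundaries map[int]bool
	Count      int
}

// ScorePairs returns a biased score for every adjacent symbol pair.
// Hypothetical scheme (an assumption, not mbpe-dyn's actual formula):
// an occurrence whose seam is not a gold boundary is weighted by
// (1 + alpha); an occurrence that would merge across a gold boundary
// keeps weight 1.
func ScorePairs(words []Word, alpha float64) map[Pair]float64 {
	scores := make(map[Pair]float64)
	for _, w := range words {
		offset := 0
		for i := 0; i+1 < len(w.Symbols); i++ {
			// offset is now the seam between symbol i and symbol i+1.
			offset += len(w.Symbols[i])
			weight := 1.0
			if !w.Boundaries[offset] {
				weight += alpha
			}
			pair := Pair{w.Symbols[i], w.Symbols[i+1]}
			scores[pair] += weight * float64(w.Count)
		}
	}
	return scores
}

func main() {
	// "walk|ed" with a gold boundary after byte 4, seen 3 times.
	words := []Word{{
		Symbols:    []string{"w", "a", "l", "k", "e", "d"},
		Boundaries: map[int]bool{4: true},
		Count:      3,
	}}
	scores := ScorePairs(words, 0.5)
	fmt.Println("k+e:", scores[Pair{"k", "e"}]) // crosses the gold boundary
	fmt.Println("e+d:", scores[Pair{"e", "d"}]) // stays inside a morpheme
}
```

With `alpha` set to zero, this sketch reduces to plain frequency counting, which is the behavior of unmodified byte-pair encoding training.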

mbpe-dyn was initially created to gain insight into how well byte-pair encoding approximates morphological boundaries. The literature suggests that byte-pair encoding produces subword boundaries that align poorly with linguistically meaningful reference segmentations².

Related Work

Limitations

When a tokenizer with a fertility close to one is used for large language model training, the intermediate subword segmentations produced during tokenization practically do not matter, since they are not relayed to the language model in any way. We therefore expect our extension to be more useful in settings where tokenization fertility is above one.

Segmenters

Segmenter	Description
`static`	The `static` segmenter is intended to be used with a morphological lexicon. Lexicon data can be loaded from `.tsv` files that match the format of the Morpho Challenge³ datasets.
`morfessor`	The `morfessor` segmenter allows segmentation via a Morfessor baseline model. Trained baseline models need to be converted to a binary format using our Protobuf definition. See the morfessor⁴⁵ package for details.
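As an illustration of what a lexicon lookup for the `static` segmenter might involve, here is a small parsing sketch. It assumes a line layout of `word<TAB>morph:label morph:label, alternative analysis ...`, which is how the Morpho Challenge segmentation files are commonly laid out; that layout, the function name `ParseLexicon`, and the choice to keep only the first analysis are assumptions and may differ from what mbpe-dyn actually does.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ParseLexicon reads lines of the assumed form
//
//	word<TAB>morph:label morph:label, morph:label ...
//
// and returns, for each word, the surface morphs of its first analysis.
// Lines without a tab separator are skipped.
func ParseLexicon(data string) map[string][]string {
	lexicon := make(map[string][]string)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		word, analyses, ok := strings.Cut(scanner.Text(), "\t")
		if !ok {
			continue
		}
		// Alternative analyses are assumed to be comma-separated;
		// only the first one is kept in this sketch.
		first, _, _ := strings.Cut(analyses, ",")
		var morphs []string
		for _, field := range strings.Fields(first) {
			// Drop the label after the colon, keep the surface morph.
			morph, _, _ := strings.Cut(field, ":")
			if morph != "" {
				morphs = append(morphs, morph)
			}
		}
		lexicon[word] = morphs
	}
	return lexicon
}

func main() {
	// Hypothetical sample line, written for this example.
	data := "walked\twalk:walk_V ed:+PAST\n"
	fmt.Println(ParseLexicon(data)["walked"])
}
```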

Evaluators

Evaluator	Description
Fertility	Measures the average number of tokens per tokenized word. A fertility of one is ideal, meaning each input word maps to exactly one token. Byte-pair encoding optimizes tokenization fertility by prioritizing frequent subwords.
Boundary Precision and Recall	Evaluates word segmentations by matching them against a gold standard, calculating precision and recall. Based on the Morpho Challenge evaluation scripts. A partial port is available in the bpr package.
Merge Layer	Instead of only evaluating the final segmentation, this metric examines intermediate segmentations after each merge. It records the number of previous merges at which a morphological boundary was crossed.
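The first two metrics can be sketched in a few lines. The code below is a simplified illustration, not the code in this repository or the Morpho Challenge scripts: it micro-averages boundary counts over all words and ignores alternative gold analyses. The function names `Fertility` and `BoundaryPRF` are chosen for this example.

```go
package main

import "fmt"

// Fertility returns the average number of tokens per word.
func Fertility(tokenized [][]string) float64 {
	if len(tokenized) == 0 {
		return 0
	}
	tokens := 0
	for _, word := range tokenized {
		tokens += len(word)
	}
	return float64(tokens) / float64(len(tokenized))
}

// boundaries returns the internal split points of a segmentation
// as byte offsets into the word.
func boundaries(segments []string) map[int]bool {
	set := make(map[int]bool)
	offset := 0
	for i, s := range segments {
		offset += len(s)
		if i < len(segments)-1 {
			set[offset] = true
		}
	}
	return set
}

// BoundaryPRF compares predicted and gold segmentations of the same words
// (pred[i] and gold[i] must segment the same word) and returns
// micro-averaged boundary precision, recall, and F1.
func BoundaryPRF(pred, gold [][]string) (precision, recall, f1 float64) {
	var hits, predicted, expected int
	for i := range gold {
		p, g := boundaries(pred[i]), boundaries(gold[i])
		predicted += len(p)
		expected += len(g)
		for offset := range p {
			if g[offset] {
				hits++
			}
		}
	}
	if predicted > 0 {
		precision = float64(hits) / float64(predicted)
	}
	if expected > 0 {
		recall = float64(hits) / float64(expected)
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	pred := [][]string{{"walk", "ed"}, {"un", "hap", "py"}}
	gold := [][]string{{"walk", "ed"}, {"un", "happy"}}
	p, r, f := BoundaryPRF(pred, gold)
	fmt.Printf("fertility=%.2f precision=%.2f recall=%.2f f1=%.2f\n",
		Fertility(pred), p, r, f)
}
```

In this example the tokenizer over-segments "unhappy", so every gold boundary is found but one predicted boundary is spurious: recall is perfect while precision drops.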

Evaluation Results

#	Vocabulary	Boundary Precision, Recall, F1	Merge Layer	Fertility	Reference Overlap
00-en-m000	65536	0.65, 0.37, 0.47	0.78	1.07	1.00, 1.00
01-en-m010	65536	0.66, 0.38, 0.48	0.79	1.07	0.97, 0.92
02-en-m020	65536	0.65, 0.39, 0.49	0.79	1.07	0.94, 0.88
03-en-m030	65536	0.65, 0.40, 0.49	0.80	1.07	0.92, 0.84
04-en-m040	65536	0.65, 0.40, 0.50	0.80	1.07	0.90, 0.81
05-en-m050	65536	0.65, 0.41, 0.50	0.80	1.07	0.88, 0.78
06-en-m060	65536	0.64, 0.42, 0.51	0.81	1.07	0.85, 0.75
07-en-m070	65536	0.64, 0.43, 0.51	0.81	1.08	0.82, 0.72
08-en-m080	65536	0.63, 0.43, 0.51	0.81	1.08	0.78, 0.68
09-en-m090	65536	0.63, 0.44, 0.52	0.82	1.09	0.74, 0.64
10-en-m100	65536	0.62, 0.47, 0.53	0.82	1.17	0.64, 0.56

#	Vocabulary	Boundary Precision, Recall, F1	Merge Layer	Fertility	Reference Overlap
00-en-m000	32768	0.54, 0.44, 0.49	0.80	1.12	1.00, 1.00
01-en-m010	32768	0.55, 0.45, 0.49	0.81	1.12	0.96, 0.91
02-en-m020	32768	0.55, 0.45, 0.50	0.81	1.12	0.94, 0.87
03-en-m030	32768	0.56, 0.46, 0.50	0.82	1.12	0.92, 0.84
04-en-m040	32768	0.56, 0.46, 0.51	0.82	1.12	0.89, 0.80
05-en-m050	32768	0.56, 0.47, 0.51	0.82	1.12	0.87, 0.77
06-en-m060	32768	0.56, 0.47, 0.51	0.82	1.13	0.84, 0.74
07-en-m070	32768	0.57, 0.48, 0.52	0.83	1.13	0.81, 0.71
08-en-m080	32768	0.58, 0.48, 0.52	0.83	1.14	0.78, 0.68
09-en-m090	32768	0.57, 0.48, 0.52	0.83	1.15	0.74, 0.64
10-en-m100	32768	0.58, 0.49, 0.53	0.83	1.22	0.67, 0.58

#	Vocabulary	Boundary Precision, Recall, F1	Merge Layer	Fertility	Reference Overlap
00-en-m000	16384	0.48, 0.48, 0.48	0.82	1.20	1.00, 1.00
01-en-m010	16384	0.49, 0.49, 0.49	0.83	1.20	0.96, 0.92
02-en-m020	16384	0.49, 0.49, 0.49	0.83	1.20	0.93, 0.87
03-en-m030	16384	0.49, 0.50, 0.49	0.83	1.20	0.91, 0.84
04-en-m040	16384	0.50, 0.50, 0.50	0.83	1.20	0.88, 0.80
05-en-m050	16384	0.50, 0.51, 0.50	0.83	1.21	0.86, 0.77
06-en-m060	16384	0.51, 0.51, 0.51	0.84	1.21	0.84, 0.75
07-en-m070	16384	0.52, 0.52, 0.52	0.84	1.22	0.81, 0.73
08-en-m080	16384	0.52, 0.53, 0.52	0.84	1.22	0.78, 0.69
09-en-m090	16384	0.52, 0.53, 0.53	0.84	1.24	0.76, 0.67
10-en-m100	16384	0.53, 0.53, 0.53	0.84	1.28	0.71, 0.62
