First, the data needs to be generated; the approach that worked best was splitting with GroupKFold, see examples 1, 2, 3.
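For reference, a minimal sketch of such a split (assumptions: a pandas `train_df` loaded from `data/train.csv`, grouping by the `question_body` column, and the `tgkf_*` file naming used below):

```python
import pickle
from pathlib import Path

import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical sketch: split so that rows sharing the same question
# never appear in both the train and the valid parts of a fold.
train_df = pd.read_csv("data/train.csv")
out_dir = Path("data/folds")
out_dir.mkdir(parents=True, exist_ok=True)

gkf = GroupKFold(n_splits=5)
for fold_idx, (train_ids, valid_ids) in enumerate(
    gkf.split(train_df, groups=train_df["question_body"])
):
    train_df.iloc[train_ids].to_pickle(out_dir / f"tgkf_train_{fold_idx:02d}.pkl")
    train_df.iloc[valid_ids].to_pickle(out_dir / f"tgkf_valid_{fold_idx:02d}.pkl")
```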
I suggest creating a bash file with content like this:
```bash
#!/bin/bash

# fold number to use for loading data
FOLD_IDX=123

# paths to fold data
export TRAIN_PICKLE=data/folds/tgkf_train_${FOLD_IDX}.pkl
export VALID_PICKLE=data/folds/tgkf_valid_${FOLD_IDX}.pkl

# notifications about training progress
export CATALYST_TELEGRAM_TOKEN="<bot token>"
export CATALYST_TELEGRAM_CHAT_ID="<chat id>"

# experiment variables
NAME=<experiment name>
CONF=configs/<config>.yml
LOGDIR=logs/folds/<logdir>

# remove logdir if it already exists
[ -e ${LOGDIR} ] && rm -rf ${LOGDIR} && echo "Removed existing logs dir - '${LOGDIR}'"

# training process
catalyst-dl run --expdir src --logdir ${LOGDIR} --config ${CONF} --verbose
```
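The `TRAIN_PICKLE` / `VALID_PICKLE` variables are expected to be picked up by the experiment code under `src`; a hypothetical sketch of how that might look:

```python
import os

import pandas as pd

# Hypothetical: read the fold paths exported by the training script above
train_df = pd.read_pickle(os.environ["TRAIN_PICKLE"])
valid_df = pd.read_pickle(os.environ["VALID_PICKLE"])
```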
and a bash file for tracing the models:
```bash
#!/bin/bash

CONF=<path to .yml file with config>
INPUT_TYPE='<model input type>'
LOG_DIR_PREFIX=<prefix path to logs directory>
ONAME_PATTERN=<output name pattern>

# trace the best checkpoint of each fold
for FOLD_IDX in 0 1 2; do
    CHECKPOINT=${LOG_DIR_PREFIX}_${FOLD_IDX}/checkpoints/best.pth
    OUTPUT=${ONAME_PATTERN}_${FOLD_IDX}.pt
    python3 -m src.trace --config ${CONF} \
        --state ${CHECKPOINT} \
        --input-type ${INPUT_TYPE} \
        --out ${OUTPUT}
done

echo "Traced all models based on config - ${CONF}."

# pack the traced models into a single archive
ARCHIVE_NAME=${ONAME_PATTERN}.zip
zip -r ${ARCHIVE_NAME} ${ONAME_PATTERN}_*
echo "Compressed to archive - ${ARCHIVE_NAME}."
```
The final submission consists of 8 models: 5 base models and 3 large models, trained on 5 and 3 folds respectively.
At the end of the competition I found a fine-tuning configuration that worked best:
```yaml
model_params:
  model: <model>
  pretrain_dir: <tokenizer>
  num_classes: 30
  pad_token: <pad token index>

stages:
  state_params:
    main_metric: spearman
    minimize_metric: False

  data_params:
    num_workers: &nw 6
    batch_size: *nw
    tokenizer: <tokenizer>

  criterion_params:
    criterion: BCEWithLogitsLoss

  callbacks_params:
    loss:
      callback: CriterionCallback
      input_key: targets
    optim:
      callback: OptimizerCallback
    spearman:
      callback: SpearmanScoreCallback
      classes: 30
    saver:
      callback: CheckpointCallback
    early_stopping:
      callback: EarlyStoppingCallback
      patience: 2
      metric: spearman
      minimize: False
    telegram_logger:
      callback: TelegramLogger
      log_on_stage_start: False
      log_on_loader_start: False
      metric_names:
        - loss
        - spearman

  stage1:
    state_params:
      num_epochs: 10
    optimizer_params:
      optimizer: RAdam
      lr: 0.00001
      betas: &betas [0.9, 0.98]
      eps: &eps 0.000000001

  stage2:
    state_params:
      num_epochs: 5
    optimizer_params:
      optimizer: RAdam
      lr: 0.000001
      betas: *betas
      eps: *eps

  stage3:
    state_params:
      num_epochs: 5
    optimizer_params:
      optimizer: RAdam
      lr: 0.0000001
      betas: *betas
      eps: *eps
```
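Here `main_metric: spearman` is the competition metric, i.e. Spearman's rank correlation computed per target column and averaged over the 30 columns. A rough sketch of that computation (the actual `SpearmanScoreCallback` lives in `src` and may differ in details):

```python
import numpy as np
from scipy.stats import spearmanr

def mean_spearman(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Assumed definition: column-wise Spearman correlation,
    # averaged over the 30 target columns.
    scores = [
        spearmanr(y_true[:, col], y_pred[:, col]).correlation
        for col in range(y_true.shape[1])
    ]
    return float(np.mean(scores))
```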
Config file - bert_base_wit_te_stats_and_cats.yml
Dataset class - FoldTFDCFSF
Folds:
files | score
---|---
tgkf_train_00.pkl, tgkf_train_00.json | 0.3948
tgkf_train_01.pkl, tgkf_train_01.json | 0.3800
tgkf_train_02.pkl, tgkf_train_02.json | 0.3996
tgkf_train_03.pkl, tgkf_train_03.json | 0.3929
tgkf_train_04.pkl, tgkf_train_04.json | 0.3837
Config file - bert_base_wit_te_stats_and_cats.yml
Dataset class - TFDCFSF
Folds:
files | score
---|---
tgkf_train_00.pkl | 0.3901
tgkf_train_01.pkl | 0.3742
tgkf_train_02.pkl | 0.4055
tgkf_train_03.pkl | 0.3941
tgkf_train_04.pkl | 0.3790
Config file - bert_base_cased_with_cfs.yml
Dataset class - TransformerFieldsDataset
Folds:
files | score
---|---
tgkf_train_00.pkl | 0.3855
tgkf_train_01.pkl | 0.3688
tgkf_train_02.pkl | 0.3957
tgkf_train_03.pkl | 0.3827
tgkf_train_04.pkl | 0.3822
Config file - roberta_base_with_cfs.yml
Dataset class - RFDCFSF
files | score
---|---
tgkf_train_00.pkl | 0.3904
tgkf_train_01.pkl | 0.3773
tgkf_train_02.pkl | 0.3993
tgkf_train_03.pkl | 0.3879
tgkf_train_04.pkl | 0.3810
Config file - xlnet_base_cased_less_stages.yml
Dataset class - XLNetDataset
files | score
---|---
tgkf_train_00.pkl | 0.3830
tgkf_train_01.pkl | 0.3754
tgkf_train_02.pkl | 0.3953
tgkf_train_03.pkl | 0.3840
tgkf_train_04.pkl | 0.3744
Overall, the large models were trained in two stages:
```yaml
...
stages:
  state_params:
    main_metric: spearman
    minimize_metric: False

  data_params:
    num_workers: &nw 2
    batch_size: *nw
    tokenizer: <tokenizer>

  criterion_params:
    criterion: BCEWithLogitsLoss

  callbacks_params:
    loss:
      callback: CriterionCallback
      input_key: targets
    optim:
      callback: OptimizerCallback
      accumulation_steps: 16
    spearman:
      callback: SpearmanScoreCallback
      classes: 30
    saver:
      callback: CheckpointCallback
    early_stopping:
      callback: EarlyStoppingCallback
      patience: 3
      metric: spearman
      minimize: False

  stage1:
    state_params:
      num_epochs: 10
    optimizer_params:
      optimizer: Adam
      lr: 0.00003
```
and then the best state was additionally trained:
```yaml
stages:
  state_params:
    main_metric: spearman
    minimize_metric: False

  data_params:
    num_workers: &nw 2
    batch_size: *nw
    tokenizer: <tokenizer>

  criterion_params:
    criterion: BCEWithLogitsLoss

  callbacks_params:
    loss:
      callback: CriterionCallback
      input_key: targets
    optim:
      callback: OptimizerCallback
      accumulation_steps: 16
    spearman:
      callback: SpearmanScoreCallback
      classes: 30
    saver:
      callback: CheckpointCallback
    early_stopping:
      callback: EarlyStoppingCallback
      patience: 3
      metric: spearman
      minimize: False

  stage1:
    state_params:
      num_epochs: 10
    optimizer_params:
      optimizer: Adam
      lr: 0.000005
```
Config files - bert_large_with_cfs.yml -> bert_large_with_cfs_continue.yml
Dataset class - TransformerFieldsDataset
files | score
---|---
tgkf_train_0.pkl | 0.3940
tgkf_train_1.pkl | 0.3908
tgkf_train_2.pkl | 0.3847
Config files - roberta_large_with_cfs.yml -> roberta_large_with_cfs_continue.yml
Dataset class - RFDCFSF
files | score
---|---
tgkf_train_0.pkl | 0.3909
tgkf_train_1.pkl | 0.3878
tgkf_train_2.pkl | 0.3900
Config files - xlnet_large_cased.yml -> xlnet_large_cased_continue.yml
Dataset class - XLNetDataset
files | score
---|---
tgkf_train_0.pkl | 0.3907
tgkf_train_1.pkl | 0.3914
tgkf_train_2.pkl | 0.3829
All model predictions were averaged and then rounded to 2 decimals with additional heuristics:
```python
import math

import pandas as pd

...

def is_stackexchange(url: str) -> bool:
    # rows from these domains get special handling for question_type_spelling
    return ("ell.stackexchange.com" in url) or ("english.stackexchange.com" in url)


def truncate(n, decimals=0):
    # cut off everything after `decimals` decimal places
    multiplier = 10 ** decimals
    return int(n * multiplier) / multiplier


def round_up(n, decimals=0):
    # round up to `decimals` decimal places
    multiplier = 10 ** decimals
    return math.ceil(n * multiplier) / multiplier


def th_round(val, threshold=0.5, decimals=2):
    # round up values above the threshold, truncate the rest
    return round_up(val, decimals) if val > threshold else truncate(val, decimals)

...

sub = pd.read_csv(data_dir / "sample_submission.csv")
sub[targets] = avg_preds
for col in targets:
    # clip predictions away from the exact 0/1 borders
    sub.loc[sub[col] >= 0.99, col] = 0.99
    sub.loc[sub[col] <= 0.01, col] = 0.01
    sub[col] = sub[col].apply(th_round)
sub["question_type_spelling"] = test_df["url"].apply(lambda u: 0.5 if is_stackexchange(u) else 0.0)
```