An exception occurs when I train my own model; could someone please take a look at how to resolve it?
Environment:
Python 3.10.11
ddparser 1.0.8
LAC 2.1.2
paddlepaddle 2.4.2
paddlepaddle-gpu 2.4.2.post117
Data:
train.txt and dev.txt were both cut out of the official test.txt: train.txt is 10 randomly selected sentences, dev.txt is 8 sentences, and test.txt is 2 sentences.
(train.txt is guaranteed to contain at least one symbol that appears at least twice.)
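For context, a minimal sketch of how such a split could be produced, assuming the files are blank-line-separated CoNLL-style blocks; the sentence counts come from the description above and the source file name test.txt.orig is hypothetical:
# Hypothetical split of the official file into train/dev/test
# (one sentence = one blank-line-separated block; awk paragraph mode).
awk 'BEGIN{RS="";ORS="\n\n"} NR<=10'          test.txt.orig > data/baidu/train.txt
awk 'BEGIN{RS="";ORS="\n\n"} NR>10 && NR<=18' test.txt.orig > data/baidu/dev.txt
awk 'BEGIN{RS="";ORS="\n\n"} NR>18 && NR<=20' test.txt.orig > data/baidu/test.txt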
Launch command: sh run_train.sh
Before launching, I edited run_train.sh and added the --punct flag.
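For reference, the training line in run_train.sh after the change, reconstructed from the command echoed by the script in the log below (only --punct is new; line breaks added here for readability):
python -u run.py --mode=train --use_cuda --feat=none --preprocess \
    --model_files=model_files/baidu \
    --train_data_path=data/baidu/train.txt \
    --valid_data_path=data/baidu/dev.txt \
    --test_data_path=data/baidu/test.txt \
    --encoding_model=ernie-lstm --buckets=15 --punct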
Run output:
(paddle_env) sh run_train.sh
+ python -u run.py --mode=train --use_cuda --feat=none --preprocess --model_files=model_files/baidu --train_data_path=data/baidu/train.txt --valid_data_path=data/baidu/dev.txt --test_data_path=data/baidu/test.txt --encoding_model=ernie-lstm --buckets=15 --punct
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
[2023-06-19 18:39:44,412] [ INFO] config.py:214 - Preprocess the data
[2023-06-19 18:39:44,412] [ INFO] tokenizing_ernie.py:92 - get pretrain dir from https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz
[2023-06-19 18:39:44,422] [ INFO] config.py:273 - dumping fileds to disk.
[2023-06-19 18:39:44,436] [ INFO] run.py:480 - Override the default configs
--------------------------+--------------------------
Param | Value
--------------------------+--------------------------
n_embed | 300
embed_dropout | 0.33
n_mlp_arc | 500
n_mlp_rel | 100
mlp_dropout | 0.33
n_feat_embed | 60
n_char_embed | 50
n_lstm_feat_embed | 100
n_lstm_hidden | 300
n_tran_hidden | 300
n_lstm_layers | 3
lstm_dropout | 0.33
n_tran_feat_embed | 120
n_tran_feat_head | 12
n_tran_feat_layer | 2
n_tran_word_head | 12
n_tran_word_layer | 3
warmup_proportion | 0.1
weight_decay | 0.01
lstm_by_wp_embed_size | 200
lstm_lr | 0.002
ernie_lr | 5e-05
mu | 0.9
nu | 0.9
epsilon | 1e-12
decay | 0.75
decay_steps | 5000
epochs | 50000
patience | 30
min_freq | 2
fix_len | 20
clip | 1.0
mode | train
config_path | config.ini
model_files | model_files/baidu
train_data_path | data/baidu/train.txt
valid_data_path | data/baidu/dev.txt
test_data_path | data/baidu/test.txt
infer_data_path | None
batch_size | 1000
log_path | ./log/log
log_level | INFO
infer_result_path | infer_result
use_cuda | True
preprocess | True
use_data_parallel | False
seed | 1
threads | 16
tree | False
prob | False
feat | none
encoding_model | ernie-lstm
buckets | 15
punct | True
None | False
nranks | 1
local_rank | 0
fields_path | model_files/baidu/fields
model_path | model_files/baidu/model
ernie_vocabs_size | 17964
n_words | 17964
n_feats | None
n_rels | 12
pad_index | 0
unk_index | 17963
bos_index | 1
eos_index | 2
feat_pad_index | None
--------------------------+--------------------------
[2023-06-19 18:39:44,437] [ INFO] run.py:481 - (word): ErnieField(pad=[PAD], unk=[UNK], bos=[CLS], eos=[SEP])
None
(head): Field(bos=<bos>, eos=<eos>, use_vocab=False)
(deprel): Field(bos=<bos>, eos=<eos>)
[2023-06-19 18:39:44,437] [ INFO] run.py:482 - Set the max num of threads to 16
[2023-06-19 18:39:44,437] [ INFO] run.py:483 - Set the seed for generating random numbers to 1
[2023-06-19 18:39:44,437] [ INFO] run.py:484 - Run the subcommand in mode train
[2023-06-19 18:39:44,437] [ INFO] run.py:71 - loading data.
[2023-06-19 18:39:44,437] [ INFO] run.py:75 - init dataset.
[2023-06-19 18:39:44,440] [ INFO] run.py:79 - set the data loaders.
[2023-06-19 18:39:44,440] [ INFO] run.py:84 - train: 18 sentences, 7 batches, 7 buckets
[2023-06-19 18:39:44,440] [ INFO] run.py:86 - dev: 7 sentences, 4 batches, 4 buckets
[2023-06-19 18:39:44,440] [ INFO] run.py:88 - test: 1 sentences, 1 batches, 1 buckets
[2023-06-19 18:39:44,440] [ INFO] run.py:91 - Create the model
W0619 18:39:44.442440 248516 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.7
W0619 18:39:44.448432 248516 gpu_resources.cc:91] device: 0, cuDNN Version: 8.5.
[2023-06-19 18:39:45,551] [ INFO] run.py:134 - start training.
[2023-06-19 18:39:45,551] [ INFO] run.py:139 - Epoch 1 / 50000:
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/paddle/fluid/dygraph/math_op_patch.py:275: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int64, but right dtype is paddle.int32, the right dtype will convert to paddle.int64
warnings.warn(
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/paddle/fluid/framework.py:4002: DeprecationWarning: Op `cumsum` is executed through `append_op` under the dynamic mode, the corresponding API implementation needs to be upgraded to using `_C_ops` method.
warnings.warn(
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/paddle/fluid/dygraph/math_op_patch.py:275: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int64, but right dtype is paddle.bool, the right dtype will convert to paddle.int64
warnings.warn(
Could not load library libcudnn_adv_train.so.8. Error: /home/haipi/.conda/envs/paddle_env/bin/../lib/libcudnn_ops_train.so.8: symbol _ZN5cudnn3ops26JoinInternalPriorityStreamEP12cudnnContexti, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 rnn_dygraph_function(paddle::experimental::Tensor const&, std::vector<paddle::experimental::Tensor, std::allocator<paddle::experimental::Tensor> > const&, std::vector<paddle::experimental::Tensor, std::allocator<paddle::experimental::Tensor> > const&, paddle::experimental::Tensor const&, paddle::experimental::Tensor*, unsigned long, paddle::framework::AttributeMap const&)
1 paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameTensorMap const&, paddle::imperative::NameTensorMap const&, paddle::framework::AttributeMap&, phi::Place const&, paddle::framework::AttributeMap*, bool, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
2 void paddle::imperative::Tracer::TraceOpImpl<egr::EagerVariable>(std::string const&, paddle::imperative::details::NameVarMapTrait<egr::EagerVariable>::Type const&, paddle::imperative::details::NameVarMapTrait<egr::EagerVariable>::Type const&, paddle::framework::AttributeMap&, phi::Place const&, bool, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&, paddle::framework::AttributeMap*, bool)
3 paddle::imperative::PreparedOp::Run(paddle::imperative::NameTensorMap const&, paddle::imperative::NameTensorMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
4 phi::KernelImpl<void (*)(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*), &(void phi::RnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*))>::Compute(phi::KernelContext*)
5 void phi::KernelImpl<void (*)(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*), &(void phi::RnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*))>::KernelCallHelper<paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*, phi::TypeTag<int> >::Compute<1, 3, 0, 0, phi::GPUContext const, phi::DenseTensor const, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> >, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > >(phi::KernelContext*, phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> >&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> >&)
6 void phi::RnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*)
7 cudnnRNNForwardTrainingEx
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1687171186 (unix time) try "date -d @1687171186" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x3ea0003cac4) received by PID 248516 (TID 0x7f6d72607740) from PID 248516 ***]
run_train.sh: line 19: 248516 Aborted python -u run.py --mode=train --use_cuda --feat=none --preprocess --model_files=model_files/baidu --train_data_path=data/baidu/train.txt --valid_data_path=data/baidu/dev.txt --test_data_path=data/baidu/test.txt --encoding_model=ernie-lstm --buckets=15 --punct
(paddle_env)
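The failure seems to begin at the "Could not load library libcudnn_adv_train.so.8" line, which usually points to two different cuDNN builds being mixed, here apparently via the copy under the conda env's lib directory. A sketch of how one might check (paths are taken from the log above; the diagnosis itself is an assumption):
# List the cuDNN copies inside the conda env and those visible system-wide.
ls /home/haipi/.conda/envs/paddle_env/lib/libcudnn*
ldconfig -p | grep libcudnn
# See which libcudnn_ops_infer.so.8 the failing library actually resolves to.
ldd /home/haipi/.conda/envs/paddle_env/lib/libcudnn_ops_train.so.8 | grep cudnn
# Basic GPU sanity check for this Paddle install.
python -c "import paddle; paddle.utils.run_check()"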