An exception occurs when I train my own model; could someone please take a look at how to resolve it?
Environment:
Python 3.10.11
ddparser 1.0.8
LAC 2.1.2
paddlepaddle 2.4.2
paddlepaddle-gpu 2.4.2.post117
Data:
train.txt and dev.txt were both cut out of the official test.txt: train.txt is 10 randomly selected sentences, dev.txt is 8 sentences, and test.txt is 2 sentences.
(train.txt is guaranteed to contain at least one symbol that appears at least twice.)
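For context, a minimal sketch of how such a split could be produced, assuming the files are blank-line-separated CoNLL-style blocks; the sentence counts come from the description above and the source file name test.txt.orig is hypothetical:
# Hypothetical split of the official file into train/dev/test
# (one sentence = one blank-line-separated block; awk paragraph mode).
awk 'BEGIN{RS="";ORS="\n\n"} NR<=10'          test.txt.orig > data/baidu/train.txt
awk 'BEGIN{RS="";ORS="\n\n"} NR>10 && NR<=18' test.txt.orig > data/baidu/dev.txt
awk 'BEGIN{RS="";ORS="\n\n"} NR>18 && NR<=20' test.txt.orig > data/baidu/test.txt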
Launch command: sh run_train.sh
Before launching, I edited run_train.sh and added the --punct flag.
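For reference, the training line in run_train.sh after the change, reconstructed from the command echoed by the script in the log below (only --punct is new; line breaks added here for readability):
python -u run.py --mode=train --use_cuda --feat=none --preprocess \
    --model_files=model_files/baidu \
    --train_data_path=data/baidu/train.txt \
    --valid_data_path=data/baidu/dev.txt \
    --test_data_path=data/baidu/test.txt \
    --encoding_model=ernie-lstm --buckets=15 --punct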
Run output:
(paddle_env) sh run_train.sh
+ python -u run.py --mode=train --use_cuda --feat=none --preprocess --model_files=model_files/baidu --train_data_path=data/baidu/train.txt --valid_data_path=data/baidu/dev.txt --test_data_path=data/baidu/test.txt --encoding_model=ernie-lstm --buckets=15 --punct
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
[2023-06-19 18:39:44,412] [ INFO] config.py:214 - Preprocess the data
[2023-06-19 18:39:44,412] [ INFO] tokenizing_ernie.py:92 - get pretrain dir from https://ernie-github.cdn.bcebos.com/model-ernie1.0.1.tar.gz
[2023-06-19 18:39:44,422] [ INFO] config.py:273 - dumping fileds to disk.
[2023-06-19 18:39:44,436] [ INFO] run.py:480 - Override the default configs
--------------------------+--------------------------
Param | Value
--------------------------+--------------------------
n_embed | 300
embed_dropout | 0.33
n_mlp_arc | 500
n_mlp_rel | 100
mlp_dropout | 0.33
n_feat_embed | 60
n_char_embed | 50
n_lstm_feat_embed | 100
n_lstm_hidden | 300
n_tran_hidden | 300
n_lstm_layers | 3
lstm_dropout | 0.33
n_tran_feat_embed | 120
n_tran_feat_head | 12
n_tran_feat_layer | 2
n_tran_word_head | 12
n_tran_word_layer | 3
warmup_proportion | 0.1
weight_decay | 0.01
lstm_by_wp_embed_size | 200
lstm_lr | 0.002
ernie_lr | 5e-05
mu | 0.9
nu | 0.9
epsilon | 1e-12
decay | 0.75
decay_steps | 5000
epochs | 50000
patience | 30
min_freq | 2
fix_len | 20
clip | 1.0
mode | train
config_path | config.ini
model_files | model_files/baidu
train_data_path | data/baidu/train.txt
valid_data_path | data/baidu/dev.txt
test_data_path | data/baidu/test.txt
infer_data_path | None
batch_size | 1000
log_path | ./log/log
log_level | INFO
infer_result_path | infer_result
use_cuda | True
preprocess | True
use_data_parallel | False
seed | 1
threads | 16
tree | False
prob | False
feat | none
encoding_model | ernie-lstm
buckets | 15
punct | True
None | False
nranks | 1
local_rank | 0
fields_path | model_files/baidu/fields
model_path | model_files/baidu/model
ernie_vocabs_size | 17964
n_words | 17964
n_feats | None
n_rels | 12
pad_index | 0
unk_index | 17963
bos_index | 1
eos_index | 2
feat_pad_index | None
--------------------------+--------------------------
[2023-06-19 18:39:44,437] [ INFO] run.py:481 - (word): ErnieField(pad=[PAD], unk=[UNK], bos=[CLS], eos=[SEP])
None
(head): Field(bos=<bos>, eos=<eos>, use_vocab=False)
(deprel): Field(bos=<bos>, eos=<eos>)
[2023-06-19 18:39:44,437] [ INFO] run.py:482 - Set the max num of threads to 16
[2023-06-19 18:39:44,437] [ INFO] run.py:483 - Set the seed for generating random numbers to 1
[2023-06-19 18:39:44,437] [ INFO] run.py:484 - Run the subcommand in mode train
[2023-06-19 18:39:44,437] [ INFO] run.py:71 - loading data.
[2023-06-19 18:39:44,437] [ INFO] run.py:75 - init dataset.
[2023-06-19 18:39:44,440] [ INFO] run.py:79 - set the data loaders.
[2023-06-19 18:39:44,440] [ INFO] run.py:84 - train: 18 sentences, 7 batches, 7 buckets
[2023-06-19 18:39:44,440] [ INFO] run.py:86 - dev: 7 sentences, 4 batches, 4 buckets
[2023-06-19 18:39:44,440] [ INFO] run.py:88 - test: 1 sentences, 1 batches, 1 buckets
[2023-06-19 18:39:44,440] [ INFO] run.py:91 - Create the model
W0619 18:39:44.442440 248516 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.7
W0619 18:39:44.448432 248516 gpu_resources.cc:91] device: 0, cuDNN Version: 8.5.
[2023-06-19 18:39:45,551] [ INFO] run.py:134 - start training.
[2023-06-19 18:39:45,551] [ INFO] run.py:139 - Epoch 1 / 50000:
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/paddle/fluid/dygraph/math_op_patch.py:275: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int64, but right dtype is paddle.int32, the right dtype will convert to paddle.int64
warnings.warn(
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/paddle/fluid/framework.py:4002: DeprecationWarning: Op `cumsum` is executed through `append_op` under the dynamic mode, the corresponding API implementation needs to be upgraded to using `_C_ops` method.
warnings.warn(
/home/haipi/.conda/envs/paddle_env/lib/python3.10/site-packages/paddle/fluid/dygraph/math_op_patch.py:275: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int64, but right dtype is paddle.bool, the right dtype will convert to paddle.int64
warnings.warn(
Could not load library libcudnn_adv_train.so.8. Error: /home/haipi/.conda/envs/paddle_env/bin/../lib/libcudnn_ops_train.so.8: symbol _ZN5cudnn3ops26JoinInternalPriorityStreamEP12cudnnContexti, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 rnn_dygraph_function(paddle::experimental::Tensor const&, std::vector<paddle::experimental::Tensor, std::allocator<paddle::experimental::Tensor> > const&, std::vector<paddle::experimental::Tensor, std::allocator<paddle::experimental::Tensor> > const&, paddle::experimental::Tensor const&, paddle::experimental::Tensor*, unsigned long, paddle::framework::AttributeMap const&)
1 paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameTensorMap const&, paddle::imperative::NameTensorMap const&, paddle::framework::AttributeMap&, phi::Place const&, paddle::framework::AttributeMap*, bool, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
2 void paddle::imperative::Tracer::TraceOpImpl<egr::EagerVariable>(std::string const&, paddle::imperative::details::NameVarMapTrait<egr::EagerVariable>::Type const&, paddle::imperative::details::NameVarMapTrait<egr::EagerVariable>::Type const&, paddle::framework::AttributeMap&, phi::Place const&, bool, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&, paddle::framework::AttributeMap*, bool)
3 paddle::imperative::PreparedOp::Run(paddle::imperative::NameTensorMap const&, paddle::imperative::NameTensorMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
4 phi::KernelImpl<void (*)(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*), &(void phi::RnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*))>::Compute(phi::KernelContext*)
5 void phi::KernelImpl<void (*)(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*), &(void phi::RnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*))>::KernelCallHelper<paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*, phi::TypeTag<int> >::Compute<1, 3, 0, 0, phi::GPUContext const, phi::DenseTensor const, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> >, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > >(phi::KernelContext*, phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> >&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> >&)
6 void phi::RnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, paddle::optional<phi::DenseTensor> const&, float, bool, int, int, int, std::string const&, int, bool, phi::DenseTensor*, phi::DenseTensor*, std::vector<phi::DenseTensor*, std::allocator<phi::DenseTensor*> >, phi::DenseTensor*)
7 cudnnRNNForwardTrainingEx
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1687171186 (unix time) try "date -d @1687171186" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x3ea0003cac4) received by PID 248516 (TID 0x7f6d72607740) from PID 248516 ***]
run_train.sh: line 19: 248516 Aborted python -u run.py --mode=train --use_cuda --feat=none --preprocess --model_files=model_files/baidu --train_data_path=data/baidu/train.txt --valid_data_path=data/baidu/dev.txt --test_data_path=data/baidu/test.txt --encoding_model=ernie-lstm --buckets=15 --punct
(paddle_env)
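The failure seems to begin at the "Could not load library libcudnn_adv_train.so.8" line, which usually points to two different cuDNN builds being mixed, here apparently via the copy under the conda env's lib directory. A sketch of how one might check (paths are taken from the log above; the diagnosis itself is an assumption):
# List the cuDNN copies inside the conda env and those visible system-wide.
ls /home/haipi/.conda/envs/paddle_env/lib/libcudnn*
ldconfig -p | grep libcudnn
# See which libcudnn_ops_infer.so.8 the failing library actually resolves to.
ldd /home/haipi/.conda/envs/paddle_env/lib/libcudnn_ops_train.so.8 | grep cudnn
# Basic GPU sanity check for this Paddle install.
python -c "import paddle; paddle.utils.run_check()"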