Closed
Description
Reminder
- I have read the README and searched the existing issues.
System Info
Training Qwen1.5-1.8B with DeepSpeed: training works fine with ZeRO-2, but fails with an error when using ZeRO-3.
Reproduction
Command executed:
CUDA_VISIBLE_DEVICES=5,6 llamafactory-cli train examples/lora_multi_gpu/qwen_lora_dpo_ds.yaml
Training config file:
### model
model_name_or_path: /home/Qwen1.5-1.8B
### method
stage: dpo
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj
### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
dataset: comparison_gpt4_zh
template: qwen
cutoff_len: 1024
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/Qwen1.5-1.8B/lora/dpo
logging_steps: 100
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 5.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
### eval
val_size: 0.1
per_device_eval_batch_size: 2
eval_strategy: steps
eval_steps: 500
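The referenced examples/deepspeed/ds_z3_config.json is not reproduced above. For context, a typical ZeRO-3 config in the style of the LLaMA-Factory examples looks roughly like the sketch below; the exact field values here are assumptions, not the reporter's actual file:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto"
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

With stage 3, parameters are partitioned across ranks and gathered on demand during forward passes, which is the code path where the assertion below fires.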
Expected behavior
The error is as follows:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/LLaMA-Factory/src/llamafactory/launcher.py", line 9, in <module>
[rank0]: launch()
[rank0]: File "/home/LLaMA-Factory/src/llamafactory/launcher.py", line 5, in launch
[rank0]: run_exp()
[rank0]: File "/home/LLaMA-Factory/src/llamafactory/train/tuner.py", line 39, in run_exp
[rank0]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]: File "/home/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 64, in run_dpo
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1885, in train
[rank0]: return inner_training_loop(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2216, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3238, in training_step
[rank0]: loss = self.compute_loss(model, inputs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1081, in compute_loss
[rank0]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank0]: File "/home/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 207, in get_batch_loss_metrics
[rank0]: reference_chosen_logps, reference_rejected_logps = self.compute_reference_log_probs(model, batch)
[rank0]: File "/home/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 185, in compute_reference_log_probs
[rank0]: reference_chosen_logps, reference_rejected_logps, *_ = self.concatenated_forward(ref_model, batch)
[rank0]: File "/home/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 156, in concatenated_forward
[rank0]: all_logits: "torch.Tensor" = model(**batch, return_dict=True, use_cache=False).logits.to(torch.float32)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1852, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1430, in forward
[rank0]: return self.base_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 179, in forward
[rank0]: return self.model.forward(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1149, in forward
[rank0]: outputs = self.model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 978, in forward
[rank0]: inputs_embeds = self.embed_tokens(input_ids)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1571, in _call_impl
[rank0]: args_result = hook(self, args)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
[rank0]: self.pre_sub_module_forward_function(module)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
[rank0]: param_coordinator.fetch_sub_module(sub_module, forward=True)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 316, in fetch_sub_module
[rank0]: assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
[rank0]: AssertionError: {'id': 0, 'status': 'INFLIGHT', 'numel': 311164928, 'ds_numel': 311164928, 'shape': (151936, 2048), 'ds_shape': (151936, 2048), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {4}, 'ds_tensor.shape': torch.Size([155582464])}
Others
No response