Closed
Description
Reminder
- I have read the README and searched the existing issues.
Reproduction
I launched full-parameter training of LLAMA3-70B with the command ./train.sh, using 3 A100-SXM4-40GB GPUs. Below is the content of train.sh:
#!/bin/bash
NPROC_PER_NODE=3
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500
CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
--nproc_per_node $NPROC_PER_NODE \
--nnodes $NNODES \
--node_rank $RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
../llama/src/train.py llama3_sft_multi.yaml
Below is the content of llama3_sft_multi.yaml, where model_name_or_path is set to a local model. That model was obtained by converting the pth files of the LLAMA3-Instruct model downloaded from Meta's website with the transformers conversion script:
### model
model_name_or_path: /docker/llama3_70b_instruct
### method
stage: sft
do_train: true
finetuning_type: full
### ddp
ddp_timeout: 180000000
deepspeed: deepspeed_z3_config.json
### dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: /docker/llama3_70b_sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true
### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
Below is the content of deepspeed_z3_config.json:
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
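For context, the "auto" entries above are filled in from the TrainingArguments by the Trainer's DeepSpeed integration, so with the YAML above train_micro_batch_size_per_gpu should resolve to 1, gradient_accumulation_steps to 2, and the effective train_batch_size to 1 × 2 × 3 GPUs = 6.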
Running ./train.sh produces the following error:
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING]
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:11,586] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,595] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,599] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2024-05-31 12:38:13,327] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-31 12:38:13,327] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2024-05-31 12:38:13,451] [INFO] [comm.py:637:init_distributed] cdb=None
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
[2024-05-31 12:38:13,458] [INFO] [comm.py:637:init_distributed] cdb=None
Traceback (most recent call last):
File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 14, in <module>
main()
File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 5, in main
run_exp()
File "/home/student_zyz/Desktop/llama/src/llamafactory/train/tuner.py", line 28, in run_exp
model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
^^^^^^^^^^^^^^^^^^^^
File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 126, in get_train_args
model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 112, in _parse_train_args
return _parse_args(parser, args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 42, in _parse_args
return parser.parse_yaml_file(os.path.abspath(sys.argv[1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 423, in parse_yaml_file
outputs = self.parse_dict(yaml.safe_load(Path(yaml_file).read_text()), allow_extra_keys=allow_extra_keys)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
obj = dtype(**inputs)
^^^^^^^^^^^^^^^
File "<string>", line 133, in __init__
File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py", line 1801, in __post_init__
raise ValueError("warmup_steps must be either 0 or > 1")
ValueError: warmup_steps must be either 0 or > 1
(the same traceback is printed by the other two worker processes)
[2024-05-31 12:38:17,477] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4060611) of binary: /usr/bin/python
Traceback (most recent call last):
File "/home/student_zyz/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
../llama/src/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-05-31_12:38:17
host : edaserver01
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 4060612)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-05-31_12:38:17
host : edaserver01
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 4060613)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-31_12:38:17
host : edaserver01
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 4060611)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
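For reference, the traceback points at the warmup_steps: 0.1 line in the YAML above: the TrainingArguments check rejects fractional values because warmup_steps is an integer step count, and a fractional warmup is expressed through warmup_ratio instead. A minimal sketch of the corrected lines, assuming the rest of the YAML stays unchanged (the step count below is only an illustrative value):

warmup_ratio: 0.1
# or, equivalently, an integer number of warmup steps:
# warmup_steps: 100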
Expected behavior
Full-parameter training of LLAMA3-70B on three GPUs.
System Info
- transformers version: 4.42.0.dev0
- Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.31
- Python version: 3.11.9
- Huggingface_hub version: 0.23.1
- Safetensors version: 0.4.3
- Accelerate version: 0.29.3
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Others
No response