
Single-node multi-GPU full-parameter training of LLAMA3 fails with "warmup_steps must be either 0 or > 1" #4005

Closed
@ZhuYanzhen1

Description

Reminder

  • I have read the README and searched the existing issues.

Reproduction

I launched full-parameter training of LLAMA3-70B with the command ./train.sh, using three A100-SXM4-40GB GPUs. The contents of train.sh are as follows:

#!/bin/bash

NPROC_PER_NODE=3
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500

CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
        --nproc_per_node $NPROC_PER_NODE \
        --nnodes $NNODES \
        --node_rank $RANK \
        --master_addr $MASTER_ADDR \
        --master_port $MASTER_PORT \
        ../llama/src/train.py llama3_sft_multi.yaml
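
A quick sanity check that the three GPUs exposed by CUDA_VISIBLE_DEVICES=0,1,2 are actually visible to PyTorch before launching torchrun (a minimal sketch, assuming a CUDA-enabled torch build; it is not part of LLaMA-Factory):

import torch

# CUDA_VISIBLE_DEVICES=0,1,2 should expose exactly three devices,
# matching --nproc_per_node 3 in train.sh.
print(torch.cuda.device_count())          # expected: 3
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))  # e.g. NVIDIA A100-SXM4-40GB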

Below is the content of llama3_sft_multi.yaml. I set model_name_or_path to a local model, obtained by downloading the LLAMA3-Instruct .pth checkpoint from Meta's website and converting it with the transformers conversion script:

### model
model_name_or_path: /docker/llama3_70b_instruct

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: deepspeed_z3_config.json

### dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /docker/llama3_70b_sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
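
For reference, warmup_steps is the field that the parser rejects below: in Hugging Face TrainingArguments it is an integer number of warmup steps (0 or greater than 1), while a fraction of the total steps is expressed through warmup_ratio. A minimal sketch of the two accepted forms, assuming transformers and accelerate are installed (output_dir is a placeholder):

from transformers import TrainingArguments

# An integer step count is accepted; 0.1 is neither 0 nor > 1 and is rejected
# in TrainingArguments.__post_init__.
args_steps = TrainingArguments(output_dir="/tmp/sft_out", warmup_steps=100)

# A fractional warmup (here 10% of total training steps) goes through warmup_ratio.
args_ratio = TrainingArguments(output_dir="/tmp/sft_out", warmup_ratio=0.1)

In the YAML above this corresponds to replacing warmup_steps: 0.1 with either warmup_ratio: 0.1 or an integer value such as warmup_steps: 100.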

Below is the content of deepspeed_z3_config.json:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
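
As a note, the "auto" batch-size fields in this DeepSpeed config are filled in from the training arguments by the Hugging Face integration; for the values used here the effective global batch size works out as follows (illustrative arithmetic only, not LLaMA-Factory code):

# How DeepSpeed's "auto" batch-size fields resolve for this run (sketch):
per_device_train_batch_size = 1   # -> train_micro_batch_size_per_gpu
gradient_accumulation_steps = 2   # -> gradient_accumulation_steps
world_size = 3                    # NPROC_PER_NODE in train.sh

train_batch_size = per_device_train_batch_size * gradient_accumulation_steps * world_size
print(train_batch_size)           # 1 * 2 * 3 = 6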

Running ./train.sh produces the following error:

[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] 
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:11,586] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,595] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,599] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,327] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-31 12:38:13,327] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,451] [INFO] [comm.py:637:init_distributed] cdb=None
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,458] [INFO] [comm.py:637:init_distributed] cdb=None
Traceback (most recent call last):
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 14, in <module>
    main()
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 5, in main
    run_exp()
  File "/home/student_zyz/Desktop/llama/src/llamafactory/train/tuner.py", line 28, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 126, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 112, in _parse_train_args
    return _parse_args(parser, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 42, in _parse_args
    return parser.parse_yaml_file(os.path.abspath(sys.argv[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 423, in parse_yaml_file
    outputs = self.parse_dict(yaml.safe_load(Path(yaml_file).read_text()), allow_extra_keys=allow_extra_keys)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 133, in __init__
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py", line 1801, in __post_init__
    raise ValueError("warmup_steps must be either 0 or > 1")
ValueError: warmup_steps must be either 0 or > 1
[The identical traceback from the other two ranks is omitted here.]
[2024-05-31 12:38:17,477] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4060611) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/home/student_zyz/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../llama/src/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 4060612)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 4060613)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4060611)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
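
For what it's worth, the failure is independent of torchrun and DeepSpeed: every rank parses the same YAML and hits the same validation in TrainingArguments, so the error can be reproduced in a plain Python session (a minimal sketch, assuming the transformers and accelerate versions from the system info below):

from transformers import TrainingArguments

# With transformers 4.42.0.dev0 this raises in __post_init__:
#   ValueError: warmup_steps must be either 0 or > 1
TrainingArguments(output_dir="/tmp/sft_out", warmup_steps=0.1)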

Expected behavior

Full-parameter training of LLAMA3-70B using three GPUs.

System Info

  • transformers version: 4.42.0.dev0
  • Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.31
  • Python version: 3.11.9
  • Huggingface_hub version: 0.23.1
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Others

No response

Labels

solved (This problem has been already solved)
