Long context full SFT validation causes OOM #7041

Open
@Yixi-Rao

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

I am running long-context full SFT. Training itself fits in memory with the following settings:

bf16: true
gradient_checkpointing: true
disable_gradient_checkpointing: false
enable_liger_kernel: true
use_unsloth_gc: true
flash_attn: fa2
torch_empty_cache_steps: 10

However, OOM occurs during the validation stage, even though the eval batch size is already set to 1:

# eval
val_size: 0.02
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 24

I need to run validation during SFT because of the specific task I am fine-tuning for.
Is there any way to avoid this OOM during validation?
Thanks in advance!
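
For scale, here is a rough estimate of how large a full logits tensor would be for one sequence at the configured cutoff_len. This is only a sketch under assumptions: the vocabulary size of 152,000 is an assumed round figure for a Qwen-family tokenizer (the model path is not given above), and it is not confirmed that the evaluation path actually materializes the full logits. If it does, this tensor alone could account for much of the extra memory compared with training.

```python
def logits_bytes(seq_len: int, vocab_size: int, bytes_per_element: int) -> int:
    """Size in bytes of a [1, seq_len, vocab_size] logits tensor."""
    return seq_len * vocab_size * bytes_per_element


CUTOFF_LEN = 120_000          # cutoff_len from the config in this issue
ASSUMED_VOCAB_SIZE = 152_000  # assumption: approximate Qwen-family vocabulary size

GIB = 2**30

# bf16 stores 2 bytes per element, fp32 stores 4.
bf16_size = logits_bytes(CUTOFF_LEN, ASSUMED_VOCAB_SIZE, 2)
fp32_size = logits_bytes(CUTOFF_LEN, ASSUMED_VOCAB_SIZE, 4)

print(f"bf16 logits for one sequence: {bf16_size / GIB:.1f} GiB")
print(f"fp32 logits for one sequence: {fp32_size / GIB:.1f} GiB")
```

Shorter validation samples would shrink this proportionally, since the size is linear in sequence length.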

Reproduction

# model
model_name_or_path:

# method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

# dataset
dataset:
template: qwen
cutoff_len: 120000
overwrite_cache: true
preprocessing_num_workers: 90

# output
output_dir:
report_to: tensorboard
logging_dir:
logging_steps: 1
save_steps: 190
plot_loss: true
overwrite_output_dir: true

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-6
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
max_grad_norm: 1.0
bf16: true
gradient_checkpointing: true
disable_gradient_checkpointing: false
enable_liger_kernel: true
use_unsloth_gc: true
flash_attn: fa2
torch_empty_cache_steps: 10
ddp_timeout: 180000000
save_only_model: true

# eval
val_size: 0.02
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 24

Others

No response
