Full + reward fine-tuning: adding --save_safetensors False causes pytorch_model.bin to be deleted #5305

Closed
@aistream69

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

In full + reward mode, fine-tuning Qwen1.5-0.5B-Chat fails with the following error unless --save_safetensors False is passed:
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'model.embed_tokens.weight', 'lm_head.weight'}].
With --save_safetensors False the error no longer occurs, but when the model is saved, the call os.remove(path_to_checkpoint) inside the function fix_valuehead_checkpoint in src/llamafactory/train/callbacks.py deletes pytorch_model.bin, so the saved model cannot be used. How can this be resolved? Thanks.
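
To confirm what is actually left on disk after a save, a small check along these lines may help. It is only a diagnostic sketch, not part of LLaMA-Factory: the helper name list_weight_files is hypothetical, and apart from pytorch_model.bin (named in this report) the candidate file names are assumptions that may differ between versions. Sharded checkpoints are not covered.

```python
import os
import tempfile

# Only pytorch_model.bin is named in this report. The other names are
# assumptions about what a save might produce and may not match your version.
CANDIDATE_WEIGHT_FILES = (
    "pytorch_model.bin",
    "model.safetensors",
    "value_head.bin",
    "value_head.safetensors",
)


def list_weight_files(checkpoint_dir, candidates=CANDIDATE_WEIGHT_FILES):
    """Return {file name: size in bytes} for each candidate file that exists."""
    found = {}
    for name in candidates:
        path = os.path.join(checkpoint_dir, name)
        if os.path.isfile(path):
            found[name] = os.path.getsize(path)
    return found


if __name__ == "__main__":
    # Demo on a temporary directory standing in for a real checkpoint directory.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "value_head.bin"), "wb") as f:
            f.write(b"\0" * 8)
        print(list_weight_files(tmp))
```

Pointing it at the output directory from the reproduction command (saves/Qwen1.5-0.5B-Chat/full_rm) or at one of its checkpoint subdirectories would show whether pytorch_model.bin is really absent or whether the weights were rewritten under a different file name.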

Reproduction

llamafactory-cli train --stage rm --do_train True --model_name_or_path models/Qwen1.5-0.5B-Chat --preprocessing_num_workers 16 --finetuning_type full --quantization_method bitsandbytes --template qwen --flash_attn auto --dataset_dir data --dataset dpo_en_demo --cutoff_len 256 --learning_rate 0.0002 --num_train_epochs 3.0 --max_samples 500 --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --max_grad_norm 1.0 --logging_steps 5 --save_steps 100 --warmup_steps 0 --optim adamw_torch --packing False --report_to none --output_dir saves/Qwen1.5-0.5B-Chat/full_rm --bf16 True --plot_loss True --ddp_timeout 180000000 --include_num_input_tokens_seen True --save_safetensors False

Expected behavior

No response

Others

No response


Labels: solved (This problem has been already solved)
