
Error when running inference using vLLM with multi-GPU #4780

Closed
@hzhaoy

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

System: Ubuntu 20.04.2 LTS
GPU: NVIDIA A100-SXM4-80GB
Docker: 24.0.0
Docker Compose: v2.17.3
llamafactory: 0.8.2.dev0
vllm: 0.5.1

Reproduction

Dockerfile: https://github.com/hiyouga/LLaMA-Factory/blob/67040f149c0b3fbae443ba656ed0dcab0ebaf730/docker/docker-cuda/Dockerfile

Build Command:

docker build -f ./Dockerfile \
    --build-arg INSTALL_BNB=true \
    --build-arg INSTALL_VLLM=true \
    --build-arg INSTALL_DEEPSPEED=true \
    --build-arg INSTALL_FLASHATTN=true \
    --build-arg PIP_INDEX=https://pypi.tuna.tsinghua.edu.cn/simple \
    -t llamafactory:latest .

Launch Command:

docker run -dit --gpus=all \
    -v ./hf_cache:/root/.cache/huggingface \
    -v ./ms_cache:/root/.cache/modelscope \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -p 7860:7860 \
    -p 8000:8000 \
    --shm-size 16G \
    --name llamafactory \
    llamafactory:latest
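
Once the container is up, the installed versions can be cross-checked against the System Info above (a minimal sanity check, nothing more):

docker exec llamafactory python3 -c "import vllm, torch; print('vllm', vllm.__version__, '| torch', torch.__version__)"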

docker exec -it llamafactory bash

llamafactory-cli webui
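
For completeness, the same multi-GPU vLLM load can presumably also be triggered without the webui via the chat CLI (a sketch; the model path, template, and backend arguments are assumptions based on standard LLaMA-Factory inference options, not taken from this report):

CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli chat \
    --model_name_or_path Qwen/Qwen2-7B-Instruct \
    --template qwen \
    --infer_backend vllm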

The error below occurs when loading Qwen2-7B-Instruct in the Chat tab of the webui using vLLM with multiple GPUs:

(VllmWorkerProcess pid=263) Process VllmWorkerProcess:
(VllmWorkerProcess pid=263) Traceback (most recent call last):
(VllmWorkerProcess pid=263)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=263)     self.run()
(VllmWorkerProcess pid=263)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=263)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 210, in _run_worker_process
(VllmWorkerProcess pid=263)     worker = worker_factory()
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 68, in _create_worker
(VllmWorkerProcess pid=263)     wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 334, in init_worker
(VllmWorkerProcess pid=263)     self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 85, in __init__
(VllmWorkerProcess pid=263)     self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 217, in __init__
(VllmWorkerProcess pid=263)     self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 45, in get_attn_backend
(VllmWorkerProcess pid=263)     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 151, in which_attn_to_use
(VllmWorkerProcess pid=263)     if torch.cuda.get_device_capability()[0] < 8:
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(VllmWorkerProcess pid=263)     prop = get_device_properties(device)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(VllmWorkerProcess pid=263)     _lazy_init()  # will define _get_device_properties
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 279, in _lazy_init
(VllmWorkerProcess pid=263)     raise RuntimeError(
(VllmWorkerProcess pid=263) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
ERROR 07-11 13:53:53 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 263 died, exit code: 1
INFO 07-11 13:53:53 multiproc_worker_utils.py:123] Killing local vLLM worker processes
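
The RuntimeError above points at the cause: the vLLM worker processes are forked after CUDA has already been initialized in the parent process, whereas CUDA multiprocessing requires the 'spawn' start method. A possible workaround (an assumption, not verified here) is to ask vLLM to spawn its workers instead of forking, via its VLLM_WORKER_MULTIPROC_METHOD environment variable, before launching the webui:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
llamafactory-cli webui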

Expected behavior

The model loads successfully using vLLM with multiple GPUs.

Others

No response

Labels

solved (This problem has been already solved)
