
Error when running inference using vLLM with multi-GPU #4780

Closed
@hzhaoy

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

System: Ubuntu 20.04.2 LTS
GPU: NVIDIA A100-SXM4-80GB
Docker: 24.0.0
Docker Compose: v2.17.3
llamafactory: 0.8.2.dev0
vllm: 0.5.1

Reproduction

Dockerfile: https://github.com/hiyouga/LLaMA-Factory/blob/67040f149c0b3fbae443ba656ed0dcab0ebaf730/docker/docker-cuda/Dockerfile

Build Command:

docker build -f ./Dockerfile \
    --build-arg INSTALL_BNB=true \
    --build-arg INSTALL_VLLM=true \
    --build-arg INSTALL_DEEPSPEED=true \
    --build-arg INSTALL_FLASHATTN=true \
    --build-arg PIP_INDEX=https://pypi.tuna.tsinghua.edu.cn/simple \
    -t llamafactory:latest .

Launch Command:

docker run -dit --gpus=all \
    -v ./hf_cache:/root/.cache/huggingface \
    -v ./ms_cache:/root/.cache/modelscope \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -p 7860:7860 \
    -p 8000:8000 \
    --shm-size 16G \
    --name llamafactory \
    llamafactory:latest
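
Once the container is up, the installed versions can be cross-checked against the System Info above (a minimal sanity check, nothing more):

docker exec llamafactory python3 -c "import vllm, torch; print('vllm', vllm.__version__, '| torch', torch.__version__)"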

docker exec -it llamafactory bash

llamafactory-cli webui
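
For completeness, the same multi-GPU vLLM load can presumably also be triggered without the webui via the chat CLI (a sketch; the model path, template, and backend arguments are assumptions based on standard LLaMA-Factory inference options, not taken from this report):

CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli chat \
    --model_name_or_path Qwen/Qwen2-7B-Instruct \
    --template qwen \
    --infer_backend vllm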

The error below occurs when loading Qwen2-7B-Instruct in the Chat tab of the webui using vLLM with multiple GPUs:

(VllmWorkerProcess pid=263) Process VllmWorkerProcess:
(VllmWorkerProcess pid=263) Traceback (most recent call last):
(VllmWorkerProcess pid=263)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=263)     self.run()
(VllmWorkerProcess pid=263)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=263)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 210, in _run_worker_process
(VllmWorkerProcess pid=263)     worker = worker_factory()
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 68, in _create_worker
(VllmWorkerProcess pid=263)     wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 334, in init_worker
(VllmWorkerProcess pid=263)     self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 85, in __init__
(VllmWorkerProcess pid=263)     self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 217, in __init__
(VllmWorkerProcess pid=263)     self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 45, in get_attn_backend
(VllmWorkerProcess pid=263)     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 151, in which_attn_to_use
(VllmWorkerProcess pid=263)     if torch.cuda.get_device_capability()[0] < 8:
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(VllmWorkerProcess pid=263)     prop = get_device_properties(device)
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(VllmWorkerProcess pid=263)     _lazy_init()  # will define _get_device_properties
(VllmWorkerProcess pid=263)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 279, in _lazy_init
(VllmWorkerProcess pid=263)     raise RuntimeError(
(VllmWorkerProcess pid=263) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
ERROR 07-11 13:53:53 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 263 died, exit code: 1
INFO 07-11 13:53:53 multiproc_worker_utils.py:123] Killing local vLLM worker processes
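
The RuntimeError above points at the cause: the vLLM worker processes are forked after CUDA has already been initialized in the parent process, whereas CUDA multiprocessing requires the 'spawn' start method. A possible workaround (an assumption, not verified here) is to ask vLLM to spawn its workers instead of forking, via its VLLM_WORKER_MULTIPROC_METHOD environment variable, before launching the webui:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
llamafactory-cli webui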

Expected behavior

The model loads successfully using vLLM with multiple GPUs.

Others

No response

Labels

solved (This problem has been already solved)
