
Distributed inference with llama.cpp #2322

Closed
@mudler

Description

As ggml-org/llama.cpp#6829 (great job llama.cpp!) is in, it should be possible to extend our gRPC server to distribute the workload to workers.

From a quick look, the upstream implementation looks quite lean, as we just need to pass the params to llama.cpp directly.

The only main point is that we want to propagate this setting from the CLI/env rather than having a config section in the model. A rough sketch of what that could look like is below.
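
For illustration only, a minimal sketch (in Go) of how the CLI/env propagation could work, assuming a hypothetical `LLAMACPP_GRPC_SERVERS` environment variable that holds a comma-separated list of worker addresses and is forwarded to the llama.cpp backend via its `--rpc` option; the variable name and helper are assumptions, not the final implementation:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// buildBackendArgs appends the distributed-inference worker list to the
// backend arguments. The worker list comes from the (hypothetical)
// LLAMACPP_GRPC_SERVERS env var instead of the model config, so the same
// model file works with or without distribution.
func buildBackendArgs(baseArgs []string) []string {
	args := append([]string{}, baseArgs...)
	if servers := os.Getenv("LLAMACPP_GRPC_SERVERS"); servers != "" {
		// e.g. LLAMACPP_GRPC_SERVERS="192.168.1.10:50052,192.168.1.11:50052"
		hosts := strings.Split(servers, ",")
		for i := range hosts {
			hosts[i] = strings.TrimSpace(hosts[i])
		}
		// llama.cpp exposes the worker list through its --rpc option
		args = append(args, "--rpc", strings.Join(hosts, ","))
	}
	return args
}

func main() {
	fmt.Println(buildBackendArgs([]string{"--model", "ggml-model.gguf"}))
}
```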
