
Distributed inference with llama.cpp #2322

Closed
@mudler

Description

As ggml-org/llama.cpp#6829 (great job llama.cpp!) is in, it should be possible to extend our gRPC server to distribute the workload to workers.

From a quick look, the upstream implementation looks quite lean, as we just need to pass the params to llama.cpp directly.

The only main point is that we want to propagate this setting from the CLI/env rather than having a config section in the model. A rough sketch of what that could look like is below.
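
For illustration only, a minimal sketch (in Go) of how the CLI/env propagation could work, assuming a hypothetical `LLAMACPP_GRPC_SERVERS` environment variable that holds a comma-separated list of worker addresses and is forwarded to the llama.cpp backend via its `--rpc` option; the variable name and helper are assumptions, not the final implementation:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// buildBackendArgs appends the distributed-inference worker list to the
// backend arguments. The worker list comes from the (hypothetical)
// LLAMACPP_GRPC_SERVERS env var instead of the model config, so the same
// model file works with or without distribution.
func buildBackendArgs(baseArgs []string) []string {
	args := append([]string{}, baseArgs...)
	if servers := os.Getenv("LLAMACPP_GRPC_SERVERS"); servers != "" {
		// e.g. LLAMACPP_GRPC_SERVERS="192.168.1.10:50052,192.168.1.11:50052"
		hosts := strings.Split(servers, ",")
		for i := range hosts {
			hosts[i] = strings.TrimSpace(hosts[i])
		}
		// llama.cpp exposes the worker list through its --rpc option
		args = append(args, "--rpc", strings.Join(hosts, ","))
	}
	return args
}

func main() {
	fmt.Println(buildBackendArgs([]string{"--model", "ggml-model.gguf"}))
}
```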
