llama.cpp

llama.cpp is an open-source library for running large language models (LLMs) locally with high performance and minimal dependencies. Because it can run models entirely on the CPU and system RAM, it works well on machines with limited GPU capability, while still allowing part or all of the model to be offloaded to a GPU when one is available.

Model

  • Model Selection:

    Choose the model to use by providing the path to a GGUF file, a folder containing the model, or a HuggingFace model name and file (e.g., hf:ModelName/ModelFile).
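
For reference, whatever the Model setting points at ultimately resolves to a concrete GGUF file path that is handed to llama.cpp; the hf: shorthand and folder lookup are presumably resolved by the application before the library is called. Below is a minimal, hedged C++ sketch of that final step using the llama.cpp C API. The file path is a placeholder, and exact function names can vary slightly between llama.cpp releases.

    #include "llama.h"
    #include <cstdio>

    int main() {
        // One-time backend initialization.
        llama_backend_init();

        // Whatever the Model setting points at (file, folder, or hf: name)
        // ends up as a concrete GGUF path by the time llama.cpp is invoked.
        llama_model_params mparams = llama_model_default_params();
        llama_model * model = llama_load_model_from_file("models/example.gguf", mparams);
        if (model == NULL) {
            fprintf(stderr, "failed to load model\n");
            return 1;
        }

        llama_free_model(model);
        llama_backend_free();
        return 0;
    }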

Model Settings

  • Models Directory:

    Define where models are stored or downloaded. The default directory is Data/HuggingFace, but you can specify a custom path.

  • Main GPU:

    Select the GPU to use for inference.

  • GPU Layers:

    Set how many layers of the model to load onto the GPU. Adjusting this value helps balance GPU and CPU usage, especially on systems with limited GPU memory.

  • Threads:

    Specify the number of CPU threads to use for processing. A value of 0 means all available threads will be used.

  • Split Mode:

    Choose how the model is split across multiple GPUs: none (use a single GPU), layer (split whole layers across GPUs), or row (split tensor rows across GPUs). The sketch after this list shows how these settings map onto the llama.cpp API.

  • Tensor Splits:

    Define how the model's tensors are divided across devices when multiple GPUs are used (e.g., 30,70 for a 30%/70% split).

  • Context Size:

    Set the size of the context window, i.e., the maximum number of tokens (prompt plus generated output) the model can handle in a single inference. The default is 4096.
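
Taken together, these settings correspond closely to fields on llama.cpp's parameter structs: the GPU-related options are applied when the model is loaded (llama_model_params), while the context size and thread count are applied when a context is created (llama_context_params). The following C++ sketch shows one plausible mapping; the numeric values and model path are placeholders chosen for illustration, and exact function names can differ between llama.cpp releases.

    #include "llama.h"
    #include <cstdio>
    #include <vector>

    int main() {
        llama_backend_init();

        // Load-time settings: GPU offload and multi-GPU behaviour.
        llama_model_params mparams = llama_model_default_params();
        mparams.main_gpu     = 0;                       // Main GPU
        mparams.n_gpu_layers = 32;                      // GPU Layers (placeholder value)
        mparams.split_mode   = LLAMA_SPLIT_MODE_LAYER;  // Split Mode

        // Tensor Splits: a UI value like "30,70" becomes one proportion per device.
        std::vector<float> splits(llama_max_devices(), 0.0f);
        if (splits.size() >= 2) { splits[0] = 0.30f; splits[1] = 0.70f; }
        mparams.tensor_split = splits.data();

        llama_model * model = llama_load_model_from_file("models/example.gguf", mparams);
        if (model == NULL) {
            fprintf(stderr, "failed to load model\n");
            return 1;
        }

        // Context-time settings: context window and CPU threads.
        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx           = 4096;  // Context Size
        cparams.n_threads       = 8;     // Threads used for generation
        cparams.n_threads_batch = 8;     // Threads used for prompt processing

        llama_context * ctx = llama_new_context_with_model(model, cparams);
        if (ctx == NULL) {
            fprintf(stderr, "failed to create context\n");
            llama_free_model(model);
            return 1;
        }

        llama_free(ctx);
        llama_free_model(model);
        llama_backend_free();
        return 0;
    }

The tensor_split values are treated as relative proportions, so 30,70 and 0.3,0.7 describe the same split. A Threads value of 0 in the settings would presumably be translated to the machine's available thread count before being passed as n_threads.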