These days, CUDA, llama.cpp, and the optimized GGUF model format are getting more popular. Let us see what we need to run the optimized “Phi-3-mini-128k-instruct” model in GGUF format with llama.cpp on an IBM Cloud Virtual Server Instance with GPUs and an Ubuntu 22.04 operating system.
Note: The 128k in the name indicates the model's maximum input token (context) size.
These are the steps we need to take care of:
Step 1: Setup a VSI with a GPU on IBM Cloud
We can follow the steps in the linked blog post "How do you initially set up a Virtual Server Instance with a GPU in IBM Cloud?" to set up the Virtual Server Instance.
Step 2: Ensure that the path to the NVIDIA CUDA Compiler Driver NVCC is included in the $PATH variable. The nvcc compiler is part of the nvidia-cuda-toolkit.
If we followed the instructions of the blog post in Step 1, we notice that nvcc is not available on the PATH, even though the nvidia-cuda-toolkit was installed, so we need to find the installation location.
Note: We need the nvcc application later, because it will be invoked by the makefile of llama.cpp.
- Find the nvcc application on our machine with the following command:
find / | grep nvcc
- Output:
...
/usr/local/cuda-12.5/bin/nvcc
...
- Create or edit the .bashrc file:
cd ~/
nano .bashrc
- Insert the following variables into the .bashrc file:
export NVCC_CUDA_TOOLS="/usr/local/cuda-12.5/bin/"
export PATH=$PATH:$NVCC_CUDA_TOOLS
# Define the CUDA_HOME variable
export CUDA_HOME="/usr/local/cuda-12.5/"
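- Reload the configuration so that the new variables take effect in the current shell (or open a new terminal):
source ~/.bashrc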
- Verify that the command is now available:
nvcc
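- Optionally, we can also print the compiler version to confirm which CUDA toolkit is being used (the path above points to CUDA 12.5):
nvcc --version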
Step 3: Ensure we have the latest gcc compiler
Later, we will compile llama.cpp, so we ensure that we have a recent gcc compiler in place.
sudo apt install gcc
sudo apt install --reinstall gcc-12
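We can check the installed compiler version afterwards:
gcc --version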
Step 4: Compile llama.cpp to use the GPUs
This is the most critical step for us, because only with the right compilation can we use the GPUs. The following text is partly extracted from the llama.cpp documentation.
CUDA This provides GPU acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed.
Here we use the parameter LLAMA_CUDA=1
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1
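Optionally, instead of the plain make call, we can capture the build output and afterwards search it for nvcc invocations, as a quick check that CUDA code was actually compiled (build.log is just a file name chosen here):
make LLAMA_CUDA=1 2>&1 | tee build.log
grep -i nvcc build.log | head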
Step 5: We create a virtual Python environment
python3 -m venv my-env
source my-env/bin/activate
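To verify that the virtual environment is active, we can check which Python interpreter is used; it should point into the my-env directory:
which python3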
Step 6: We need to ensure that llama-cpp-python is also compiled with CUDA support, matching our llama.cpp build
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
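A quick sanity check for the installation is to ask the low-level bindings whether GPU offload was compiled in. This assumes that the installed llama-cpp-python version exposes the llama_supports_gpu_offload binding:
python3 -c "import llama_cpp; print('GPU offload supported:', llama_cpp.llama_supports_gpu_offload())"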
Step 7: We download the needed model
We ensure that we can use the Huggingface CLI to download the model file.
- Install the CLI
python3 -m pip install huggingface-hub
- Download the model
huggingface-cli download QuantFactory/Phi-3-mini-128k-instruct-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q8_0*gguf'
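- Check that the quantized model file arrived in the current directory; the file name should match the include pattern above:
ls -lh ./*Q8_0*gguf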
Step 8: Example code to run the model
The following code is an example of how to load a prompt from a PDF file and run the model on it.
import os
from llama_cpp import Llama
from PyPDF2 import PdfReader

###########
# Functions
def load_input():
    # Read all pages of the prompt PDF and join them into a single prompt string
    filepath = os.path.abspath("./prompts/prompt.pdf")
    with open(filepath, 'rb') as file:
        pdf_reader = PdfReader(file)
        pages = []
        for i in range(0, len(pdf_reader.pages)):
            pages.append(pdf_reader.pages[i].extract_text())
        prompt = '\n'.join(pages)
    return prompt

###########
# Execution
# 1. Load prompt from pdf
prompt = load_input()

# 2. Load the model gguf file
model_path = "./models/Phi-3-mini-128k-instruct.Q8_0.gguf"
llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,  # Use GPU acceleration (-1 offloads all layers)
    # seed=1337,      # Uncomment to set a specific seed
    n_ctx=128000,     # Increase the context window (the 128k model supports up to 128k tokens)
    echo=False
)

# 3. Generate an LLM response
output = llm.create_completion(prompt,
                               max_tokens=1000,
                               repeat_penalty=1.2,
                               temperature=0.0
                               )
print("\n*****\nLLM Output:\n", output['choices'][0]['text'])
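The script expects the GGUF file under ./models and the prompt PDF under ./prompts. Since Step 7 downloaded the model into the current directory, one possible way to arrange the files is shown below; the prompt PDF path is only a placeholder for your own input file:
mkdir -p models prompts
mv ./*Q8_0*gguf ./models/
cp /path/to/your/prompt.pdf ./prompts/prompt.pdf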
Step 9: Run the model
- Run the model in the first terminal
source ./my-env/bin/activate
python3 -m pip install huggingface-hub
python3 -m pip install PyPDF2
python3 run_example_code.py
- Monitor the GPU usage in the second terminal
watch -n0.5 nvidia-smi
2. Summary
These are the essential takeaways from my perspective:
- Ensure you use the correct nvcc application version
- Ensure you compile llama.cpp for the right platform
- Ensure you use the correctly compiled version of llama-cpp-python in your Python code
3. Additional resources
The following resources may be helpful in this context.
- Running Mistral on CPU via llama.cpp Blog post from Niklas Heidloff
- Pypi.org Llama cpp Python Library documentation
- Install llama-cpp-python with GPU Support Blog post from Manish Kovelamudi
- How to Install CUDA on Ubuntu 22.04 | Step-by-Step Blog post from Mantas Levinas
I hope this was useful to you, and let’s see what’s next.
Greetings,
Thomas
#python, #cuda, #llamacpp, #ibmcloud, #gpu, #ai
