Using CUDA and llama.cpp to Run a Phi-3-Mini-128K-Instruct Model on IBM Cloud VSI with GPUs

These days, CUDA, llama.cpp, and the optimized GGUF model format are getting more and more popular. Let us see what we need to run the optimized “Phi-3-Mini-128K-Instruct” model in GGUF format with llama.cpp on an IBM Cloud Virtual Server Instance with GPUs and an Ubuntu 22.04 operating system.

Note: The 128k in the model name indicates the maximum input context size of 128,000 tokens.

These are the steps we need to take care of:

Step 1: Set up a VSI with a GPU on IBM Cloud

We can follow the steps in the linked blog post to set up the Virtual Server Instance: How do you initially set up a Virtual Server Instance with a GPU in IBM Cloud?

Step 2: We need to ensure that the path to the NVIDIA CUDA Compiler Driver (nvcc) is included in the $PATH variable. nvcc is part of the nvidia-cuda-toolkit.

If we followed the instructions of the blog post in Step 1, we notice that nvcc is not available on the $PATH even though the nvidia-cuda-toolkit was installed, so we need to find the installation location.

Note: We need the nvcc application later because it is invoked by the makefile in llama.cpp.

  • Find the nvcc application on our machine with the following command:
find / | grep nvcc
  • Output:
...
/usr/local/cuda-12.5/bin/nvcc
...
  • Open (or create) the .bashrc file:
cd ~/
nano .bashrc
  • Insert the following variables into the .bashrc file:
export NVCC_CUDA_TOOLS="/usr/local/cuda-12.5/bin/"
export PATH=$PATH:$NVCC_CUDA_TOOLS
# Define the CUDA_HOME variable  
export CUDA_HOME="/usr/local/cuda-12.5/"
  • Reload the shell configuration and verify that the command is now available:
source ~/.bashrc
nvcc --version
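
If nvcc is still not found, it helps to check whether the CUDA bin directory actually ended up in the $PATH (the cuda-12.5 path is the one we found above; adjust it to your installation):

which nvcc
echo $PATH | tr ':' '\n' | grep -i cuda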

Step 3: Ensure we have the latest gcc compiler

Later, we will compile llama.cpp, so we ensure that a recent gcc compiler is in place.

sudo apt install gcc
sudo apt install --reinstall gcc-12
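
To confirm which compiler versions are now active, a quick check:

gcc --version
gcc-12 --version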

Step 4: Compile llama.cpp

This step is the most critical one for us because only with the right compilation can we use the GPUs. The following text is partly extracted from the llama.cpp documentation.

CUDA: This provides GPU acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed.

Here we use the parameter LLAMA_CUDA=1:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1
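
After the build finishes, a quick smoke test with the compiled binary confirms that the GPU build works (a minimal sketch; depending on the llama.cpp version, the binary is named main or llama-cli, and the model file is the one we download later in Step 7):

./main -m ./Phi-3-mini-128k-instruct.Q8_0.gguf -p "Hello" -n 32 -ngl 99

If the startup log reports layers offloaded to the GPU, the CUDA build is working.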

Step 5: We create a virtual Python environment

python3 -m venv my-env
source my-env/bin/activate
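
To verify that the virtual environment is active, the Python interpreter should now resolve to the environment's own binary:

which python3
# expected output ends with: my-env/bin/python3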

Step 6: We need to ensure that llama-cpp-python is also compiled with CUDA support

CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
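
We can verify that the installed wheel was actually built with GPU offload support. A minimal check, assuming a llama-cpp-python version that exposes the llama_supports_gpu_offload binding:

python3 -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"

If this prints True, the package was compiled with CUDA support.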

Step 7: We download the needed model

We ensure that we can use the Hugging Face CLI to download the model file.

  • Install the CLI
python3 -m pip install huggingface-hub
  • Download the model
huggingface-cli download QuantFactory/Phi-3-mini-128k-instruct-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q8_0*gguf'
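
After the download, it is worth checking that the GGUF file arrived completely. For the Q8_0 quantization of the 3.8B parameter Phi-3 Mini, the file should be roughly 4 GB:

ls -lh ./*Q8_0*.gguf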

Step 8: Example code to run the model

The following code is an example of how to load a prompt from a PDF file and run the model.

import os
from llama_cpp import Llama
from PyPDF2 import PdfReader

###########
# Functions
def load_input():
    # Read all pages of the PDF and join their text into a single prompt
    filepath = os.path.abspath("./prompts/prompt.pdf")
    with open(filepath, 'rb') as file:
        pdf_reader = PdfReader(file)
        pages = [page.extract_text() for page in pdf_reader.pages]
    return '\n'.join(pages)

###########
# Execution

# 1. Load prompt from pdf
prompt = load_input()

# 2. Load the model gguf file
model_path = "./models/Phi-3-mini-128k-instruct.Q8_0.gguf"
llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,  # Offload all layers to the GPU (-1 means all layers)
    # seed=1337,      # Uncomment to set a specific seed
    n_ctx=128000,     # Increase the context window to the model maximum
    verbose=False     # Suppress llama.cpp logging
)

# 3. Generate an LLM response
output = llm.create_completion(prompt,
    max_tokens=1000,
    repeat_penalty=1.2,
    temperature=0.0
)

print("\n*****\nLLM Output:\n", output['choices'][0]['text'])
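
Since Phi-3-Mini-128K-Instruct is an instruction-tuned model, it can be preferable to let llama-cpp-python apply the model's chat template (stored in the GGUF metadata) instead of sending raw text. A minimal sketch, reusing the llm object and prompt from above:

# Alternative: apply the model's chat template via the chat completion API
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1000,
    temperature=0.0
)
print(output['choices'][0]['message']['content'])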

Step 9: Run the model

  • Run the model in the first terminal
source ./my-env/bin/activate
python3 -m pip install huggingface-hub
python3 -m pip install PyPDF2
python3 run_example_code.py
  • Monitor the GPU usage in the second terminal
watch -n0.5 nvidia-smi

Summary

These are the essential takeaways from my perspective:

  • Ensure you use the correct nvcc application version
  • Ensure you compile llama.cpp for the right platform
  • Ensure you use the correctly compiled version of llama-cpp-python in your Python code

I hope this was useful to you, and let's see what's next!

Greetings,

Thomas

#python, #cuda, #llamacpp, #ibmcloud, #gpu, #ai
