Getting started with Text Generation Inference (TGI) using a container to serve your LLM model

The objective of this blog post is to provide a simple automation to set up and test Text Generation Inference (TGI) using a container. The automation can also be used as a starting point for more advanced automation later on, for example to optimize your GPU utilization when serving your AI model. Text Generation Inference (TGI) implements many optimizations and valuable features.

The simple bash automation was created based on the documentation of the Hugging Face Text Generation Inference quick tour and the Nvidia install guide for the container-toolkit. Here is a link to the supported models of Hugging Face Text Generation Inference.

Note: You need a working machine with a GPU. An example setup of a Virtual Server Instance on IBM Cloud is described in my blog post: How do you initially set up a Virtual Server Instance with a GPU in IBM Cloud?

“Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5.” (Source: Hugging Face, 14.02.2024)

The source of the image below is Hugging Face (14.02.2024).

Containers are always an excellent choice to get started with because they significantly reduce installation and setup time; this is why the Text Generation Inference quick tour is based on a container.

Used technologies and environment:

  •  Docker
  •  Python
  •  Nvidia
  •  Hugging Face
  •  bash
  •  Machine with a GPU and Ubuntu OS

Table of contents

  1. The bash automation overview
  2. The bash automation source code
  3. Example output
    1. Setup
    2. Test
  4. Known issues when getting started
    1. You should install the container-toolkit to avoid the problem
    2. Is your model not fully loaded? (Wait a bit)
    3. You may use the wrong model path (verify your path settings for the volume)

1. The bash automation overview

Ensure you have followed the steps in the Nvidia install guide for the container-toolkit before you run the bash automation. The following code is an extract from that documentation (14.02.2024).

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# (Optional) enable the experimental packages in the repository list
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
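
After the package installation, the Nvidia install guide also configures Docker to use the Nvidia runtime and restarts the Docker service. The following lines are a minimal sketch of those steps plus a quick GPU check; the ubuntu image in the last command is only an example, any small image works.

# Configure Docker to use the Nvidia runtime and restart the service
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Quick check that a container can access the GPU
sudo docker run --rm --gpus all ubuntu nvidia-smi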

The two bash automations execute the following steps:

  • Setup
  1. Create and map a data folder for the models
  2. Create a Python test client
  3. (Optional) Start and stop Docker service
  4. (Optional) Stop and remove the tgi_server container
  5. Define the model and run the tgi_server
  • Test
  1. Verify if the model is loaded.
  2. Start the Python application to test TGI.

2. The bash automation source code

  • Setup
#!/bin/bash
export HOME_PATH=$(pwd)
echo "1. Create and map a data folder for the models."
mkdir ${HOME_PATH}/tgi
mkdir ${HOME_PATH}/tgi/data
mkdir ${HOME_PATH}/tgi/app
export TGI_VOLUME=${HOME_PATH}/tgi/data
ls ${TGI_VOLUME}
echo "2. Create Python test client"
cat > ${HOME_PATH}/tgi/app/tgi_test.py <<EOF
import requests
headers = {
    "Content-Type": "application/json",
}
data = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 20,
    },
}
response = requests.post('http://localhost:8080/generate', headers=headers, json=data)
print(response.json())
EOF
ls ${HOME_PATH}/tgi/app/
echo "3. (Optional) Start and stop Docker service."
sudo systemctl stop docker
sudo systemctl start docker
echo "4. (Optional) Stop and remove the tgi_server container."
docker stop tgi_server
docker container ls
docker rm tgi_server
echo "5. Define model and run the tgi_server"
export MODEL=tiiuae/falcon-7b-instruct
#export TAG=1.4
export TAG=latest
docker run -it --name tgi_server \
           --gpus all \
           --shm-size 1g \
           -p 8080:80 \
           -v ${TGI_VOLUME}:/data \
           ghcr.io/huggingface/text-generation-inference:${TAG} \
           --model-id $MODEL
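
The setup automation starts the server in the foreground with -it, so the launcher log stays in this terminal. If you prefer a detached server, a variant like the one below (an assumption, not part of the automation above) does the same and lets you follow the startup log with docker logs:

# Detached variant of step 5: run in the background and follow the log
docker run -d --name tgi_server \
           --gpus all \
           --shm-size 1g \
           -p 8080:80 \
           -v ${TGI_VOLUME}:/data \
           ghcr.io/huggingface/text-generation-inference:${TAG} \
           --model-id $MODEL
docker logs -f tgi_server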

  • Test in a new terminal
#!/bin/bash
export HOME_PATH=$(pwd)
export TGI_VOLUME=${HOME_PATH}/tgi/data
echo "1. Wait and verify is the model loaded."
ls -a -l ${TGI_VOLUME}
echo "2. Start the Python application to test TGI"
#source /YOUR_ENVIRONMENT/bin/activate
python3 ${HOME_PATH}/tgi/app/tgi_test.py
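
As an alternative to the Python client, you can call the generate endpoint directly with curl, as the Hugging Face quick tour does; this assumes the port mapping 8080:80 used in the setup automation.

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'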

3. Example output

These are examples of the outputs of the two bash automations:

a. Setup in the first terminal

  • Executed command
bash tgi_setup.sh
  • Output
1. Create and map a data folder for the models
2. Create Python test client
.  ..  tgi_test.py
3. (Optional) Start and stop Docker service
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket
4. (Optional) Stop and remove the tgi_server container
tgi_server
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
tgi_server
5. Define model and run the tgi_server
71b6bb31f487664655893c5c8d7baa8284197969be20bc4ff2b767b63c9ed4fb
2024-02-14T19:04:48.990447Z  INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-7b-instruct", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, enable_cuda_graphs: false, hostname: "63014889664c", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, env: false }
2024-02-14T19:04:48.990587Z  INFO download: text_generation_launcher: Starting download process.
2024-02-14T19:04:57.434485Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-02-14T19:04:58.806577Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-02-14T19:04:58.806845Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
b. Test in the second terminal

  • Executed command
bash tgi_test.sh
  • Output
1. Wait and verify that the model is loaded.
total 20
drwxr-xr-x 4 root root 4096 Feb 14 19:04 .
drwxr-xr-x 4 root root 4096 Feb 14 19:05 ..
drwxr-xr-x 3 root root 4096 Feb 14 19:07 .locks
drwxr-xr-x 6 root root 4096 Feb 14 19:09 models--tiiuae--falcon-7b-instruct
-rw-r--r-- 1 root root    1 Feb 14 19:07 version.txt
2. Start the Python application to test TGI
{'generated_text': '\nDeep learning is a branch of machine learning that uses artificial neural networks to learn and make decisions.'}

c. Verify your invocation in the first terminal.

2024-02-14T19:13:14.463676Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(20), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None } total_time="706.854766ms" validation_time="278.87µs" queue_time="52.905µs" inference_time="706.523117ms" time_per_token="35.326155ms" seed="None"}: text_generation_router::server: router/src/server.rs:299: Success
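
If you want to double-check which model and launcher arguments the running server uses, you can query the TGI info endpoint; the command below is a quick sketch that assumes the port mapping 8080:80 from the setup automation.

# Query the TGI info endpoint and pretty-print the JSON response
curl -s 127.0.0.1:8080/info | python3 -m json.tool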

4. Known issues when getting started

1. You should install the container-toolkit to avoid the problem:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
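
If you are not sure whether the container-toolkit is already installed and registered with Docker, a quick check (a sketch, not part of the automation) is:

# The nvidia runtime should appear in the Docker runtime list
docker info | grep -i runtimes
# The toolkit CLI should report a version
nvidia-ctk --version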

2. Is your model not fully loaded? (Wait a bit)

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 461, in getresponse
    httplib_response = super().getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()

3. You may use the wrong model path. (Verify your path settings for the volume)

'/root/.cache/huggingface/hub/tiiuae/falcon-7b-instruct'. Use repo_type argument if needed.

Inside the container, the filesystem differs from the host filesystem that you mount via the variable TGI_VOLUME. The model storage path inside the container is `/data/`; therefore, when you load a model from the mounted volume, the MODEL variable must point to its location inside the container.

This is an example of how to set the values.

TGI_VOLUME=YOUR_LOCAL_DRIVE_OF_YOUR_MODELS
MODEL=/data/YOUR_LOCAL_MODELNAME
docker run -d --name tgi_server \
           --gpus all \
           --shm-size 1g \
           -p 8080:80 \
           -v ${TGI_VOLUME}:/data \
           ghcr.io/huggingface/text-generation-inference:${TAG} \
           --model-id $MODEL
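
To confirm that the volume is mounted as intended, you can compare the host view and the container view of the model folder; both commands are only a quick check.

# What is on the host side of the mount
ls ${TGI_VOLUME}
# What the container actually sees under /data
docker exec tgi_server ls /data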

I hope this was useful to you, and let’s see what’s next!

Greetings,

Thomas

#python, #gpu, #docker, #huggingface, #nvidia, #bash, #automation, #container, #ai
