Fine-tune a large language model (llm) for multi-turn conversations and run it on a Text Generation Inference (TGI) server

This blog post is about the initial fine-tuning process for a large language model (llm) for multi-turn conversations and running the fine-tuned model on a Text Generation Inference (TGI) server on an IBM Cloud Virtual Server Instance. It also covers the entire process from the training of the model until the model is ready to be tested.

LLMs are becoming popular for multi-turn conversations with the introduction of ChatGPT’s interactive chat experience.

The blog post briefly covers several key topics, including knowing the use case, information resources, and the tasks involved when moving from fine-tuning to running the model, up to the deployment of the fine-tuned model in an enterprise environment using watsonx.ai.

Table of contents

Example for multi-turn
Resources
Topics related to fine-tuning a llm
Knowing our use case
Knowing why we need to fine-tune a model
Knowing the type of model we need to address our use case
Knowing the golden ground truth for valid, invalid, out-of-topic conversation flows for the use case
Knowing the data output format, we wish that our fine-tuning produces
Knowing the data format for the training data and testing data we want to use for the fine-tuning to achieve our needed output format
Preparing the train/test data using synthetic data generation if we don’t have enough data
Preparing and running the setup of a Virtual Server Instance on IBM Cloud with GPUs
Selecting the Libraries we want to use for the training and run the training
1. Prepare the fine-tuning by the installation of the needed libraries
2. Implement the fine-tuning
3. Run the fine-tuning
Run the fine-tuned model on a Text Generation Inference (TGI) server
Implementing or using existing evaluation/testing frameworks to test/evaluate the fine-tuned model
Define the metrics you want to use to display your evaluation results
Running our own fine-tuned model in a robust enterprise environment using watsonx.ai on-premise in a Cloud Pak for Data instance in a Virtual Private Cloud on AWS or on IBM Cloud
Summary

1. Example for multi-turn

Here is an example of one multi-turn conversation flow to get some weather data. We want to ensure that only the weather topic is relevant for the multi-turn conversation. The user can change the resulting weather data by refining the last question with a new one. We don’t expect the user to ask off-topic questions in our defined multi-turn LLM scenario, but we must still be able to handle them.

In the table, User represents a question a user asks, and Assistant represents the answer our model should later give:

Multi-turn flow

Step	Role	Content	Notes
1	`User`	Can you please give me all your weather forecast data?	Getting the initial data.
	`Assistant`	Here is the data.	The fine-tuned model provides a valid SQL query, that will be used by an application to query the needed data from a system and displays the data to the user.
2	`User`	Can you please reduce the displayed data to the `US-south` region, excluding the city Dallas?	Reduce the result of the data to display.
	`Assistant`	Here is the data.	The fine-tuned model provides a valid SQL query, that will be used by an application to query the needed data from a system and displays the data to the user.
3	`User`	Who has won the soccer world cup in 1954?	Here the user asks an off-topic question. This type of question must be covered in a multi-turn flow, which means we must be able to handle it.
	`Assistant`	I can only help you with weather data. Can you please rephrase your question?

2. Resources

I used various information resources as input to write my blog post. I want to highlight these three excellent blog posts in that context.

How to Fine-Tune LLMs in 2024 with Hugging Face by Phil Schmid
Fine-tuning LLMs via Hugging Face on IBM Cloud by Niklas Heidloff
Deploying LLMs via Hugging Face on IBM Cloud by Niklas Heidloff

3. Topics related to fine-tuning a llm

The following list contains the topics we need to take care of when we are moving from fine-tuning to running the fine-tuned model:

Knowing the “golden” ground truth containing valid, invalid, out-of-topic conversation flows for the use case
Knowing the data output format, we wish that our fine-tuning produces
Knowing the data format for the train/test data we want to use for the fine-tuning to achieve our needed output format
Preparing the train/test data using synthetic data generation if we don’t have enough data
Preparing and running the setup of a Virtual Server Instance on IBM Cloud with GPUs
Selecting the Libraries we want to use for the training and run the training
Run the fine-tuned model on a Text Generation Inference (TGI) server
Implementing or using existing evaluation/testing frameworks to test/evaluate the fine-tuned model

4. Knowing our use case

Using a large language model (LLM) to manage conversations for a use case is not an accident. It’s a strategic choice driven by the recognition of the crucial role of data extraction in numerous business scenarios.

For example, a persona that needs to extract data from a database is a good case, and there are many other possible use cases. Let us choose “providing weather data” for this blog post. To implement it, we need “Text to SQL” because we don’t want to limit the user to a specific set of questions. In this context, the solution in Phil Schmid’s Text-to-SQL blog post fits our needs.

5. Knowing why we need to fine-tune a model

One of the main reasons for using a fine-tuned model is customization for specific tasks, combined with a lower runtime cost, because we fine-tune a smaller model for a particular task. Another reason can be to minimize the prompt token size; in the optimal situation, no prompt is needed at all when a user or system interacts with the model.

6. Knowing the type of model we need to address our use case

We may have many different potential use cases for our business; for example summarization, categorization, or more.

The question often is: Which is the right model for our use case?

Let us assume we need to transform text to SQL, as mentioned in Phil Schmid’s blog post. We benchmark different models for Text-to-SQL and ask ourselves the question: Can we easily fine-tune them? Information about Text-to-SQL models can be found, for example, at Defog or on the Hugging Face LLM leaderboard. Assume we select the Mistral model for fine-tuning because others have done this before; one example is the blog post Fine-Tuning the LLM Mistral-7b for Text-to-SQL with SQL-Create-Context Dataset. That is one way to find a starting point for how to address our use case.

7. Knowing the “golden” ground truth for valid, invalid, out-of-topic conversation flows for the use case

We want to be able to measure the quality and accuracy of our fine-tuned model for our use case “providing weather data”. We need to be able to verify that the response of our model is right.

For that, we need test data called the “golden” ground truth.

The golden ground truth should consist of valid, invalid, and off-topic conversation flows. We must provide this data ourselves because we know our use case best; in many situations, others (consultants or other resources) can’t generate it for us because they are not familiar with our use case, and they would add additional costs to our project.

That is a massive task because we need the correct data in the needed amount. For example, a minimum of 1000 data sets for training and testing is a number we can find when we google for the minimum dataset size to fine-tune an LLM.
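To make the idea of labeled flows more concrete, here is a minimal sketch of how golden ground truth records could be tagged with a category and split into train and test data. The record layout, the `split_flows` helper, and the "invalid" example question are assumptions for illustration only; they are not taken from a real dataset.

```python
import random
from collections import Counter

# Hypothetical golden ground truth: each flow is labeled with its category.
golden_flows = [
    {"category": "valid", "question": "Can you please give me all your weather forecast data?"},
    {"category": "invalid", "question": "Give me the weather data for the region XYZ-unknown."},
    {"category": "off-topic", "question": "Who has won the soccer world cup in 1954?"},
] * 4  # repeated only to have enough rows for the illustration


def split_flows(flows, test_ratio=0.25, seed=42):
    """Shuffle deterministically and split the flows into train and test lists."""
    shuffled = list(flows)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


train_flows, test_flows = split_flows(golden_flows)
print(len(train_flows), len(test_flows))
print(Counter(flow["category"] for flow in test_flows))
```

A fixed seed keeps the split reproducible, so later evaluation runs compare against the same test flows.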

8. Knowing the data output format, we wish that our fine-tuning produces

Finally, developers will integrate our fine-tuned LLM into an application that implements the interaction with a user, external systems, and internal systems for our use case. This application will parse the raw data output of our model, which it receives in the response to a REST API or gRPC API call to the platform where our model runs. We should define the format that works best in our situation.

An example response or answer can be generated in JSON format by agents.

Agents are a technique for defining tools that the model can use to respond to a request; the tools are described in the prompt. For more details on how this works, see the blog post Mixtral Agents with Tools for Multi-turn Conversations by Niklas Heidloff.

Remember, this format is only for our use case implementation and can be different in other situations.

{
   "tool_name": "final answer",
   "sql": "SELECT * FROM WEATHER_DATA"
}
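The application has to extract this JSON from the raw text the model returns. The following is a minimal sketch of such a parser, assuming the model wraps its answer in a ```json fence as in the training data used in this post. The helper name `parse_model_output` is hypothetical and not part of any library.

```python
import json
import re

# Matches an optional ```json ... ``` fence around the JSON payload.
FENCE_PATTERN = re.compile(r"```(?:json)?\s*(\{.*\})\s*```", re.DOTALL)


def parse_model_output(raw_text: str) -> dict:
    """Extract the JSON object from the raw model response (hypothetical helper)."""
    match = FENCE_PATTERN.search(raw_text)
    payload = match.group(1) if match else raw_text.strip()
    result = json.loads(payload)
    if "tool_name" not in result:
        raise ValueError("response has no 'tool_name' key")
    return result


raw = '```json{"tool_name": "final answer","sql": "SELECT * FROM WEATHER_DATA"}```'
answer = parse_model_output(raw)
print(answer["tool_name"], "->", answer["sql"])
```

Failing loudly on a missing `tool_name` is a design choice for this sketch: the application can then decide whether to retry or to show a fallback message.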

9. Knowing the data format for the training data and testing data we want to use for the fine-tuning to achieve our needed output format

The training data format impacts the response of an LLM, so this needs to be chosen wisely.

In our example, we will use the following data format for the training data, which is called the conversation format (JSONL).

{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

We can find the basic structure of the training format in the blog post How to Fine-Tune LLMs in 2024 with Hugging Face by Phil Schmid.

In the JSON below, we see an array of messages that contains key-value pairs of “role” and “content”.

The role “system” marks essential prompt content for the model. When the role has the value “user”, the “content” is a question a user can ask. The role “assistant” marks the answer content of the model.

So, this example training data set represents a single turn flow because we have only one question and one response.

{"messages": [{"role": "system", "content": "You are a weather data expert."}, {"role": "user", "content": "Can you please give me all your weather forecast data?"}, {"role": "assistant", "content": "```json{\"tool_name\": \"final answer\",\"sql\": \"SELECT * FROM WEATHER_DATA\"}```"}]}

A multi-turn flow looks like the JSON below. It is essential to notice that the length of the flow is variable. The conversation between the user and the assistant can have from one to X turns.

The multi-turn flow below has two turns and ends with a refinement request from the model; in this example, the user does not respond to that request.

{"messages": [{"role": "system", "content": "You are a weather data expert."}, {"role": "user", "content": "Can you please give me all your weather forecast data?"}, {"role": "assistant", "content": "```json{\"tool_name\": \"final answer\",\"sql\": \"SELECT * FROM WEATHER_DATA\"}```"}, {"role": "user", "content": "Who has won the soccer world cup in 1954?"}, {"role": "assistant", "content": "```json{\"tool_name\": \"refiner\",\"input\": \"I can only help you with weather data. Can you please rephrase your question?\"}```"}]}
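Records like the ones above are easier to produce with a small helper than by hand, because the nested quotes have to be escaped correctly. Here is a minimal sketch, assuming the system prompt and the ```json wrapping shown in this post; the function name `to_training_record` is hypothetical.

```python
import json

SYSTEM_PROMPT = "You are a weather data expert."


def to_training_record(turns):
    """Build one conversation-format record from (question, answer_dict) pairs."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for question, answer in turns:
        fenced = "```json" + json.dumps(answer) + "```"
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": fenced})
    return {"messages": messages}


record = to_training_record([
    ("Can you please give me all your weather forecast data?",
     {"tool_name": "final answer", "sql": "SELECT * FROM WEATHER_DATA"}),
    ("Who has won the soccer world cup in 1954?",
     {"tool_name": "refiner",
      "input": "I can only help you with weather data. Can you please rephrase your question?"}),
])

# One JSON object per line is what makes the file JSONL.
jsonl_line = json.dumps(record)
print(jsonl_line)
```

Writing each `jsonl_line` followed by a newline into a file gives a training file in the conversation format.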

We can use the same format for the test data, but there we only use the most critical data of the User and Assistant.

In the table below we have two flows: one with a single-turn and one with two turns.

Flow number	User	Assistant	User	Assistant
1	Can you please give me all your weather forecast data?	`SELECT * FROM WEATHER_DATA`
2	Can you please give me all your weather forecast data?	`SELECT * FROM WEATHER_DATA`	Who has won the soccer world cup in 1954?	`I can only help you with weather data. Can you please rephrase your question?`
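The rows of such a test table can be derived from the training format. The following sketch assumes the record layout used in this post and keeps only the user questions and the essential assistant payload (the SQL or the refinement text). The helper name `to_test_row` is an assumption for illustration.

```python
import json
import re


def to_test_row(record):
    """Return alternating user/assistant strings without system prompt and JSON wrapper."""
    row = []
    for message in record["messages"]:
        if message["role"] == "user":
            row.append(message["content"])
        elif message["role"] == "assistant":
            # Strip the leading ```json and trailing ``` before parsing.
            payload = json.loads(re.sub(r"^```json|```$", "", message["content"]))
            row.append(payload.get("sql") or payload.get("input"))
    return row


example_record = {"messages": [
    {"role": "system", "content": "You are a weather data expert."},
    {"role": "user", "content": "Can you please give me all your weather forecast data?"},
    {"role": "assistant",
     "content": '```json{"tool_name": "final answer","sql": "SELECT * FROM WEATHER_DATA"}```'},
]}

test_row = to_test_row(example_record)
print(test_row)
```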

10. Preparing the train/test data using synthetic data generation if we don’t have enough data

The amount of training data counts, so we need a minimum amount of data when we fine-tune our LLM.

There can be situations where we don’t have the minimum of 1000 data sets for fine-tuning, but we can use LLMs to generate synthetic data. It is fantastic how far this generation can go. However, we must keep in mind that we need to be able to validate the correctness of the generated synthetic data.

The blog post Generating Synthetic Data with Large Language Models by Niklas Heidloff can be helpful in this context. It is also possible to use watsonx.ai to create synthetic data; here is the link to the documentation: Creating Synthetic data
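A first, automatable part of that validation is a structural check of every generated line. The sketch below assumes the conversation format and the two tool names used in this post (`final answer` and `refiner`); the function name `validate_record` is hypothetical. It only checks the structure, not whether the SQL is semantically correct. It uses `str.removeprefix`, so it needs Python 3.9 or newer.

```python
import json

ALLOWED_TOOLS = {"final answer", "refiner"}  # tool names used in this post


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty list means it passed)."""
    problems = []
    try:
        messages = json.loads(line)["messages"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return ["line is not a JSON object with a 'messages' key"]
    roles = [m.get("role") for m in messages]
    if not roles or roles[0] != "system":
        problems.append("first message must have the role 'system'")
    expected = ["user", "assistant"] * ((len(roles) - 1) // 2)
    if len(roles) < 3 or roles[1:] != expected:
        problems.append("user and assistant messages must alternate")
    for m in messages:
        if m.get("role") == "assistant":
            body = m.get("content", "").removeprefix("```json").removesuffix("```")
            try:
                if json.loads(body).get("tool_name") not in ALLOWED_TOOLS:
                    problems.append("unknown tool_name")
            except (json.JSONDecodeError, AttributeError):
                problems.append("assistant content is not a valid JSON object")
    return problems


good_line = json.dumps({"messages": [
    {"role": "system", "content": "You are a weather data expert."},
    {"role": "user", "content": "Can you please give me all your weather forecast data?"},
    {"role": "assistant",
     "content": '```json{"tool_name": "final answer","sql": "SELECT * FROM WEATHER_DATA"}```'},
]})
bad_line = good_line.replace("```json", "").replace("```", "").replace("final answer", "other")

print(validate_record(good_line))
print(validate_record(bad_line))
```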

11. Preparing and running the setup of a Virtual Server Instance on IBM Cloud with GPUs

During the initial training and testing of the model, we usually need to be very flexible in changing dataset configuration until the model response is as we expect.

If our local machine does not have enough power for training and running a fine-tuned LLM, we can use, for example, a Virtual Server Instance on IBM Cloud with GPUs, and later run our model there with Text Generation Inference (TGI) from Hugging Face.

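Two blog posts by Niklas Heidloff that are listed in the Resources section could be helpful for these tasks: Fine-tuning LLMs via Hugging Face on IBM Cloud and Deploying LLMs via Hugging Face on IBM Cloud.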

12. Selecting the Libraries we want to use for the training and run the training

We select the Supervised Fine-tuning Trainer (SFTTrainer) from Hugging Face TRL for the model training and the datasets library to load the training data and select the training split. We reuse most of the fine-tuning implementation from the blog posts How to Fine-Tune LLMs in 2024 with Hugging Face by Phil Schmid and Fine-tuning LLMs via Hugging Face on IBM Cloud by Niklas Heidloff.

The Supervised Fine-tuning Trainer is widely used in the context of fine-tuning models.

12.1. Prepare the fine-tuning by the installation of the needed libraries

Some descriptions of the installed libraries on Hugging Face:

TRL (Transformer Reinforcement Learning): Supervised fine-tuning (SFT for short) is a crucial step in RLHF, a methodology for integrating human data labels into an RL-based optimization process. TRL from Hugging Face provides an easy-to-use API to create SFT models and train them on a given dataset with a few lines of code.
Transformers “provides APIs to quickly download and use pre-trained models on a given text, fine-tune them on custom datasets, and share the model with the community on the Hugging Face model hub. At the same time, each Python module defining an architecture is fully standalone and can be modified to enable quick research experiments.”
datasets, “provide one-liners to download and pre-process any of the number of datasets major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the HuggingFace Datasets Hub.”
PEFT (Parameter-Efficient Fine-Tuning) “is a library for efficiently adapting large pre-trained models to various downstream applications without fine-tuning all of a model’s parameters because it is prohibitively costly. PEFT methods only fine-tune a small number of (extra) model parameters – significantly decreasing computational and storage costs – while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware.”

python3 -m pip install "torch==2.1.2"
python3 -m pip install --upgrade "transformers==4.36.2" "datasets==2.16.1" "accelerate==0.26.1" "evaluate==0.4.1" "bitsandbytes==0.42.0"
python3 -m pip install git+https://github.com/huggingface/trl@a3c5b7178ac4f65569975efadc97db2f3749c65e --upgrade
python3 -m pip install git+https://github.com/huggingface/peft@4a1559582281fc3c9283892caea8ccef1d6f5a4f --upgrade

12.2. Implement the fine-tuning

Here is example code based on the content of the two blog posts How to Fine-Tune LLMs in 2024 with Hugging Face by Phil Schmid and Fine-tuning LLMs via Hugging Face on IBM Cloud by Niklas Heidloff.

Note: To fine-tune IBM foundation models, you can use the Tuning Studio in watsonx. Here is a link to the IBM Documentation for the Tuning Studio.

The code does the following steps:

Load the training data and use only the train split
- Remember that we may need to modify the training data input using the train_dataset.map function.
Configure the bitsandbytes quantization
Load the pre-trained model
Apply the chat format to the model and tokenizer using the setup_chat_format function from TRL (Transformer Reinforcement Learning)
Prepare the supervised fine-tuning (SFT) trainer
Train the model
Save the model and tokenizer
Clean up

import argparse
import torch
import re

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import setup_chat_format
from datasets import load_dataset
from peft import LoraConfig 
from trl import SFTTrainer

def format_data(example):
    # Rewrite the content of every assistant message in one conversation.
    # The regular expression below is only a placeholder and depends on your data.
    messages = example['messages']
    for message in messages:
        if str(message['role']) == "assistant":
            message['content'] = re.sub('X', 'XXX', str(message['content']))
    example['messages'] = messages
    print(f"Conversation after formatting: {example}")
    return example

def main(args):
    # Base model id
    model_id = "mistralai/Mistral-7B-Instruct-v0.2"

    # Finetuned model id
    output_directory="/output/"
    peft_model_id=output_directory+"model"
    
    # Training data
    train_data_file="/synthetic_data/synthetic_data_generated.jsonl"

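    # Load training data and split
    # Note: field='messages' expects one JSON document whose top-level "messages" key holds the list of records.
    # For a plain JSONL file with one {"messages": [...]} object per line, omit the field argument.
    train_dataset = load_dataset("json", data_files=train_data_file, field='messages', split="train")
    train_dataset = train_dataset.map(format_data, batched=False)
    torch.utils.checkpoint.use_reentrant=True # Added by Thomas

    # Configure the bitsandbytes quantization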
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, 
        bnb_4bit_use_double_quant=True, 
        bnb_4bit_quant_type="nf4", 
        bnb_4bit_compute_dtype=torch.float16  # Change from Niklas / Different from Phil Schmid's blog post
    )

    # Load the pre-trained model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.float16, # Change from Niklas / Different from Phil Schmid's blog post
        quantization_config=bnb_config
    )
    model.config.use_cache = False # Added by Thomas

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.padding_side = 'right'

    # Use for the training the chat format
    model, tokenizer = setup_chat_format(model, tokenizer)

    peft_config = LoraConfig(
            lora_alpha=128, 
            lora_dropout=0.05,
            r=128, # Change from Niklas / Different from Phil Schmid's blog post
            bias="none",
            target_modules="all-linear",
            task_type="CAUSAL_LM"
    )

    args = TrainingArguments(
        output_dir=output_directory+"checkpoints", # The output directory where the model predictions and checkpoints will be written.
        logging_dir=output_directory+"logs", # Tensorboard log directory. Will default to runs/**CURRENT_DATETIME_HOSTNAME**.
        logging_strategy="steps",
        logging_steps=250,
        evaluation_strategy="steps", # Added by Thomas
        eval_steps=1000, # Added by Thomas
        save_steps=1000, # Number of updates steps before two checkpoint saves.
        num_train_epochs=12, # Total number of training epochs to perform.          
        per_device_train_batch_size=3, # The batch size per GPU/TPU core/CPU for training.
        gradient_accumulation_steps=2, # Number of updates steps to accumulate the gradients for, before performing a backward/update pass.
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant":False},# Added by Thomas
        optim="adamw_torch_fused",   
        save_strategy="epoch",        
        learning_rate=2e-4, # The initial learning rate for Adam.
        fp16=True, # Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training. Change from Niklas / Different from Phil Schmid's blog post
        max_grad_norm=0.3, # Maximum gradient norm (for gradient clipping).                   
        warmup_ratio=0.03, # Ratio of total training steps used for a linear warmup from 0 to learning_rate.
        lr_scheduler_type="constant",          
        push_to_hub=False,  # Change from Niklas / Different from Phil Schmid's blog post               
        auto_find_batch_size=True # Change from Niklas / Different from Phil Schmid's blog post
    )

    # Supervised fine-tuning (or SFT for short) 
    max_seq_length = 3072 
    trainer = SFTTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        peft_config=peft_config,
        max_seq_length=max_seq_length, # maximum packed length 
        tokenizer=tokenizer,
        packing=True,
        dataset_kwargs={
            "add_special_tokens": False, 
            "append_concat_token": False,
        }
    )

    # Train the model
    trainer.train()

    # Save the model and tokenizer
    trainer.model.save_pretrained(peft_model_id)
    tokenizer.save_pretrained(peft_model_id)

    # Clean-up
    del model
    del trainer
    torch.cuda.empty_cache()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    args = parser.parse_args()
    main(args)

One definition for a term used in the given source code:

Epoch: One complete pass over the entire training dataset. According to the DataScientest article Epoch: An essential notion in real-time programming, a number of around 11 epochs is generally ideal for training on most datasets. Learning optimization is based on the iterative process of gradient descent.
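To relate epochs to the number of optimizer steps shown in the training progress bar, here is a rough estimate under stated assumptions: a single device, a fixed batch size, and a simplified version of how a trainer typically counts update steps. The function name and the example numbers are illustrative. With `packing=True`, `num_sequences` is the number of packed sequences, not the number of raw conversations, and `auto_find_batch_size=True` may change the batch size at runtime.

```python
import math


def estimate_optimizer_steps(num_sequences, batch_size, grad_accum_steps, epochs):
    """Rough estimate of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(num_sequences / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * epochs


# Example: 1000 sequences with the batch settings from the TrainingArguments above.
print(estimate_optimizer_steps(num_sequences=1000, batch_size=3, grad_accum_steps=2, epochs=12))
```

Such an estimate helps to choose values like `logging_steps`, `eval_steps`, and `save_steps` that are actually reached during a run.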

12.3. Run the fine-tuning

python3 finetune.py

Example output for only three epochs

Generating:

Generating train split: 14 examples [00:00, 484.39 examples/s]

Checkpoint:

Loading checkpoint shards:  67%|████████▋    | 2/3 [03:21<01:40, 101.00s/it]

Final result:

...
{'train_runtime': 168.1095, 'train_samples_per_second': 0.214, 'train_steps_per_second': 0.036, 'train_loss': 0.9815422693888346, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████| 6/6 [02:48<00:00, 28.02s/it]
...

After the model is fine-tuned, we find a new folder in the output directory that contains the fine-tuned model.

Remember, the model name we defined in the source code before was model:

peft_model_id=output_directory+"model"

13. Run the fine-tuned model on a Text Generation Inference (TGI) server

We already set up Text Generation Inference (TGI) in “11. Preparing and running the setup of a Virtual Server Instance on IBM Cloud with GPUs”.

The following code shows an example bash automation to start the Text Generation Inference (TGI) server.

#!/bin/bash
export HOME_PATH=$(pwd)

export TGI_VOLUME=${HOME_PATH}/output # path to the fine tuned model
export MODEL=/data/model

export TAG=latest
docker container rm tgi_server
docker run -it --name tgi_server \
           --gpus all \
           --shm-size 1g \
           -p 8080:80 \
           -e MAX_INPUT_LENGTH=2000 \
           -v ${TGI_VOLUME}:/data \
           ghcr.io/huggingface/text-generation-inference:${TAG} \
           --model-id $MODEL

Example output on starting the TGI server:

2024-XX-XXTXX:XX:05.902471Z  INFO text_generation_launcher: Args { model_id: "/data/model", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 2000, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, enable_cuda_graphs: false, hostname: "43594b3833aa", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }

2024-XX-XXTXX:XX:05.902611Z  INFO download: text_generation_launcher: Starting download process.

2024-XX-XXTXX:XX:08.234407Z  INFO text_generation_launcher: Trying to load a Peft model. It might take a while without feedback
2024-XX-XXTXX:XX:45.290393Z  INFO text_generation_launcher: Peft model detected.
2024-XX-XXTXX:XX:45.290442Z  INFO text_generation_launcher: Merging the lora weights.

2024-XX-XXTXX:XX:55.359262Z  INFO text_generation_launcher: Saving the newly created merged model to /data/model
2024-XX-XXTXX:XX:23.753381Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-XX-XXTXX:XX:23.753637Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-XX-XXTXX:XX:30.399408Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-XX-XXTXX:XX:30.458966Z  INFO shard-manager: text_generation_launcher: Shard ready in 6.704379789s rank=0
2024-XX-XXTXX:XX:30.557558Z  INFO text_generation_launcher: Starting Webserver
2024-XX-XXTXX:XX:30.770246Z  INFO text_generation_router: router/src/main.rs:237: Using local tokenizer config
2024-XX-XXTXX:XX:30.775596Z  WARN text_generation_router: router/src/main.rs:272: no pipeline tag found for model /data/model-example
2024-XX-XXTXX:XX:30.793831Z  INFO text_generation_router: router/src/main.rs:291: Warming up model
2024-XX-XXTXX:XX:32.244560Z  INFO text_generation_router: router/src/main.rs:328: Setting max batch total tokens to 227968
2024-XX-XXTXX:XX:32.244587Z  INFO text_generation_router: router/src/main.rs:329: Connected
...

Now, we are ready to test our fine-tuned model.
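A quick first test is a request to the `/generate` endpoint of the TGI REST API. The following is a minimal sketch using only the Python standard library. It assumes the server runs on the same machine and is reachable on host port 8080, as mapped in the docker command above; the helper names and the `max_new_tokens` value are illustrative assumptions. Note that the input tokens plus `max_new_tokens` must fit into the `max_total_tokens` value shown in the startup log.

```python
import json
import urllib.request

TGI_URL = "http://localhost:8080/generate"  # host port from the docker run command


def build_generate_request(prompt, max_new_tokens=200):
    """Create the HTTP request for the /generate endpoint without sending it."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        TGI_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt to the running TGI server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=60) as response:
        return json.loads(response.read())["generated_text"]


request = build_generate_request("Can you please give me all your weather forecast data?")
print(request.full_url, request.get_method())
# With the server running, call: print(generate("Can you please give me all your weather forecast data?"))
```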

14. Implementing or using existing evaluation/testing frameworks to test/evaluate the fine-tuned model

When we test the newly fine-tuned model in a multi-turn way, we must save the conversation data between the model and the user.

For that, we can use LangChain, which supports long-term memory using persistent storage, or implement our own testing framework with short-term memory to handle multi-turn conversations, where we remember previous inputs, prompts, and context to generate the next response.

The following extract from example source code builds the next prompt based on the previous input.

prompt = prompt_history + generate_prompt_from_template( prompt_template_2, prompt_question_template, question )
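The idea behind this line can be sketched as a small short-term memory class. This is not the implementation behind `generate_prompt_from_template`; it is a minimal sketch that assumes the ChatML-style markers (`<|im_start|>` and `<|im_end|>`) that `setup_chat_format` applies by default, and the class and function names are hypothetical.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def format_turn(role, content):
    """Render one message in the assumed ChatML-style layout."""
    return f"{IM_START}{role}\n{content}{IM_END}\n"


class ShortTermMemory:
    """Keeps the conversation so far and renders the next prompt."""

    def __init__(self, system_prompt):
        self.history = format_turn("system", system_prompt)

    def next_prompt(self, question):
        # Previous turns + the new question + an open assistant turn to complete.
        return self.history + format_turn("user", question) + f"{IM_START}assistant\n"

    def remember(self, question, answer):
        self.history += format_turn("user", question) + format_turn("assistant", answer)


memory = ShortTermMemory("You are a weather data expert.")
first_question = "Can you please give me all your weather forecast data?"
first_prompt = memory.next_prompt(first_question)
# In a real test, the answer would come from the model; here it is a fixed string.
memory.remember(first_question, '{"tool_name": "final answer","sql": "SELECT * FROM WEATHER_DATA"}')
second_prompt = memory.next_prompt("Who has won the soccer world cup in 1954?")
print(second_prompt)
```

Because the whole history is sent with every request, the prompt grows with each turn and must stay below the maximum input length configured for the server.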

There are some additional evaluation frameworks in the blog post “Open-Source LLM Evaluation Frameworks in 2024”, and for tests related to Text-to-SQL the Defog eval framework may be helpful.

15. Define the metrics you want to use to display your evaluation results

Here are some potential metrics that can be useful: accuracy, latency, and grammar; there are many more. We can implement these metrics ourselves or use products for that. A good approach could be using watsonx.governance; it covers the following topics listed in its description:

Govern generative AI (gen AI) and machine learning (ML) models from any vendor, including IBM® watsonx.ai™, Amazon Sagemaker and Bedrock, Google Vertex and Microsoft Azure.
Evaluate and monitor for model health, accuracy, drift, bias, and gen AI quality.
Access robust governance, risk, and compliance capabilities featuring workflows with approvals, customizable dashboards, risk scorecards, and reports.
Use factsheet capabilities to collect and document model metadata automatically across the AI model lifecycle.
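If we implement the accuracy and latency metrics mentioned above ourselves, a starting point could look like the following minimal sketch. It assumes that accuracy means an exact match of the generated SQL after simple normalization, which is only one possible definition; the function names and the `REGION` column in the example are made up for illustration. Note that lower-casing also changes string literals inside the SQL, so a stricter comparison may be needed for real data.

```python
import time


def normalize_sql(sql):
    """Lower-case and collapse whitespace so formatting differences do not count as errors."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def exact_match_accuracy(expected, predicted):
    """Share of predictions that equal the expected SQL after normalization."""
    matches = sum(normalize_sql(e) == normalize_sql(p) for e, p in zip(expected, predicted))
    return matches / len(expected)


def timed(func, *args):
    """Return the function result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


expected_sql = ["SELECT * FROM WEATHER_DATA", "SELECT * FROM WEATHER_DATA WHERE REGION = 'US-south'"]
predicted_sql = ["select *  from weather_data;", "SELECT * FROM WEATHER_DATA"]

accuracy = exact_match_accuracy(expected_sql, predicted_sql)
_, latency_seconds = timed(normalize_sql, expected_sql[0])
print(f"accuracy={accuracy:.2f} latency={latency_seconds:.6f}s")
```

In a real evaluation, `timed` would wrap the call to the model endpoint instead of a local function.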

16. Running our own fine-tuned model in a robust enterprise environment using watsonx.ai on-premise in a Cloud Pak for Data instance in a Virtual Private Cloud on AWS or on IBM Cloud

watsonx provides a wide range of essential capabilities and is available on several platforms for enterprise use. The following list is an extract of the main topics you can find on the official web page.

Open: Based on open technologies that provide a variety of models to cover enterprise use cases and compliance requirements.
Targeted: Targeted to specific enterprise domains like HR, customer service or IT operations to unlock new value.
Trusted: Designed with principles of transparency, responsibility and governance so you can manage legal, regulatory, ethical and accuracy concerns.
Empowering: Go beyond being an AI user and become an AI value creator, owning the value your models create.

We can deploy our custom fine-tuned model on-premise by following the instructions in the IBM Cloud documentation.

17. Summary

As you noticed in this blog post, there are a lot of topics you need to take care of from the initial idea until a fine-tuned model gets into production.

We followed the list of topics related to fine-tuning a model.

I know I didn’t take a deep dive into all the topics, but in this blog post I wanted to ensure that I have at a minimum one sentence for each topic or that I provide a potential entry point you can look into.

LLMs will change our lives and society in the future, and it is fantastic when you are a part of the journey to change lives by changing the business and not being changed.

I hope this was useful to you, and let’s see what’s next!

Greetings,

Thomas

#fine-tune, #llm, #multi-turn, #watsonx, #mistral, #ai
