InstructLab and Taxonomy tree: LLM Foundation Model Fine-tuning Guide | Musician Example

I wrote the blog post “Fine-tune LLM foundation models with the InstructLab an Open-Source project introduced by IBM and Red Hat”, where I introduced InstructLab and how the setup works.

Now, it is time to fine-tune the model models/merlinite-7b-lab-Q4_K_M.gguf using InstructLab. The data for the fine-tuning will be saved in the Taxonomy repository folder, and we will use the InstructLab CLI to train, test, and run the model.

As you may know, I like to compose music and play the drums as “TNM – Drummer”. So, I tested the model models/merlinite-7b-lab-Q4_K_M.gguf with the question “Who is TNM – Drummer?”, and it provided me with the wrong answer: “Thomas Lang, better known by his stage name TNM – Drummer, is a skilled drummer who is famous for his work with the progressive rock band Dream Theater.”

The generated answer from the LLM was totally wrong 😉 I am not Thomas Lang, and I am not one of the best drummers in the world. Yes, I would like to be, but I am not.

Below you can see the chat I did:

ilab model chat         
╭──────────────────────────────────────────────────────────────────────────────── system ─────────────────────────────────────────────────────────────────────────────────╮
│ Welcome to InstructLab Chat w/ MODELS/MERLINITE-7B-LAB-Q4_K_M.GGUF (type /h for help)                                                                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Who is "TNM - Drummer"?                                                                                                                                    [S][default]
╭────────────────────────────────────────────────────────────────── models/merlinite-7b-lab-Q4_K_M.gguf ──────────────────────────────────────────────────────────────────╮
│ Thomas Lang, better known by his stage name TNM - Drummer, is a skilled drummer who is famous for his work with the progressive rock band Dream Theater. He is          │
│ recognized for his intricate drumming style and impressive technique.                                                                                                   │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── elapsed 1.964 seconds ─╯

Note: Here is an updated version made in January 2025: InstructLab Fine-Tuning Guide: Updates and Insights for the Musician Example

The main structure of this blog post

The main structure of the current blog post follows the last blog post, which was based on the YouTube video “InstructLab Demo: Lowering the barrier to AI model development”. I will follow the steps from that blog post for the fine-tuning.

This is the structure we will follow along during my “TNM – Drummer” example.

  1. Prepare data in the Taxonomy tree
  2. Validate data in the Taxonomy tree
  3. Generate synthetic data
  4. Train model
  5. Test model
  6. Convert model
  7. Serve model
  8. Contribute changes
  9. Summary

To follow the steps in this blog post with me, you first need to follow the steps in the post “Fine-tune LLM foundation models with the InstructLab an Open-Source project introduced by IBM and Red Hat” to prepare your local machine.

Here is an extract of the InstructLab configuration from the last blog post, saved in config.yaml. As you can see in the configuration file, the model is merlinite-7b-lab-Q4_K_M.gguf.

chat:
 context: default
 ...
 model: models/merlinite-7b-lab-Q4_K_M.gguf
 ...
general:
 ...
generate:
 ...
 model: models/merlinite-7b-lab-Q4_K_M.gguf
 ...
serve:
 ...
 model_path: models/merlinite-7b-lab-Q4_K_M.gguf

Here is a 39-minute YouTube video related to the fine-tuning of a model on a local machine with the InstructLab CLI.

1. Prepare data in the Taxonomy tree

In the InstructLab folder, we downloaded the Taxonomy repository. That repository contains the current training data, which we will now enhance.

Step 1: Identify where the changes or additional data would fit in the taxonomy tree

In our situation, we will add the folders artists/trademarks to the taxonomy/knowledge/music folder. The knowledge and music structure makes sense, and there wasn’t yet a category like artist or trademark.

cd instructlab
mkdir -p taxonomy/knowledge/music/artists/trademarks

Step 2: Generate a qna.yaml

You must provide at least five question-and-answer pairs as input data; these pairs are used as examples to generate synthetic data. The data can also be self-authored. For details, see DataSources.

The qna.yaml file has the following main entries:

  • version: The value must be a number. Required
  • task_description: A description of the skill. Required
  • created_by: The contributor’s GitHub username. Required
  • seed_examples: A collection of key/value entries. New submissions should have at least five entries, although older files may have fewer. Required
    • context: Grounded skills require the user to provide context containing information that the model is expected to take into account during processing. This is different from knowledge, where the model is expected to gain facts and background knowledge from the tuning process. The context key is optional for freeform skills.
    • question: A question for the model. Required
    • answer: The desired response from the model. Required

Here is an example for my “TNM - Drummer” case; note that we did not add a context here.

version: 1
created_by: thomassuedbroecker
seed_examples:
- answer: '"TNM - Drummer" is a trademark for generating music, video and pictures.'
  question: 'What is "TNM - Drummer"?'
- answer: 'The "TNM - Drummer" trademark represents an artist who plays the drums, composes music, and generates music videos.'
  question: 'Who is "TNM - Drummer"?'
- answer: 'The TNM stands for Thomas Nikolaus Michael. These are the three surnames of the artist who owns the trademark "TNM - Drummer", which stands for creation of music, videos, and pictures.'
  question: 'What does "TNM - Drummer" stand for?'
- answer: '"TNM - Drummer" generates all styles of music and is not limited to a special one.'
  question: 'What kind of music does "TNM - Drummer" create?'
- answer: 'Yes, "TNM - Drummer" uses AI support to create his own music. "TNM - Drummer" uses AI support to generate his music; he also uses loops, or live recorded parts'
  question: 'Does "TNM - Drummer" use AI support to generate his music?'
task_description: 'Questions related to the Artist "TNM - Drummer"'
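Before running the validation in the next section, it can help to sanity-check the structure yourself. The following Python sketch is not part of InstructLab: the function name `validate_qna` is my own, and it assumes the qna.yaml content has already been parsed into a dict (for example with PyYAML’s yaml.safe_load).

```python
# Hypothetical helper (not part of InstructLab): check that a parsed
# qna.yaml dict has the required keys and at least five seed examples.

def validate_qna(qna: dict) -> list[str]:
    """Return a list of problems found; an empty list means the structure looks OK."""
    problems = []
    for key in ("version", "task_description", "created_by", "seed_examples"):
        if key not in qna:
            problems.append(f"missing required key: {key}")
    examples = qna.get("seed_examples", [])
    if len(examples) < 5:
        problems.append(f"need at least 5 seed examples, found {len(examples)}")
    for i, example in enumerate(examples):
        for key in ("question", "answer"):
            if key not in example:
                problems.append(f"seed example {i}: missing {key}")
    return problems

# The structure of the qna.yaml above, as a parsed dict.
qna = {
    "version": 1,
    "created_by": "thomassuedbroecker",
    "task_description": 'Questions related to the Artist "TNM - Drummer"',
    "seed_examples": [
        {"question": f"Question {i}?", "answer": f"Answer {i}."} for i in range(5)
    ],
}
print(validate_qna(qna))  # []
```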

Depending on which data area you contribute to, you need to add the following entries to the YAML file. They are necessary for knowledge folder entries and not for compositional_skills folder entries. The document is the source of your knowledge contribution, so it should contain content that applies to the questions and answers you have provided. The GitHub repository includes some content I extracted from the home page, which is officially available.

document:
 repo: https://github.com/thomassuedbroecker/music-artists-trademark-tnm-drummer-knowledge.git
 commit: 8781e91
 patterns:
   - "*.md"
domain: music

Step 3: Generate an attribution.txt

Then, you create an attribution.txt file that includes the sources of your information.

“An essential part of contributing to the InstructLab project is citing your sources of information. This comes in the form of your attribution.txt, which you add to the pull requests. Almost all instances of attribution can be covered by the parameters required for Creative Commons Attribution licenses. Some parameters are as follows:

  • Title of work
  • Link to work
  • Include link to a specific revision where possible
  • License of the work
  • Include an SPDX identifier where possible
  • Creator names
  • Copyright information
  • Modification information
  • Indicate if work was itself derived from another openly licensed work”
Here is my attribution.txt for this example:

Title of work: "TNM - Drummer" definition
Link to work: https://my_work
License of the work: THE LICENSE
Creator names: Thomas Suedbroecker

2. Validate data in the Taxonomy tree

After we insert the files into the Taxonomy tree, we need to verify the correctness of the files and the folder structure.

Step 1: Run diff to verify the structure for the training

cd instructlab
source venv/bin/activate
ilab taxonomy diff
  • Output:
knowledge/music/artists/trademarks/qna.yaml
Taxonomy in taxonomy is valid 🙂

3. Generate synthetic data

In this example, we provided only five examples; this is definitely too few for the training.

Step 1: Generate synthetic data

Now it is time to generate more data. Out of the box, ilab generates 100 data entries in the following JSON format.

 {
        "instruction": "Can you describe the drumming style of \"TNM - Drummer\"?\\n",
        "input": "<noinput>\\n",
        "output": "\"TNM - Drummer\" is known for his innovative drumming style. He often uses unconventional techniques and rhythm patterns to create an intriguing and engaging sound. His drumming can be characterized by a fusion of various styles, including electronic, pop and rock.",
        "taxonomy_path": "knowledge->music->artists->trademarks",
        "task_description": "Questions related to the Artist \"TNM - Drummer\" ",
        "document": [
            "# music-artists-trademark-tnm-drummer-knowledge\nThis is an example of an InstructLab taxonomy resource."
 ]
 },
  • Generate the synthetic data
ilab data generate
  • Output:

As you can see in the output, the message “Cannot find prompt.txt.” appears because no prompt file is configured; you can reference a prompt file name in the config.yaml entry prompt_file: prompts/prompt.txt.

Cannot find prompt.txt. Using default prompt depending on model-family.
  0%|                                                                                                                  | 0/100 [00:00<?, ?it/s]Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-06-21 18:39:55,009 generate_data.py:505: generate_data Selected taxonomy path knowledge->music->artists->trademarks
...
Q> Tell me more about the drumming style of "TNM - Drummer".\n
I> 
A> "TNM - Drummer" is known for his unique and dynamic drumming style, which features a combination of complex rhythms and intricate patterns. His playing often showcases a wide range of techniques, including double bass drums, multiple kick drums, and various cymbal crashes. His style is characterized by its energy and power, as well as its technical precision and musicality
....
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [34:11<00:00, 20.52s/it]
INFO 2024-06-21 18:14:06,562 generate_data.py:609: generate_data 100 instructions generated, 97 discarded due to format (see generated/discarded_merlinite-7b-lab-Q4_K_M_2024-06-21T13_39_55.log), 11 discarded due to rouge score
INFO 2024-06-21 18:14:06,562 generate_data.py:613: generate_data Generation took 2052.35s
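As a side note, the log mentions entries “discarded due to rouge score”, i.e. generated questions that are too similar to existing ones are dropped. Just to illustrate the idea of similarity-based filtering, here is a simplified unigram-overlap score; this is my own sketch, not InstructLab’s actual ROUGE implementation.

```python
# Illustration only: a rough ROUGE-1-style F1 overlap of word unigrams,
# sketching how near-duplicate generated questions could be detected.

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 overlap of lowercase word unigrams between two strings."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two similar questions score well above zero but below an exact match.
score = unigram_f1('Who is "TNM - Drummer"?', 'Tell me who "TNM - Drummer" is.')
print(round(score, 2))
```

A real filter would drop a candidate whose score against any existing entry exceeds some threshold.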

Step 2: Verify the generated data

The generated data is split into train and test data.

Now we have the chance to verify the data, but for now, this is not in scope ;-).

tree -L 1 ./generated
./generated
├── discarded_merlinite-7b-lab-Q4_K_M_2024-06-21T13_39_55.log
├── generated_merlinite-7b-lab-Q4_K_M_2024-06-21T13_39_55.json
├── test_merlinite-7b-lab-Q4_K_M_2024-06-21T13_39_55.jsonl
└── train_merlinite-7b-lab-Q4_K_M_2024-06-21T13_39_55.jsonl
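To get a quick feel for the size of these files, a few lines of Python are enough to count the JSONL entries. The sketch below is self-contained (it writes a small sample file); in practice you would point `count_jsonl_entries` at the train_*.jsonl or test_*.jsonl file from your own run.

```python
# Count the entries in a JSONL file (one JSON object per line),
# such as the generated train/test files above.
import json
import tempfile

def count_jsonl_entries(path: str) -> int:
    """Count non-empty lines, validating that each parses as JSON."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises ValueError on malformed lines
                count += 1
    return count

# Demo with two sample entries in the same shape as the generated data.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"instruction": "Who is \\"TNM - Drummer\\"?", "output": "An artist."}\n')
    f.write('{"instruction": "What does TNM stand for?", "output": "Thomas Nikolaus Michael."}\n')
    sample = f.name

print(count_jsonl_entries(sample))  # 2
```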

4. Train model

Step 1: Train the model

In this step, we train the model on our local machine.
Note: The fine-tuning for the LLM is realized with “low-rank adaptation (LoRA)” (Link 25.06.2024).

ilab model train
  • Output:
[INFO] Loading
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████| 655/655 [00:00<00:00, 10.7MB/s]
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 119/119 [00:00<00:00, 1.59MB/s]
...
model-00003-of-00003.safetensors:  22%|███████████████                                                    | 1.02G/4.54G [05:14<14:43, 3.99MB/s]
model-00001-of-00003.safetensors:  19%|█████████████▏                                                      | 954M/4.94G [05:12<20:40, 3.22MB/s]
...
Epoch 1: Iter 20: Val loss 1.124, Val took 15.492s
Iter 20: Saved adapter weights to instructlab-merlinite-7b-lab-mlx-q/adapters-020.npz.
Iter 030: Train loss 0.986, It/sec 0.202, Tokens/sec 125.626
Epoch 2: Iter 30: Val loss 0.890, Val took 15.864s
Iter 30: Saved adapter weights to instructlab-merlinite-7b-lab-mlx-q/adapters-030.npz.

Step 2: Verify folders for the created model

Now we see the folders instructlab-merlinite-7b-lab and instructlab-merlinite-7b-lab-mlx-q.

  • instructlab-merlinite-7b-lab
tree -L 1 ./instructlab-merlinite-7b-lab
  • Output:
./instructlab-merlinite-7b-lab
├── added_tokens.json
├── config.json
├── generation_config.json
├── model-00001-of-00003.safetensors
...
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
  • instructlab-merlinite-7b-lab-mlx-q
tree -L 1 ./instructlab-merlinite-7b-lab-mlx-q
  • Output:
./instructlab-merlinite-7b-lab-mlx-q
├── adapters-010.npz
├── adapters-020.npz
...
├── adapters.npz
├── added_tokens.json
├── config.json
├── model.safetensors
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json

5. Test model

Step 1: Test the model

The generated synthetic data is split into train and test data, as we can also see in the taxonomy_data folder.

tree -L 1 ./taxonomy_data
./taxonomy_data
├── test.jsonl
├── test_gen.jsonl
├── train.jsonl
├── train_gen.jsonl
└── valid.jsonl

Now, let us start a model test.

ilab model test
  • Output:

We see the tests during the test run and can verify in the terminal whether the information is valid. At the moment, we skip this verification; we just move on with the example.

system prompt: You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.
[1]
 user prompt: What is "TNM - Drummer"?\n 
expected output: "TNM - Drummer" is a trademark for generating music, video and pictures.\n
-----model output BEFORE training----:
Loading pretrained model
Using model_type='mistral'
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
LoRA init skipped
Total parameters 1242.375M
Trainable parameters 0.000M
Loading datasets
LoRA loading skipped
Generating
==========
"TNM - Drummer" is an abbreviation that represents the TNM (Tempo, Nuance, and Motion) notation system commonly used in Western music to describe the dynamics and style of drum performances. The TNM notation provides a detailed guide for drumming techniques, ensuring that the drummer can accurately replicate the intended sound and feel of the composition. Here's an explanation of the components of the TNM notation:
- **T
==========
-----model output AFTER training----:
Loading pretrained model
Using model_type='mistral'
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Total parameters 1244.079M
Trainable parameters 1.704M
Loading datasets
Generating
==========
"TNM - Drummer" is a new artist who has recently been gaining popularity in the music industry. He is known for his unique drumming style and his ability to create innovative rhythms that are both captivating and memorable. His music has been featured in various films and television shows, and he has also collaborated with several well-known artists.
==========
[2]
 user prompt: Who is "TNM - Drummer"?\n 
expected output: "TNM - Drummer" repesents an artist who plays the drums, composes music, generates music video.\n 

6. Convert model

After the test and hopefully the validation of the initial test results, we are ready to convert the model so that it can run on our local machine.

Step 1: Convert the model so that ilab can serve it on the local machine

Now we are going to create a trained model, instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab.gguf, in GGUF format.

ilab model convert
  • Output:
Loading pretrained model
Using model_type='mistral'
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
...
[INFO] Loading
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
dtype=<class 'numpy.float16'>
...
llama_model_quantize_internal: model size  = 13813.14 MB
llama_model_quantize_internal: quant size  =  4165.41 MB
INFO 2024-06-21 15:36:07,157 convert.py:112: convert deleting instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab.gguf...

Step 2: Verify the converted model location

In the output, we will see the generated instructlab-merlinite-7b-lab-Q4_K_M.gguf file.

tree -L 1 ./instructlab-merlinite-7b-lab-trained
  • Output:
./instructlab-merlinite-7b-lab-trained
├── added_tokens.json
├── config.json
├── instructlab-merlinite-7b-lab-Q4_K_M.gguf
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json

7. Serve model

Now we will serve the model, access it using the REST API, and then chat with it directly.

Step 1: Serve the new fine-tuned model

ilab model serve --model-path ./instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf
  • Output:
INFO 2024-06-21 19:46:05,160 serve.py:51: serve Using model './instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf' with -1 gpu-layers and 4096 max context size.
INFO 2024-06-21 19:46:08,295 server.py:218: server Starting server process, press CTRL+C to shutdown server...
INFO 2024-06-21 19:46:08,295 server.py:219: server After application startup complete see http://127.0.0.1:8000/docs for API.

Step 2: Open a browser and enter the URL http://127.0.0.1:8000/docs

Step 3: Using curl to interact with the served model

Enter the following curl command:

curl -X 'POST' \
  'http://127.0.0.1:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
  "stop": [
    "\n",
    "###"
  ]
}'
  • Output
{"id":"cmpl-3e4d9c5e-e0e9-4fb1-b9b4-b79093fa0106","object":"text_completion","created":1720678017,"model":"./instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf","choices":[{"text":"Paris","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":21,"completion_tokens":3,"total_tokens":24}}
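The served endpoint follows the OpenAI-style completions schema, so the answer can also be extracted from the JSON response with a few lines of Python. The response below is the (reformatted) reply from the curl call above; the helper name `extract_completion` is my own.

```python
import json

# The completions response returned by the curl call above.
response = json.loads('''
{"id": "cmpl-3e4d9c5e-e0e9-4fb1-b9b4-b79093fa0106",
 "object": "text_completion",
 "created": 1720678017,
 "model": "./instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf",
 "choices": [{"text": "Paris", "index": 0, "logprobs": null, "finish_reason": "stop"}],
 "usage": {"prompt_tokens": 21, "completion_tokens": 3, "total_tokens": 24}}
''')

def extract_completion(resp: dict) -> str:
    """Return the text of the first completion choice."""
    return resp["choices"][0]["text"]

print(extract_completion(response))  # Paris
```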

Step 4: Open a new terminal and chat with the model

We will not use the REST API; we will directly chat with the model.

ilab model chat -m ./instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf
  • Output:
You are using an aliased command, this will be deprecated in a future release. Please consider using `ilab model chat` instead
╭────────────────────────────────────────────────────────────────────── system ───────────────────────────────────────────────────────────────────────╮
│ Welcome to InstructLab Chat w/ ./INSTRUCTLAB-MERLINITE-7B-LAB-TRAINED/INSTRUCTLAB-MERLINITE-7B-LAB-Q4_K_M.GGUF (type /h for help)                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
>>> Who is TNM - Drummer?                                                                                                                  [S][default]
╭────────────────────────────────── ./instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf ──────────────────────────────────╮
│ TNM - Drummer is an artist who specializes in drumming. He has gained recognition for his skills in drumming and has collaborated with other        │
│ musicians on various projects.                                                                                                                      │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── elapsed 1.447 seconds ─╯
>>> 

This result is okay, but it’s crucial that we take time to verify the generated synthetic data to ensure its accuracy and reliability.
Following this result, I verified some entries of the generated synthetic data. I noticed an incorrect entry, which shows that we need to pay attention to the correctness of the data before proceeding with the model training. After all, I am not “Tony Nguyen” ;-).
Here is an extract of the generated synthetic data in JSON format.

        "instruction": "What can you tell me about \"TNM - Drummer's\" early career and upbringing?\\n",
        "input": "<noinput>\\n",
        "output": "TNM - Drummer, whose real name is Tony Nguyen, was born in Saigon, Vietnam. He started playing the drums at a very young age and developed his skills through self-study and practicing extensively. His passion for music led him to perform in local bars and clubs before he gained national recognition.",
        "taxonomy_path": "knowledge->music->artists->trademarks",
        "task_description": "Questions related to the Artist \"TNM - Drummer\" ",
        "document": [
            "# music-artists-trandemark-tnm-drummer-knowledge\nThis is an example for an InstructLab taxonomy resource."

8. Contribute changes

I will not contribute in this situation, but if the model responds as expected, we could create a GitHub pull request for the Taxonomy repository to include the change for “TNM – Drummer”. How the contribution works in detail is described in the contributing documentation.

9. Summary

InstructLab is impressive, and the idea of improving foundation models by contributing only trusted data is fantastic. Data that contributors provide and that is proven will later go through the real fine-tuning in a professional environment, and the updated model will then be available on Hugging Face.


Trust in AI is mainly built on trusted, legal, and verified training data, so the licenses must be correct: you get what you provided. In return for the data, free compute power updates the foundation model with trusted and open data that everyone can verify. This is absolute transparency of the given data and, from my perspective, a good approach for the training data of future trusted foundation models.


By the way, the InstructLab CLI works very well on my machine. You may test it independently and then feel free to contribute.

Here is an updated version made in January 2025: InstructLab Fine-Tuning Guide: Updates and Insights for the Musician Example


I hope this was useful to you. Let’s see what’s next!

Greetings,

Thomas

#finetuning, #llm, #instructlab, #ai, #opensource, #redhat, #ibm

6 thoughts on “InstructLab and Taxonomy tree: LLM Foundation Model Fine-tuning Guide | Musician Example”


  1. I am facing this error, need help!

    (instructlab) ai@FOOBOO:~/.local/share/instructlab/datasets$ ilab model train --pipeline simple --gpus 1
    INFO 2025-02-07 11:55:08,246 numexpr.utils:162: NumExpr defaulting to 16 threads.
    INFO 2025-02-07 11:55:08,608 datasets:59: PyTorch version 2.5.1+cu121 available.
    LINUX_TRAIN.PY: NUM EPOCHS IS: 10
    LINUX_TRAIN.PY: TRAIN FILE IS: /home/ai/.local/share/instructlab/datasets/train_gen.jsonl
    LINUX_TRAIN.PY: TEST FILE IS: /home/ai/.local/share/instructlab/datasets/test_gen.jsonl
    LINUX_TRAIN.PY: Using device ‘cpu’
    LINUX_TRAIN.PY: LOADING DATASETS
    Generating train split: 0 examples [00:00, ? examples/s]
    An error occurred while generating the dataset

    (instructlab) ai@FOOBOO:~/.local/share/instructlab/datasets$ tree -h /home/ai/.local/share/instructlab/datasets -P "*.jsonl" --prune
    [4.0K] /home/ai/.local/share/instructlab/datasets
    ├── [4.0K] 2025-02-07_034752
    │   ├── [ 0] messages_2025-02-07T03_48_00.jsonl
    │   ├── [4.0K] node_datasets_2025-02-07T03_48_00
    │   │   └── [ 53K] mmlubench_knowledge_arts_music_fandom_swifties.jsonl
    │   ├── [4.0K] preprocessed_2025-02-07T03_48_00
    │   │   └── [269K] knowledge_arts_music_fandom_swifties.jsonl
    │   ├── [305K] test_2025-02-07T03_48_00.jsonl
    │   └── [ 0] train_2025-02-07T03_48_00.jsonl
    ├── [305K] test_gen.jsonl
    └── [ 0] train_gen.jsonl


    1. @Prasad, I see that you are using the command:

      ilab model train --pipeline simple --gpus 1

      Please verify your system settings and take a look at the following link for the training pipelines: https://github.com/instructlab/instructlab?tab=readme-ov-file#instructlab-model-training-pipelines . I hope this helps.

      You can join the Discord community and ask additional questions there: https://discord.gg/PJUX5Wd6 (this invitation is valid for 7 days).
      BTW: You may also visit my updated version of the blog post: https://suedbroecker.net/2025/02/04/instructlab-fine-tuning-guide-updates-and-insights-for-the-musician-example/
