Use Label Sleuth to build your train and test data input

This blog post is written in the context of creating a text classification model. The objective is to build and train a model that identifies topics in a text, in this case Kubernetes and Watson NLP. As train and test data, we will use an export of my blog on wordpress.com, which we will label with the open-source tool Label Sleuth.

Label Sleuth expects a CSV file with two columns: the first column is called ‘document_id’ and the second is called ‘text’.

The ‘document_id’ identifies a post, and multiple ‘text’ entries are linked to it. These entries represent the sentences (tokens) of the post.

document_id
|
+-- text
|
+-- text
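
The structure above can be sketched as a small CSV. This is a hypothetical example (the sentences are made up); note how the same ‘document_id’ is repeated for every ‘text’ row that belongs to one post:

```python
import io
import csv

# Build a minimal example of the two-column layout Label Sleuth expects:
# one 'document_id' shared by all sentence ('text') rows of the same post.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["document_id", "text"])
writer.writerow(["#doc_id0", "Kubernetes is a container orchestration platform ."])
writer.writerow(["#doc_id0", "Watson NLP provides text classification models ."])
writer.writerow(["#doc_id1", "This is the first sentence of another post ."])
print(sample.getvalue())
```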

The blog post is structured in:

  • Prepare the data structure for Label Sleuth
  • Installation of Label Sleuth
  • Use Label Sleuth to label the data
  • Summary

1. Prepare the data structure for Label Sleuth

Before we can use Label Sleuth, we need to ensure the data is in a valid format. To build this structure, we run the following sequence.

  1. Load the CSV file with pandas and process the data with a pandas data frame.
  2. Extract the sentences from each post and save them as ‘document_id’ and ‘text’ using the Natural Language Toolkit (NLTK); we need to ensure that sentences are isolated at the ‘.’.
  3. The ‘document_id’ must contain all sentences (‘text’) that are part of the post.

1.1 Load the CSV file with pandas and process it with the pandas data frame

import pandas as pd

# 'csv_file_name' is a placeholder for the CSV export of the blog posts
csv_file_name = "blog_export.csv"
df = pd.read_csv(csv_file_name)

1.2 Extract the sentences from each post and save them as ‘document_id’ and ‘text’ using NLTK, isolating sentences at the ‘.’

Here is an extraction of the source code from a Jupyter Notebook.

%%capture
!pip install nltk

import csv
import nltk
nltk.download('punkt')

out_file_name = "data_clean_for_label_sleuth.csv"
csvfile = open(out_file_name, 'w', encoding='utf-8')
csvfile_writer = csv.writer(csvfile)
header = ["document_id", "text", "title", "pubdate", "doc_sentence_id"]
csvfile_writer.writerow(header)

i = 0
for row in range(len(df)):
    title = df.at[row, 'title']
    pubdate = df.at[row, 'pubdate']
    text = df.at[row, 'post']
    # Pad '.' with spaces so NLTK reliably detects the sentence boundaries
    text_clean = text.replace('.', ' . ')
    tokens = nltk.sent_tokenize(text_clean)
    j = 0
    for t in tokens:
        document_id = '#doc_id' + str(i)
        doc_sentence_id = '#doc_id' + str(i) + '.#sentence' + str(j)
        csv_line = [document_id, t, title, pubdate, doc_sentence_id]
        csvfile_writer.writerow(csv_line)
        j = j + 1
    i = i + 1
csvfile.close()
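
After the loop has finished, it is worth verifying that the generated file really has ‘document_id’ and ‘text’ as its first two columns. A minimal sketch with pandas; to keep it self-contained, a tiny stand-in file with the same layout is created first, while in the notebook you would simply read the file written above:

```python
import pandas as pd

# Stand-in for the generated file, using the same column layout as above.
check_file = "data_clean_for_label_sleuth.csv"
pd.DataFrame({
    "document_id": ["#doc_id0", "#doc_id0"],
    "text": ["First sentence .", "Second sentence ."],
    "title": ["Demo post", "Demo post"],
    "pubdate": ["2023-03-07", "2023-03-07"],
    "doc_sentence_id": ["#doc_id0.#sentence0", "#doc_id0.#sentence1"],
}).to_csv(check_file, index=False)

check = pd.read_csv(check_file)
# Label Sleuth requires 'document_id' and 'text' as the first two columns.
assert list(check.columns)[:2] == ["document_id", "text"]
print(check.groupby("document_id").size())
```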

2. Installation of Label Sleuth

When you are going to use Label Sleuth, you need to know that Label Sleuth currently supports only Python 3.9, so you may need to install and manage multiple versions of Python. This blog post helps with managing multiple versions of Python: Installing multiple versions of Python on Mac using Homebrew.

The installation is easy and well documented in the Label Sleuth documentation.

  1. Install Anaconda
  2. Create and activate the environment
conda create --yes -n label-sleuth python=3.9
conda activate label-sleuth
  3. Install Label Sleuth
pip install label-sleuth
  4. Start Label Sleuth
python -m label_sleuth.start_label_sleuth

Example output:
user:
thomassuedbroecker
2023-03-07 20:34:54,663 INFO     [start_label_sleuth.py:104] Starting label-sleuth using config file /Users/thomassuedbroecker/opt/anaconda3/envs/label-sleuth/lib/python3.9/site-packages/label_sleuth/config.json. Output directory is /Users/thomassuedbroecker/label-sleuth
2023-03-07 20:34:54,664 INFO     [start_label_sleuth.py:105] (Starting the service for the first time may take a little longer)
2023-03-07 20:34:56,450 INFO     [wasyncore.py:485] Serving on http://[::1]:8000
2023-03-07 20:34:56,450 INFO     [wasyncore.py:485] Serving on http://127.0.0.1:8000

3. Use Label Sleuth to label the data

3.1 Upload a new dataset

3.2 Create a workspace

3.3 Create a category to label

3.4 Label the category

3.5 More labeling

If you do more labeling for an existing category, the model version for that category is updated.

3.6 Use of positive prediction

The created models help you find more sentences with a positive prediction for the current category.

3.7 Export the labeled category data

Now you can also select to include automatically labeled data in your export.
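
Once the export is available, the labeled data can be split into train and test sets. Here is a minimal sketch with pandas; the column names ‘text’ and ‘label’ are assumptions for illustration, as the actual Label Sleuth export format may differ:

```python
import pandas as pd

# Hypothetical stand-in for the exported file; the real export may use
# different column names, so adjust to what your export contains.
export = pd.DataFrame({
    "text": [f"sentence {n} ." for n in range(10)],
    "label": ["Kubernetes", "Watson NLP"] * 5,
})

# 80/20 train/test split with a fixed seed so the split is reproducible.
train = export.sample(frac=0.8, random_state=42)
test = export.drop(train.index)
print(len(train), len(test))
```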

4. Summary

From my perspective, Label Sleuth is an awesome and easy-to-use tool for labeling/annotating text with custom categories to create training and test data for your models. Check out the great tutorial.


I hope this was useful to you, and let’s see what’s next.

Greetings,

Thomas

#python, #jupyternotebook, #labelsleuth, #labeling, #annotating, #nltk
