This blog post contains an extract of a Jupyter Notebook and shows a way to create a new DataFrame by extracting rows from an existing DataFrame based on the values of two specific columns. The background is that I labeled data with Label Sleuth, an open-source no-code system for text annotation and building text classifiers, and then exported the labeled train set. The image below shows the export dialog.

Here is an extract of the data exported from Label Sleuth, loaded into a DataFrame inside a Jupyter Notebook.
import pandas as pd

# Load the CSV file exported from Label Sleuth
# (input_csv_file_name holds the path to the export)
df_input_data = pd.read_csv(input_csv_file_name)
df_input_data.head()
| | workspace_id | category_name | document_id | dataset | text | uri | element_metadata | label | label_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Blog_post_01 | kubernetes_true | “#doc_id194” | pre_proce_level_2_v2 | “There is an article called Let’s embed AI int… | pre_proce_level_2_v2-“#doc_id194”-5 | {} | False | Standard |
| 1 | Blog_post_01 | kubernetes_true | “#doc_id6” | pre_proce_level_2_v2 | “For example; to add my simple git and cloud h… | pre_proce_level_2_v2-“#doc_id6”-10 | {} | False | Standard |
| 2 | Blog_post_01 | kubernetes_true | “#doc_id2” | pre_proce_level_2_v2 | “The enablement of the two models is availabl… | pre_proce_level_2_v2-“#doc_id2”-14 | {} | False | Standard |
| 3 | Blog_post_01 | kubernetes_true | “#doc_id47” | pre_proce_level_2_v2 | “Visit the hands-on workshop Use a IBM Cloud t… | pre_proce_level_2_v2-“#doc_id47”-10 | {} | True | Standard |
| 4 | Blog_post_01 | kubernetes_true | “#doc_id1” | pre_proce_level_2_v2 | “This is a good starting point to move on to t… | pre_proce_level_2_v2-“#doc_id1”-58 | {} | False | Standard |
In this example, we want to find all rows where the column `category_name` equals `kubernetes_true` and the column `label` contains the value `True` at the same index position.
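To illustrate the goal, the same selection can also be expressed with a vectorized boolean mask. This is only a minimal sketch, assuming the column layout shown above; it is not the approach used in the rest of this post:

```python
# Minimal sketch (assumption): the vectorized equivalent of the search
# described above. The label column is compared as a string, mirroring
# the str() conversion used in the function below.
mask = ((df_input_data['category_name'] == 'kubernetes_true')
        & (df_input_data['label'].astype(str) == 'True'))
df_selected = df_input_data[mask]
df_selected.head()
```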
The following Python function, `verify_and_append_data`, builds a new DataFrame from those rows of an existing DataFrame that match the given values in these two specific columns. The function will be called during an iteration over the input DataFrame.
The function is tailored to this column structure `['workspace_id', 'category_name', 'document_id', 'dataset', 'text', 'uri', 'element_metadata', 'label', 'label_type']` and takes the following parameters:

- `search_column_1` and `search_column_2` define the columns in which to search for the values.
- `search_val_1` and `search_val_2` specify the values you are searching for in those columns.
- `index` defines the current index position in the input DataFrame.
- `input_dataframe` is the DataFrame that contains the input data.
- `output_dataframe` is the DataFrame that collects the extracted data.
# The function uses the following data structure ['workspace_id', 'category_name', 'document_id', 'dataset', 'text', 'uri', 'element_metadata', 'label', 'label_type']
def verify_and_append_data(search_column_1, search_column_2, search_val_1, search_val_2, index, input_dataframe, output_dataframe):
    reference_1 = input_dataframe.loc[index, search_column_1]
    reference_2 = str(input_dataframe.loc[index, search_column_2])
    if (reference_1 == search_val_1) and (reference_2 == search_val_2):
        # Get the existing data from the input DataFrame.
        # .at is the public, fast accessor for single values
        # (preferable to the private _get_value method).
        workspace_id = input_dataframe.at[index, 'workspace_id']
        category_name = input_dataframe.at[index, 'category_name']
        document_id = input_dataframe.at[index, 'document_id']
        dataset = input_dataframe.at[index, 'dataset']
        text = input_dataframe.at[index, 'text']
        uri = input_dataframe.at[index, 'uri']
        element_metadata = input_dataframe.at[index, 'element_metadata']
        label = input_dataframe.at[index, 'label']
        label_type = input_dataframe.at[index, 'label_type']
        # Save the data for insertion
        data_row = {'workspace_id': workspace_id, 'category_name': category_name, 'document_id': document_id, 'dataset': dataset, 'text': text, 'uri': uri, 'element_metadata': element_metadata, 'label': label, 'label_type': label_type}
        # Create a temporary DataFrame that will be combined with the output DataFrame
        temp_df = pd.DataFrame([data_row])
        # Combine the temporary DataFrame with the output DataFrame
        output_dataframe = pd.concat([output_dataframe, temp_df], axis=0, ignore_index=True)
        # Return the status and the extended DataFrame
        return True, output_dataframe
    else:
        return False, output_dataframe
Here is how the function is used:
# Define the column structure for the output DataFrame
columns = ['workspace_id', 'category_name', 'document_id', 'dataset', 'text', 'uri', 'element_metadata', 'label', 'label_type']
df_train_test_data = pd.DataFrame(columns=columns)

# Check each row for every category labeled 'True' and append the matches
for row in range(len(df_input_data)):
    verify_result = False
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'kubernetes_true', 'True', row, df_input_data, df_train_test_data)
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'watson_nlp_true', 'True', row, df_input_data, df_train_test_data)
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'kubernetes_false', 'True', row, df_input_data, df_train_test_data)
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'watson_nlp_false', 'True', row, df_input_data, df_train_test_data)

# Show the different lengths of the two DataFrames
print(len(df_input_data), len(df_train_test_data))
df_train_test_data.head()
Here is an example result of executing this loop inside a Jupyter Notebook.
1067 342
| | workspace_id | category_name | document_id | dataset | text | uri | element_metadata | label | label_type |
|---|---|---|---|---|---|---|---|---|---|
| 3 | Blog_post_01 | kubernetes_true | “#doc_id47” | pre_proce_level_2_v2 | “Visit the hands-on workshop Use a IBM Cloud t… | pre_proce_level_2_v2-“#doc_id47”-10 | {} | True | Standard |
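A note on the design: calling `pd.concat` once per matching row copies the growing output DataFrame on every append, which can become slow for large inputs. As an alternative, the matching rows could be collected in a list first and concatenated in a single step. This is only a minimal sketch under that assumption, not the approach used above:

```python
# Alternative sketch (assumption): collect matching rows in a list and
# build the output DataFrame with a single concat at the end.
categories = ['kubernetes_true', 'watson_nlp_true', 'kubernetes_false', 'watson_nlp_false']
matching_rows = []
for index in df_input_data.index:
    category = df_input_data.at[index, 'category_name']
    label = str(df_input_data.at[index, 'label'])
    if category in categories and label == 'True':
        # loc with a list of labels returns a one-row DataFrame
        matching_rows.append(df_input_data.loc[[index]])

# One concat instead of one per match avoids repeated copying
df_train_test_data_alt = pd.concat(matching_rows, ignore_index=True) if matching_rows else pd.DataFrame(columns=df_input_data.columns)
print(len(df_train_test_data_alt))
```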
I hope this was useful to you, and let’s see what’s next!
Greetings,
Thomas
#python, #jupyternotebook, #dataframe, #pandas, #concat