Building a new DataFrame from an existing DataFrame in a Jupyter Notebook with Python

This blog post contains an extract of a Jupyter Notebook and shows how to create a new DataFrame by filtering an existing DataFrame on the values of two specific columns.

The background is that I labeled data with Label Sleuth, an open-source no-code system for text annotation and building text classifiers, and exported the labeled train set. The image below shows the export dialog.

Here is an extract of the exported Label Sleuth data loaded into a DataFrame inside a Jupyter Notebook.

import pandas as pd
df_input_data = pd.read_csv(input_csv_file_name)
df_input_data.head()
  | workspace_id | category_name | document_id | dataset | text | uri | element_metadata | label | label_type
0 | Blog_post_01 | kubernetes_true | “#doc_id194” | pre_proce_level_2_v2 | “There is an article called Let’s embed AI int… | pre_proce_level_2_v2-“#doc_id194”-5 | {} | False | Standard
1 | Blog_post_01 | kubernetes_true | “#doc_id6” | pre_proce_level_2_v2 | “For example; to add my simple git and cloud h… | pre_proce_level_2_v2-“#doc_id6”-10 | {} | False | Standard
2 | Blog_post_01 | kubernetes_true | “#doc_id2” | pre_proce_level_2_v2 | “The enablement of the two models is availabl… | pre_proce_level_2_v2-“#doc_id2”-14 | {} | False | Standard
3 | Blog_post_01 | kubernetes_true | “#doc_id47” | pre_proce_level_2_v2 | “Visit the hands-on workshop Use a IBM Cloud t… | pre_proce_level_2_v2-“#doc_id47”-10 | {} | True | Standard
4 | Blog_post_01 | kubernetes_true | “#doc_id1” | pre_proce_level_2_v2 | “This is a good starting point to move on to t… | pre_proce_level_2_v2-“#doc_id1”-58 | {} | False | Standard

In this example, we want to find all rows where the column category_name equals kubernetes_true and the column label contains the value True at the same index position.
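As a quick illustration of that selection logic, here is a minimal sketch using a small hypothetical DataFrame that contains only the two relevant columns. The column names match the exported data; the values are made up for the example:

```python
import pandas as pd

# A tiny stand-in for the exported Label Sleuth data (hypothetical values)
df = pd.DataFrame({
    'category_name': ['kubernetes_true', 'kubernetes_true', 'watson_nlp_true'],
    'label':         [False, True, True],
})

# Both conditions must hold at the same index position
mask = (df['category_name'] == 'kubernetes_true') & (df['label'] == True)
df_selected = df[mask]
print(len(df_selected))  # 1 matching row
```

Only the second row satisfies both conditions, so the selection contains a single row.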

The following Python function verify_and_append_data builds the new DataFrame from the rows where these two columns contain the searched values.

The function verify_and_append_data will be called for each row while iterating over the input DataFrame.

The function is tailored to this column structure ['workspace_id', 'category_name', 'document_id', 'dataset', 'text', 'uri', 'element_metadata', 'label', 'label_type'] and takes the following parameters:

  • search_column_1 and search_column_2 define the columns in which to search.
  • search_val_1 and search_val_2 specify the values to search for in those columns.
  • index is the current index position in the input DataFrame.
  • input_dataframe is the DataFrame that contains the input data.
  • output_dataframe is the DataFrame that will contain the extracted data.
# The function expects the following data structure: ['workspace_id', 'category_name', 'document_id', 'dataset', 'text', 'uri', 'element_metadata', 'label', 'label_type']
def verify_and_append_data(search_column_1, search_column_2, search_val_1, search_val_2, index, input_dataframe, output_dataframe):

    reference_1 = input_dataframe.loc[index, search_column_1]
    reference_2 = str(input_dataframe.loc[index, search_column_2])

    if (reference_1 == search_val_1) and (reference_2 == search_val_2):

        # get the existing data from the input data frame
        # (using the public .at accessor instead of the private _get_value)
        workspace_id     = input_dataframe.at[index, 'workspace_id']
        category_name    = input_dataframe.at[index, 'category_name']
        document_id      = input_dataframe.at[index, 'document_id']
        dataset          = input_dataframe.at[index, 'dataset']
        text             = input_dataframe.at[index, 'text']
        uri              = input_dataframe.at[index, 'uri']
        element_metadata = input_dataframe.at[index, 'element_metadata']
        label            = input_dataframe.at[index, 'label']
        label_type       = input_dataframe.at[index, 'label_type']

        # save the data for insertion
        data_row = {'workspace_id': workspace_id, 'category_name': category_name,
                    'document_id': document_id, 'dataset': dataset, 'text': text,
                    'uri': uri, 'element_metadata': element_metadata,
                    'label': label, 'label_type': label_type}

        # create a temporary data frame that will be combined with the output data frame
        temp_df = pd.DataFrame([data_row])

        # combine the temporary data frame with the output data frame
        output_dataframe = pd.concat([output_dataframe, temp_df], axis=0, ignore_index=True)

        # return the status and the extended data frame
        return True, output_dataframe
    else:
        return False, output_dataframe

Here is the usage of the function.

columns = ['workspace_id', 'category_name', 'document_id', 'dataset', 'text', 'uri', 'element_metadata', 'label', 'label_type']
df_train_test_data = pd.DataFrame(columns=columns)

for row in range(len(df_input_data)):
 
    verify_result = False
    
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'kubernetes_true', 'True', row, df_input_data, df_train_test_data)  
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'watson_nlp_true', 'True', row, df_input_data, df_train_test_data)   
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'kubernetes_false', 'True', row, df_input_data, df_train_test_data)
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'watson_nlp_false', 'True', row, df_input_data, df_train_test_data)

# show the different length of the two DataFrames
print(len(df_input_data), len(df_train_test_data))
df_train_test_data.head()

This is an example result of an execution inside a Jupyter Notebook.

1067 342

  | workspace_id | category_name | document_id | dataset | text | uri | element_metadata | label | label_type
3 | Blog_post_01 | kubernetes_true | “#doc_id47” | pre_proce_level_2_v2 | “Visit the hands-on workshop Use a IBM Cloud t… | pre_proce_level_2_v2-“#doc_id47”-10 | {} | True | Standard
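As a side note, the same selection can also be done without an explicit loop, using a vectorized boolean mask with isin over all four category names. This is a sketch with a small hypothetical DataFrame containing only the columns that matter for the filter; in the real notebook the mask would be applied to df_input_data with all nine columns:

```python
import pandas as pd

# Hypothetical sample mirroring the columns relevant for the filter
df_input_data = pd.DataFrame({
    'category_name': ['kubernetes_true', 'watson_nlp_true',
                      'kubernetes_false', 'kubernetes_true'],
    'label':         [True, False, True, False],
    'text':          ['a', 'b', 'c', 'd'],
})

categories = ['kubernetes_true', 'watson_nlp_true',
              'kubernetes_false', 'watson_nlp_false']

# Select rows where category_name is one of the four categories
# and label is True (compared as a string, like in the function above)
mask = (df_input_data['category_name'].isin(categories)
        & (df_input_data['label'].astype(str) == 'True'))
df_train_test_data = df_input_data[mask].reset_index(drop=True)

print(len(df_input_data), len(df_train_test_data))  # 4 2
```

This keeps all columns of the matching rows in one step, at the cost of being less explicit about each individual category check.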

I hope this was useful to you. Let’s see what’s next!

Greetings,

Thomas

#python, #jupyternotebook, #dataframe, #pandas, #concat
