This blog post contains an extract of a Jupyter Notebook and shows a way to create a new DataFrame by extracting rows from an existing DataFrame based on the values of two specific columns. The background is that I labeled data with Label Sleuth, an open-source no-code system for text annotation and building text classifiers, and then exported the labeled train set. The image below shows the export dialog.

Here is an extract of the data exported from Label Sleuth, loaded into a DataFrame inside a Jupyter Notebook.
import pandas as pd

# Load the CSV file exported from Label Sleuth
# (input_csv_file_name holds the path to the export)
df_input_data = pd.read_csv(input_csv_file_name)
df_input_data.head()
| | workspace_id | category_name | document_id | dataset | text | uri | element_metadata | label | label_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Blog_post_01 | kubernetes_true | “#doc_id194” | pre_proce_level_2_v2 | “There is an article called Let’s embed AI int… | pre_proce_level_2_v2-“#doc_id194”-5 | {} | False | Standard |
| 1 | Blog_post_01 | kubernetes_true | “#doc_id6” | pre_proce_level_2_v2 | “For example; to add my simple git and cloud h… | pre_proce_level_2_v2-“#doc_id6”-10 | {} | False | Standard |
| 2 | Blog_post_01 | kubernetes_true | “#doc_id2” | pre_proce_level_2_v2 | “The enablement of the two models is availabl… | pre_proce_level_2_v2-“#doc_id2”-14 | {} | False | Standard |
| 3 | Blog_post_01 | kubernetes_true | “#doc_id47” | pre_proce_level_2_v2 | “Visit the hands-on workshop Use a IBM Cloud t… | pre_proce_level_2_v2-“#doc_id47”-10 | {} | True | Standard |
| 4 | Blog_post_01 | kubernetes_true | “#doc_id1” | pre_proce_level_2_v2 | “This is a good starting point to move on to t… | pre_proce_level_2_v2-“#doc_id1”-58 | {} | False | Standard |
In this example, we want to find all rows where the column `category_name` equals `kubernetes_true` and the column `label` contains the value `True` at the same index position.
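To illustrate the goal, the same selection can also be expressed with a vectorized boolean mask. This is only a minimal sketch, assuming the column layout shown above; it is not the approach used in the rest of this post:

```python
# Minimal sketch (assumption): the vectorized equivalent of the search
# described above. The label column is compared as a string, mirroring
# the str() conversion used in the function below.
mask = ((df_input_data['category_name'] == 'kubernetes_true')
        & (df_input_data['label'].astype(str) == 'True'))
df_selected = df_input_data[mask]
df_selected.head()
```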
The following Python function, `verify_and_append_data`, builds a new DataFrame from those rows of an existing DataFrame that match the given values in these two specific columns. The function will be called during an iteration over the input DataFrame.
The function is tailored to this column structure `['workspace_id', 'category_name', 'document_id', 'dataset', 'text', 'uri', 'element_metadata', 'label', 'label_type']` and takes the following parameters:

- `search_column_1` and `search_column_2` define the columns in which to search for the values.
- `search_val_1` and `search_val_2` specify the values you are searching for in those columns.
- `index` defines the current index position in the input DataFrame.
- `input_dataframe` is the DataFrame that contains the input data.
- `output_dataframe` is the DataFrame that collects the extracted data.
# The function uses the following data structure ['workspace_id', 'category_name', 'document_id', 'dataset', 'text', 'uri', 'element_metadata', 'label', 'label_type']
def verify_and_append_data(search_column_1, search_column_2, search_val_1, search_val_2, index, input_dataframe, output_dataframe):
    reference_1 = input_dataframe.loc[index, search_column_1]
    reference_2 = str(input_dataframe.loc[index, search_column_2])
    if (reference_1 == search_val_1) and (reference_2 == search_val_2):
        # Get the existing data from the input DataFrame.
        # .at is the public, fast accessor for single values
        # (preferable to the private _get_value method).
        workspace_id = input_dataframe.at[index, 'workspace_id']
        category_name = input_dataframe.at[index, 'category_name']
        document_id = input_dataframe.at[index, 'document_id']
        dataset = input_dataframe.at[index, 'dataset']
        text = input_dataframe.at[index, 'text']
        uri = input_dataframe.at[index, 'uri']
        element_metadata = input_dataframe.at[index, 'element_metadata']
        label = input_dataframe.at[index, 'label']
        label_type = input_dataframe.at[index, 'label_type']
        # Save the data for insertion
        data_row = {'workspace_id': workspace_id, 'category_name': category_name, 'document_id': document_id, 'dataset': dataset, 'text': text, 'uri': uri, 'element_metadata': element_metadata, 'label': label, 'label_type': label_type}
        # Create a temporary DataFrame that will be combined with the output DataFrame
        temp_df = pd.DataFrame([data_row])
        # Combine the temporary DataFrame with the output DataFrame
        output_dataframe = pd.concat([output_dataframe, temp_df], axis=0, ignore_index=True)
        # Return the status and the extended DataFrame
        return True, output_dataframe
    else:
        return False, output_dataframe
Here is how the function is used:
# Define the column structure for the output DataFrame
columns = ['workspace_id', 'category_name', 'document_id', 'dataset', 'text', 'uri', 'element_metadata', 'label', 'label_type']
df_train_test_data = pd.DataFrame(columns=columns)

# Check each row for every category labeled 'True' and append the matches
for row in range(len(df_input_data)):
    verify_result = False
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'kubernetes_true', 'True', row, df_input_data, df_train_test_data)
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'watson_nlp_true', 'True', row, df_input_data, df_train_test_data)
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'kubernetes_false', 'True', row, df_input_data, df_train_test_data)
    verify_result, df_train_test_data = verify_and_append_data('category_name', 'label', 'watson_nlp_false', 'True', row, df_input_data, df_train_test_data)

# Show the different lengths of the two DataFrames
print(len(df_input_data), len(df_train_test_data))
df_train_test_data.head()
Here is an example result of executing this loop inside a Jupyter Notebook.
1067 342
| | workspace_id | category_name | document_id | dataset | text | uri | element_metadata | label | label_type |
|---|---|---|---|---|---|---|---|---|---|
| 3 | Blog_post_01 | kubernetes_true | “#doc_id47” | pre_proce_level_2_v2 | “Visit the hands-on workshop Use a IBM Cloud t… | pre_proce_level_2_v2-“#doc_id47”-10 | {} | True | Standard |
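A note on the design: calling `pd.concat` once per matching row copies the growing output DataFrame on every append, which can become slow for large inputs. As an alternative, the matching rows could be collected in a list first and concatenated in a single step. This is only a minimal sketch under that assumption, not the approach used above:

```python
# Alternative sketch (assumption): collect matching rows in a list and
# build the output DataFrame with a single concat at the end.
categories = ['kubernetes_true', 'watson_nlp_true', 'kubernetes_false', 'watson_nlp_false']
matching_rows = []
for index in df_input_data.index:
    category = df_input_data.at[index, 'category_name']
    label = str(df_input_data.at[index, 'label'])
    if category in categories and label == 'True':
        # loc with a list of labels returns a one-row DataFrame
        matching_rows.append(df_input_data.loc[[index]])

# One concat instead of one per match avoids repeated copying
df_train_test_data_alt = pd.concat(matching_rows, ignore_index=True) if matching_rows else pd.DataFrame(columns=df_input_data.columns)
print(len(df_train_test_data_alt))
```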
I hope this was useful to you, and let’s see what’s next!
Greetings,
Thomas
#python, #jupyternotebook, #dataframe, #pandas, #concat