How to convert XML data to CSV in a Jupyter Notebook with Python?

This blog post contains an extract of a Jupyter Notebook that shows how to convert XML data to comma-separated values (CSV) with Python.

The objective is to extract the author, publication date, post name, and post content as text from an XML export file of a wordpress.com blog. The extracted blog data will be used later as training and test data for a classification model.

I haven’t found any documentation of the XML schema for this export. The XML structure is parsed with xml.dom.minidom from the Python standard library, and the lxml library (“lxml - XML and HTML with Python”) is used to remove the HTML tags. The data found during parsing is saved in a comma-separated values (CSV) file by using the csv module for Python.

XML format

The following extract shows an example of the wordpress.com XML export format.

<?xml version="1.0" encoding="UTF-8"?>
...
    <item>
        ...
        <content:encoded>
            <![CDATA[Thanks for joining me! <blockquote>Good company in a journey makes the way seem shorter. — Izaak Walton</blockquote><img class="wp-image-7 size-full" src="https://twentysixteendemo.files.wordpress.com/2015/11/post.png" alt="post" width="1000" height="563" />]]>
        </content:encoded>
        <title>
            <![CDATA[The Journey Begins]]>
        </title>
        <dc:creator>
            thomassuedbroecker
        </dc:creator>
        <pubDate>
            Mon, 03 Sep 2018 11:32:18 +0000
        </pubDate>
        <wp:post_name>
            the-journey-begins
        </wp:post_name>
        ...
    </item>
...
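
To make the structure concrete, here is a minimal sketch that parses a shortened version of the extract with xml.dom.minidom and reads three fields of the first item. The rss/channel wrapper and the namespace URIs are assumptions added so that the snippet parses on its own (the extract above elides them), and first_text is a hypothetical helper name.

```python
from xml.dom import minidom

# Shortened sample; the wrapper elements and namespace URIs are assumed for illustration
SAMPLE_XML = """<rss xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <content:encoded><![CDATA[Thanks for joining me! <blockquote>Good company</blockquote>]]></content:encoded>
      <title><![CDATA[The Journey Begins]]></title>
      <dc:creator>thomassuedbroecker</dc:creator>
      <pubDate>Mon, 03 Sep 2018 11:32:18 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def first_text(parent, tag_name):
    """Return the joined text/CDATA content of the first matching element, or ''."""
    elements = parent.getElementsByTagName(tag_name)
    if not elements:
        return ""
    return "".join(
        child.data
        for child in elements[0].childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    ).strip()


doc = minidom.parseString(SAMPLE_XML)
item = doc.getElementsByTagName("item")[0]

title = first_text(item, "title")
creator = first_text(item, "dc:creator")
pub_date = first_text(item, "pubDate")
print(title, creator, pub_date, sep=" | ")
```

Namespaced elements such as dc:creator are looked up by their full prefixed name, which is the same name the node reports as its nodeName.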

The <content:encoded> XML entry for the post content contains HTML code. To remove the HTML tags we use the lxml.html module. The following shows the post content including the HTML tags.

Thanks for joining me! <blockquote>Good company in a journey makes the way seem shorter. — Izaak Walton</blockquote><img class="wp-image-7 size-full" src="https://twentysixteendemo.files.wordpress.com/2015/11/post.png" alt="post" width="1000" height="563" />
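
To illustrate what removing the tags leaves behind, here is a dependency-free sketch based on Python's built-in html.parser module. It is a stand-in for illustration only, not the lxml-based approach used in the notebook; the names TextExtractor and strip_tags are hypothetical, and the sample string is a shortened version of the post content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text outside of a tag
        self.parts.append(data)


def strip_tags(html_text):
    """Return the text content of an HTML fragment without any markup."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()


# Shortened version of the post content shown above
post_html = (
    "Thanks for joining me! <blockquote>Good company in a journey makes "
    "the way seem shorter.</blockquote>"
    '<img class="wp-image-7 size-full" alt="post" width="1000" height="563" />'
)

plain_text = strip_tags(post_html)
print(plain_text)
```

The image tag carries no text, so it disappears completely, while the text inside the blockquote is kept.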

Python source code

The following example code extract shows how to convert the XML data to CSV format. It writes the title, publication date, and post content of each item to the CSV file.

Import the libraries

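import csv
from xml.dom import minidom

from lxml import html
# Note: in recent lxml releases the cleaner lives in the separate lxml_html_clean package
from lxml.html.clean import clean_html

# Placeholder: replace with the path to your wordpress.com XML export file
xml_file_name = "wordpress-export.xml"
xml_doc = minidom.parse(xml_file_name)

csv_file_name = "data.csv"
# newline='' lets the csv module control the line endings
csvfile = open(csv_file_name, 'w', encoding='utf-8', newline='')
csvfile_writer = csv.writer(csvfile)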

Extract the data

# write the CSV header row
csvfile_writer.writerow(['title', 'pubdate', 'post'])


def node_text(item, tag_name):
    # join the text and CDATA children of the first matching element
    elements = item.getElementsByTagName(tag_name)
    if not elements:
        return ''
    return ''.join(
        child.nodeValue
        for child in elements[0].childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    ).strip()


# every blog post is stored in an <item> element
for item in xml_doc.getElementsByTagName('item'):
    title = node_text(item, 'title')
    pubdate = node_text(item, 'pubDate')
    html_text = node_text(item, 'content:encoded')

    # remove html tags from the text (lxml cannot parse an empty string)
    if html_text:
        tree = html.fromstring(html_text)
        post_text = clean_html(tree).text_content().strip()
    else:
        post_text = ''

    # remove commas from the text
    clean_title = title.replace(',', ' ')
    clean_pubdate = pubdate.replace(',', ' ')
    clean_post = post_text.replace(',', ' ')

    csvfile_writer.writerow([clean_title, clean_pubdate, clean_post])

csvfile.close()
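
The loop replaces commas with spaces before writing each row. As a side note, csv.writer quotes any field that contains the delimiter by default, so the replacement is optional if you want to keep the original punctuation. Here is a minimal sketch that uses an in-memory buffer instead of data.csv; the row values are taken from the example above.

```python
import csv
import io

# In-memory stand-in for the CSV file
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["title", "pubdate", "post"])
# The publication date still contains its comma
writer.writerow(
    ["The Journey Begins", "Mon, 03 Sep 2018 11:32:18 +0000", "Thanks for joining me!"]
)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back returns the date as one field, comma included
rows = list(csv.reader(io.StringIO(csv_text, newline="")))
print(rows[1][1])
```

Keeping the commas would leave the publication date in its original form, which can be convenient if the dates are parsed later.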

Below you can see the resulting table of the XML to CSV extraction, displayed with pandas, an open-source data analysis and manipulation tool for Python.

import pandas as pd
df = pd.read_csv(csv_file_name)
df.head()

	title	pubdate	post
0	The Journey Begins	Mon 03 Sep 2018 11:32:18 +0000	Thanks for joining me! Good company in a journ…
1	“What are major elements to organize my servic…	Tue 04 Sep 2018 13:48:57 +0000	Hello I just started in the Developer Advocate…
2	How to use VCAP with a IAM enabled service in …	Thu 06 Sep 2018 15:12:44 +0000	Hello this blog is relevant for you if you us…

I hope this was useful to you. Let’s see what’s next!

Greetings,

Thomas

#python, #jupyternotebook, #convert_xml_to_csv, #wordpress, #convert, #xml, #csv
