How to convert XML data to CSV in a Jupyter Notebook with Python?

This blog post contains an extract of a Jupyter Notebook related to how to convert XML data to comma separated value in a Jupyter Notebook.

The objective is to extract the author, publication date, post name, and post content as text from an xml export file of a wordpress.com blog. The extracted blog data will be used later as training and testing data input for a classification model.

I haven’t found any documentation of the xml schema for this export. The lxml - XML and HTML with Python library is used to parse the xml structure and to remove HTML tags. The data we are going to find during the parsing will be saved in a comma separated value CSV file. This will be done by using the CSV library for Python.

XML format

The following code shows an example extract of wordpress.com xml export format.

<?xml version="1.0" encoding="UTF-8"?>
...
    <item>
        ...
        <content:encoded>
            <![CDATA[Thanks for joining me! <blockquote>Good company in a journey makes the way seem shorter. — Izaak Walton</blockquote><img class="wp-image-7 size-full" src="https://twentysixteendemo.files.wordpress.com/2015/11/post.png" alt="post" width="1000" height="563" />]]>
        </content:encoded>
        <title>
            <![CDATA[The Journey Begins]]>
        </title>
        <dc:creator>
            thomassuedbroecker
        </dc:creator>
        <pubDate>
            Mon, 03 Sep 2018 11:32:18 +0000
        </pubDate>
        <wp:post_name>
            the-journey-begins
        </wp:post_name>
        ...
    </item>
...

The <content:encoded> xml entry for the post content contains HTML code. To remove the HTML tags we use the HTML library. The following code shows the post content including the HTML tags.

Thanks for joining me! <blockquote>Good company in a journey makes the way seem shorter. — Izaak Walton</blockquote><img class="wp-image-7 size-full" src="https://twentysixteendemo.files.wordpress.com/2015/11/post.png" alt="post" width="1000" height="563" />

Python source code

The following example code extract shows how to convert XML data to CSV format.

  • Import the libraries
from xml.dom import minidom
xml_doc = minidom.parse(xml_file_name);

from lxml import html
from lxml.html.clean import clean_html
import csv

csv_file_name = "data.csv"
csvfile = open(csv_file_name,'w',encoding='utf-8')
csvfile_writer = csv.writer(csvfile)
  • Extract the data
# init csv
csv_line = ['title', 'pubdate', 'post'] 
csvfile_writer.writerow(csv_line)

i = 0
for node in xml_doc.childNodes: # level 1
    i = i + 1 
    
    if node.childNodes.length > 0: # level 1.1
        nodes = node.childNodes
        j = 1
        for node in nodes:
            j = j + 1  
            
            if node.childNodes.length > 0:  # level 1.1.1
                nodes = node.childNodes
                k = 1
                
                for node in nodes:
                    
                    k = k + 1  
                    if (node.nodeName == "item"):
                        
                        if node.childNodes.length > 0:  # level 1.1.1.1
                            nodes = node.childNodes
                            l = 1
    
                            for node in nodes:
                                l = l + 1
                
                                # filter needed nodes for the next level
                                if ((node.nodeName == "content:encoded") or (node.nodeName == "title") or (node.nodeName == "dc:creator") or (node.nodeName == "pubDate") or (node.nodeName == "wp:post_name")):
                                    nodeName = node.nodeName
                                    if node.childNodes.length > 0:
                                        nodes = node.childNodes
                                        m = 1
                                        for node in nodes:
                                            m = m + 1
                                            
                                            if (nodeName == "content:encoded"):
                                                 # remove html tags from the text
                                                html_text = node.nodeValue
                                                tree = html.fromstring(html_text)
                                                post_text = clean_html(tree).text_content().strip()
                                                # remove commas from the text
                                                clean_post = post_text.replace(',',' ')
                                            else:
                                                if (nodeName == "title"):
                                                    title = node.nodeValue
                                                    # remove commas from the text
                                                    clean_title = title.replace(',',' ')
                                                else:
                                                    if (nodeName == "pubDate"):
                                                        pubdate = node.nodeValue
                                                        # remove commas from the text
                                                        clean_pubdate = pubdate.replace(',',' ')

                            csv_line = [clean_title, clean_pubdate, clean_post]   
                            csvfile_writer.writerow(csv_line)
csvfile.close()

Below you can see an example of a table for the XML to CSV extraction, created with an open-source data analysis and manipulation tool for Python called pandas.

import pandas as pd
df = pd.read_csv(csv_file_name)
df.head()
titlepubdatepost
0The Journey BeginsMon 03 Sep 2018 11:32:18 +0000Thanks for joining me! Good company in a journ…
1“What are major elements to organize my servic…Tue 04 Sep 2018 13:48:57 +0000Hello I just started in the Developer Advocate…
2How to use VCAP with a IAM enabled service in …Thu 06 Sep 2018 15:12:44 +0000Hello this blog is relevant for you if you us…

I hope this was useful to you, and let’s see what’s next?

Greetings,

Thomas

#python, #jupyternotebook, #convert_xml_to_csv, #wordpress, #convert, #xml, #csv

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Blog at WordPress.com.

Up ↑

%d bloggers like this: