This blog post contains an extract of a Jupyter Notebook
related to how to convert XML data to comma separated value
in a Jupyter Notebook
.
The objective is to extract the author, publication date, post name, and post content as text from an xml export file of a wordpress.com
blog. The extracted blog data will be used later as training and testing data input for a classification model.
I haven’t found any documentation of the xml schema for this export. The lxml - XML and HTML with Python
library is used to parse the xml structure and to remove HTML tags. The data we are going to find during the parsing will be saved in a comma separated value
CSV file. This will be done by using the CSV library for Python
.
XML format
The following code shows an example extract of wordpress.com
xml export format.
<?xml version="1.0" encoding="UTF-8"?>
...
<item>
...
<content:encoded>
<![CDATA[Thanks for joining me! <blockquote>Good company in a journey makes the way seem shorter. — Izaak Walton</blockquote><img class="wp-image-7 size-full" src="https://twentysixteendemo.files.wordpress.com/2015/11/post.png" alt="post" width="1000" height="563" />]]>
</content:encoded>
<title>
<![CDATA[The Journey Begins]]>
</title>
<dc:creator>
thomassuedbroecker
</dc:creator>
<pubDate>
Mon, 03 Sep 2018 11:32:18 +0000
</pubDate>
<wp:post_name>
the-journey-begins
</wp:post_name>
...
</item>
...
The <content:encoded>
xml entry for the post content contains HTML
code. To remove the HTML tags we use the HTML library
. The following code shows the post content including the HTML
tags.
Thanks for joining me! <blockquote>Good company in a journey makes the way seem shorter. — Izaak Walton</blockquote><img class="wp-image-7 size-full" src="https://twentysixteendemo.files.wordpress.com/2015/11/post.png" alt="post" width="1000" height="563" />
Python source code
The following example code extract shows how to convert XML data to CSV format.
- Import the libraries
from xml.dom import minidom
xml_doc = minidom.parse(xml_file_name);
from lxml import html
from lxml.html.clean import clean_html
import csv
csv_file_name = "data.csv"
csvfile = open(csv_file_name,'w',encoding='utf-8')
csvfile_writer = csv.writer(csvfile)
- Extract the data
# init csv
csv_line = ['title', 'pubdate', 'post']
csvfile_writer.writerow(csv_line)
i = 0
for node in xml_doc.childNodes: # level 1
i = i + 1
if node.childNodes.length > 0: # level 1.1
nodes = node.childNodes
j = 1
for node in nodes:
j = j + 1
if node.childNodes.length > 0: # level 1.1.1
nodes = node.childNodes
k = 1
for node in nodes:
k = k + 1
if (node.nodeName == "item"):
if node.childNodes.length > 0: # level 1.1.1.1
nodes = node.childNodes
l = 1
for node in nodes:
l = l + 1
# filter needed nodes for the next level
if ((node.nodeName == "content:encoded") or (node.nodeName == "title") or (node.nodeName == "dc:creator") or (node.nodeName == "pubDate") or (node.nodeName == "wp:post_name")):
nodeName = node.nodeName
if node.childNodes.length > 0:
nodes = node.childNodes
m = 1
for node in nodes:
m = m + 1
if (nodeName == "content:encoded"):
# remove html tags from the text
html_text = node.nodeValue
tree = html.fromstring(html_text)
post_text = clean_html(tree).text_content().strip()
# remove commas from the text
clean_post = post_text.replace(',',' ')
else:
if (nodeName == "title"):
title = node.nodeValue
# remove commas from the text
clean_title = title.replace(',',' ')
else:
if (nodeName == "pubDate"):
pubdate = node.nodeValue
# remove commas from the text
clean_pubdate = pubdate.replace(',',' ')
csv_line = [clean_title, clean_pubdate, clean_post]
csvfile_writer.writerow(csv_line)
csvfile.close()
Below you can see an example of a table for the XML to CSV extraction, created with an open-source data analysis and manipulation tool for Python called pandas.
import pandas as pd
df = pd.read_csv(csv_file_name)
df.head()
title | pubdate | post | |
0 | The Journey Begins | Mon 03 Sep 2018 11:32:18 +0000 | Thanks for joining me! Good company in a journ… |
1 | “What are major elements to organize my servic… | Tue 04 Sep 2018 13:48:57 +0000 | Hello I just started in the Developer Advocate… |
2 | How to use VCAP with a IAM enabled service in … | Thu 06 Sep 2018 15:12:44 +0000 | Hello this blog is relevant for you if you us… |
I hope this was useful to you, and let’s see what’s next?
Greetings,
Thomas
#python, #jupyternotebook, #convert_xml_to_csv, #wordpress, #convert, #xml, #csv
Leave a Reply