Reputation:
I am parsing a simple XML document with a simple script that I wrote (with a couple of tweaks). Here is the XML:
<?xml version="1.0" ?>
<library owner="John Franks">
<book>
<title>Sandman Volume 1: Preludes and Nocturnes</title>
<author>Neil Gaiman</author>
</book>
<book>
<title>Good Omens</title>
<author>Neil Gamain</author>
<author>Terry Pratchett</author>
</book>
<book>
<title>The Man And The Goat</title>
<author>Bubber Elderidge</author>
</book>
<book>
<title>Once Upon A Time in LA</title>
<author>Dr Dre</author>
</book>
<book>
<title>There Will Never Be Justice</title>
<author>IR Jury</author>
</book>
<book>
<title>Beginning Python</title>
<author>Peter Norton, et al</author>
</book>
</library>
And here is my Python script:
from xml.dom.minidom import parse
import xml.dom.minidom
import csv
def writeToCSV(myLibrary):
csvfile = open('output.csv', 'w')
fieldnames = ['title', 'author', 'author']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
books = myLibrary.getElementsByTagName("book")
for book in books:
titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
for author in book.getElementsByTagName("author"):
authorValue = author.childNodes[0].data
writer.writerow({'title': titleValue, 'author': authorValue})
doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]
# Get book elements in Library
books = myLibrary.getElementsByTagName("book")
# Print each book's title
writeToCSV(myLibrary)
Here is my output:
title,author
Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman
Good Omens,Neil Gamain
Good Omens,Terry Pratchett
The Man And The Goat,Bubber Elderidge
Once Upon A Time in LA,Dr Dre
There Will Never Be Justice,IR Jury
Beginning Python,"Peter Norton, et al"
Notice that the book "Good Omens" has 2 authors, and is displayed on two separate lines. What I would really like is for it to display as follows:
title,author,author
Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman,,
Good Omens,Neil Gamain,Terry Pratchett
The Man And The Goat,Bubber Elderidge,,
Once Upon A Time in LA,Dr Dre,,
There Will Never Be Justice,IR Jury,,
Beginning Python,"Peter Norton, et al",,
As you can see there are 3 columns, so the two authors display on the same line. Those books that have only one author, there is simply a blank entry, so two commas next to each other.
Upvotes: 2
Views: 3797
Reputation: 2677
Here is one more possible solution:
CODE:
#! /usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
import csv
def writeToCSV(myLibrary):
with open('output.csv', 'wb') as csvfile:
writer = csv.writer(csvfile, delimiter=',',quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(['title', 'author', 'author'])
books = myLibrary.getElementsByTagName("book")
for book in books:
titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
authors = [] # get all the authors in a vector
for author in book.getElementsByTagName("author"):
authors.append(author.childNodes[0].data)
writer.writerow([titleValue] + authors) # write to csv
doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]
# Print each book's title
writeToCSV(myLibrary)
OUTPUT:
title,author,author
Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman
Good Omens,Neil Gamain,Terry Pratchett
The Man And The Goat,Bubber Elderidge
Once Upon A Time in LA,Dr Dre
There Will Never Be Justice,IR Jury
Beginning Python,"Peter Norton, et al"
Kind Regards,
Upvotes: 2
Reputation: 24802
A good way to solve your problem would be using lxml:
>>> with open('doc.xml') as f:
>>> doc = etree.XML(f.read())
>>> for e in doc.xpath('book'):
>>> print (e.xpath('author/text()'), e.xpath('title/text()')[0])
(['Neil Gaiman'], 'Sandman Volume 1: Preludes and Nocturnes')
(['Neil Gamain', 'Terry Pratchett'], 'Good Omens')
(['Bubber Elderidge'], 'The Man And The Goat')
(['Dr Dre'], 'Once Upon A Time in LA')
(['IR Jury'], 'There Will Never Be Justice')
(['Peter Norton, et al'], 'Beginning Python')
Then to generate your csv, you could do something like:
with open('output.csv', 'w') as fout:
fieldnames = ['title', 'authors']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for e in doc.xpath('book'):
title, authors = e.xpath('author/text()'), e.xpath('title/text()')[0]
writer.writerow({'title': titleValue, 'author': authors.join(';')})
or:
with open('output.csv', 'w') as fout:
fieldnames = ['title', 'author1', 'author2']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for e in doc.xpath('book'):
title, authors = e.xpath('author/text()'), e.xpath('title/text()')[0]
author1, author2 = '', ''
if len(authors) == 2:
author2 = author[1]
if len(authors) == 1:
author1 = author[0]
writer.writerow({'title': titleValue, 'author1': author1, 'author2': author2})
Upvotes: 3
Reputation: 5232
To get the two authors on the same line requires just some basic loop programming. For each title you need to iterate over the entire list to search for another author on the same title.
Or, first sort the list by title, which will bring two authors on adjacent records. You can sort the xml structures directly with xml library calls.
Upvotes: 0