user4691770
user4691770

Reputation:

Parsing XML and writing to CSV file

I am parsing a simple XML document with a simple script that I wrote (with a couple of tweaks). Here is the XML:

<?xml version="1.0" ?>
<library owner="John Franks">
 <book>
  <title>Sandman Volume 1: Preludes and Nocturnes</title>
  <author>Neil Gaiman</author>
 </book>
 <book>
  <title>Good Omens</title>
  <author>Neil Gamain</author>
  <author>Terry Pratchett</author>
 </book>
 <book>
  <title>The Man And The Goat</title>
  <author>Bubber Elderidge</author>
 </book>
 <book>
  <title>Once Upon A Time in LA</title>
  <author>Dr Dre</author>
 </book>
 <book>
  <title>There Will Never Be Justice</title>
  <author>IR Jury</author>
 </book>
 <book>
  <title>Beginning Python</title>
  <author>Peter Norton, et al</author>
 </book>
</library>

And here is my Python script:

from xml.dom.minidom import parse
import xml.dom.minidom
import csv

def writeToCSV(myLibrary):
  csvfile = open('output.csv', 'w')
  fieldnames = ['title', 'author', 'author']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  
  books = myLibrary.getElementsByTagName("book")
  for book in books:
    titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
    for author in book.getElementsByTagName("author"):
      authorValue = author.childNodes[0].data
      writer.writerow({'title': titleValue, 'author': authorValue})


doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]

# Get book elements in Library
books = myLibrary.getElementsByTagName("book")

# Print each book's title
writeToCSV(myLibrary)

Here is my output:

title,author

Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman

Good Omens,Neil Gamain

Good Omens,Terry Pratchett

The Man And The Goat,Bubber Elderidge

Once Upon A Time in LA,Dr Dre

There Will Never Be Justice,IR Jury

Beginning Python,"Peter Norton, et al"

Notice that the book "Good Omens" has 2 authors, and is displayed on two separate lines. What I would really like is for it to display as follows:

title,author,author

Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman,,

Good Omens,Neil Gamain,Terry Pratchett

The Man And The Goat,Bubber Elderidge,,

Once Upon A Time in LA,Dr Dre,,

There Will Never Be Justice,IR Jury,,

Beginning Python,"Peter Norton, et al",,

As you can see there are 3 columns, so the two authors display on the same line. Those books that have only one author, there is simply a blank entry, so two commas next to each other.

Upvotes: 2

Views: 3797

Answers (3)

Stanislav
Stanislav

Reputation: 2677

Here is one more possible solution:

CODE:

#! /usr/bin/python

from xml.dom.minidom import parse
import xml.dom.minidom
import csv

def writeToCSV(myLibrary):
    with open('output.csv', 'wb') as csvfile:
        writer = csv.writer(csvfile, delimiter=',',quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(['title', 'author', 'author'])
        books = myLibrary.getElementsByTagName("book")
        for book in books:
            titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
            authors = [] # get all the authors in a vector
            for author in book.getElementsByTagName("author"):
                authors.append(author.childNodes[0].data)
            writer.writerow([titleValue] + authors) # write to csv

doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]
# Print each book's title
writeToCSV(myLibrary)

OUTPUT:

title,author,author
Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman
Good Omens,Neil Gamain,Terry Pratchett
The Man And The Goat,Bubber Elderidge
Once Upon A Time in LA,Dr Dre
There Will Never Be Justice,IR Jury
Beginning Python,"Peter Norton, et al"

Kind Regards,

Upvotes: 2

zmo
zmo

Reputation: 24802

A good way to solve your problem would be using lxml:

>>> with open('doc.xml') as f:
>>>     doc = etree.XML(f.read())
>>>     for e in doc.xpath('book'):
>>>         print (e.xpath('author/text()'), e.xpath('title/text()')[0])
(['Neil Gaiman'], 'Sandman Volume 1: Preludes and Nocturnes')
(['Neil Gamain', 'Terry Pratchett'], 'Good Omens')
(['Bubber Elderidge'], 'The Man And The Goat')
(['Dr Dre'], 'Once Upon A Time in LA')
(['IR Jury'], 'There Will Never Be Justice')
(['Peter Norton, et al'], 'Beginning Python')

Then to generate your csv, you could do something like:

 with open('output.csv', 'w') as fout:
      fieldnames = ['title', 'authors']
      writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
      writer.writeheader()
      for e in doc.xpath('book'):
         title, authors = e.xpath('author/text()'), e.xpath('title/text()')[0]
         writer.writerow({'title': titleValue, 'author': authors.join(';')})

or:

  with open('output.csv', 'w') as fout:
      fieldnames = ['title', 'author1', 'author2']
      writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
      writer.writeheader()
      for e in doc.xpath('book'):
         title, authors = e.xpath('author/text()'), e.xpath('title/text()')[0]
         author1, author2 = '', ''
         if len(authors) == 2:
             author2 = author[1]
         if len(authors) == 1:
             author1 = author[0]
         writer.writerow({'title': titleValue, 'author1': author1, 'author2': author2})

Upvotes: 3

Roland
Roland

Reputation: 5232

To get the two authors on the same line requires just some basic loop programming. For each title you need to iterate over the entire list to search for another author on the same title.

Or, first sort the list by title, which will bring two authors on adjacent records. You can sort the xml structures directly with xml library calls.

Upvotes: 0

Related Questions