Reputation: 5840
I have this XML file:
<movie id = 0>
<Movie_name>The Shawshank Redemption </Movie_name>
<Address>http://www.imdb.com/title/tt0111161/
</Address>
<year>1994 </year>
<stars>Tim Robbins Morgan Freeman Bob Gunton </stars>
<plot> plot...
</plot>
<keywords>Reviews, Showtimes</keywords>
</movie>
<movie id = 1>
<Movie_name>Inglourious Basterds </Movie_name>
<Address>http://www.imdb.com/title/tt0361748/
</Address>
<year>2009 </year>
<stars>Brad Pitt Mélanie Laurent Christoph Waltz </stars>
<plot>plot/...
</plot>
<keywords>Reviews, credits </keywords>
</movie>
How can I iterate over the file, extracting the specific data for each movie? I mean, for movie 0: its name, address, year, and so on.
The input file structure is mandatory (I cannot change it), so the data extraction has to be done while looping over it.
Many thanks.
Upvotes: 1
Views: 1269
Reputation: 20762
BeautifulSoup is more forgiving, and it can also be used for HTML (where some enclosing tags are optional). ElementTree can be used only if the XML is well-formed. You can make your fragment well-formed by wrapping it in a single root element and enclosing the attribute values in quotes. Try the following approach, where the Movie class was created to capture the information from one movie element. The class is derived from dict to be as flexible as a dict; however, you can add your own methods to return processed values from the collected information:
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

class Movie(dict):
    def __init__(self, movie_element):
        assert movie_element.tag == 'movie'  # we are able to process only that
        self['id'] = movie_element.attrib['id']
        for e in movie_element:
            # e.text is None for an empty element, so fall back to ''
            self[e.tag] = (e.text or '').strip()

    def name(self):
        return self['Movie_name']

    def url(self):
        return self['Address']

    def year(self):
        return self['year']

    def stars(self):
        return self['stars']

    def plot(self):
        return self['plot']

    def keywords(self):
        return self['keywords']

    def __str__(self):
        lst = []
        lst.append(self.name() + ' (' + self.year() + ')')
        lst.append(self.stars())
        lst.append(self.url())
        return '\n'.join(lst)

fragment = '''\
<movie id = "0">
<Movie_name>The Shawshank Redemption </Movie_name>
<Address>http://www.imdb.com/title/tt0111161/
</Address>
<year>1994 </year>
<stars>Tim Robbins Morgan Freeman Bob Gunton </stars>
<plot> plot...
</plot>
<keywords>Reviews, Showtimes</keywords>
</movie>
<movie id = "1">
<Movie_name>Inglourious Basterds </Movie_name>
<Address>http://www.imdb.com/title/tt0361748/
</Address>
<year>2009 </year>
<stars>Brad Pitt Melanie Laurent Christoph Waltz </stars>
<plot>plot/...
</plot>
<keywords>Reviews, credits </keywords>
</movie>
'''
# Wrap the fragment in a single root element so that it parses
fixed_fragment = '<root>\n' + fragment + '</root>'
# print(fixed_fragment)

tree = ET.fromstring(fixed_fragment)

movies = []
for m in tree:
    movies.append(Movie(m))

for movie in movies:
    print('\n------------------')
    print(movie)
It prints on my console:
------------------
The Shawshank Redemption (1994)
Tim Robbins Morgan Freeman Bob Gunton
http://www.imdb.com/title/tt0111161/
------------------
Inglourious Basterds (2009)
Brad Pitt Melanie Laurent Christoph Waltz
http://www.imdb.com/title/tt0361748/
Notice that I have replaced the non-ASCII character (the é in Mélanie) with plain ASCII; the encoding problem has to be solved separately.
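If the input file itself cannot be edited, the two repairs described above (quoting the attribute values and adding a single root element) can be applied in code before parsing. The following is only a minimal sketch, assuming Python 3 and assuming that the unquoted `id` values and the missing root are the only problems in the file. The helper name `repair_fragment` and the `<root>` wrapper are made up for illustration. In Python 3 a `str` holds non-ASCII characters directly, so the names are left unchanged here:

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the input from the question (unquoted ids, no root element)
raw = '''\
<movie id = 0>
<Movie_name>The Shawshank Redemption </Movie_name>
<year>1994 </year>
</movie>
<movie id = 1>
<Movie_name>Inglourious Basterds </Movie_name>
<year>2009 </year>
<stars>Brad Pitt Mélanie Laurent Christoph Waltz </stars>
</movie>
'''


def repair_fragment(text):
    """Hypothetical helper: quote bare numeric ids and add one root element."""
    quoted = re.sub(r'<movie\s+id\s*=\s*(\d+)\s*>', r'<movie id="\1">', text)
    return '<root>\n' + quoted + '</root>'


root = ET.fromstring(repair_fragment(raw))

records = []
for movie in root:
    record = {'id': movie.get('id')}
    for child in movie:
        # child.text is None for an empty element
        record[child.tag] = (child.text or '').strip()
    records.append(record)

for record in records:
    print(record['id'], record['Movie_name'], record['year'])
```

The regular expression only handles the `<movie id = N>` pattern shown in the question; any other malformation would still raise `ParseError`.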
Upvotes: 2
Reputation: 10923
EDIT -- taking on board the improved XML input
I would strongly recommend trying to validate your input as in the comment by @Lattyware. I find that with invalid XML and HTML, BeautifulSoup does a good job of recovering something usable. Here is what it does with a quick try:
from BeautifulSoup import BeautifulSoup
# Note: I have added the <movielist> root element
xml = """<movielist>
<movie id = 0>
<Movie_name>The Shawshank Redemption </Movie_name>
<Address>http://www.imdb.com/title/tt0111161/
</Address>
<year>1994 </year>
<stars>Tim Robbins Morgan Freeman Bob Gunton </stars>
<plot> plot...
</plot>
<keywords>Reviews, Showtimes</keywords>
</movie>
<movie id = 1>
<Movie_name>Inglourious Basterds </Movie_name>
<Address>http://www.imdb.com/title/tt0361748/
</Address>
<year>2009 </year>
<stars>Brad Pitt Mélanie Laurent Christoph Waltz </stars>
<plot>plot/...
</plot>
<keywords>Reviews, credits </keywords>
</movie>
</movielist>"""
soup = BeautifulSoup(xml)
movies = soup.findAll('movie')
for movie in movies:
    id_tag = movie['id']
    name = movie.find("movie_name").text
    url = movie.find("address").text
    year = movie.find("year").text
    stars = movie.find("stars").text
    plot = movie.find("plot").text
    keywords = movie.find("keywords").text
    for item in (id_tag, name, url, year, stars, plot, keywords):
        print(item)
    print('=' * 50)
This will output the following (the id attribute is accessible now):
0
The Shawshank Redemption
http://www.imdb.com/title/tt0111161/
1994
Tim Robbins Morgan Freeman Bob Gunton
plot...
Reviews, Showtimes
==================================================
1
Inglourious Basterds
http://www.imdb.com/title/tt0361748/
2009
Brad Pitt Mélanie Laurent Christoph Waltz
plot/...
Reviews, credits
==================================================
It hopefully gives you a start... It can only get better from here.
Upvotes: 3
Reputation: 88987
You will want to check out xml.etree.ElementTree.
I'd also note that what you have there is not valid XML, so you might run into issues. Valid XML would probably look more like this:
<movie id="0">
<name>The Shawshank Redemption</name>
<url>http://www.imdb.com/title/tt0111161/</url>
<year>1994</year>
<stars>
<star>Tim Robbins</star>
<star>Morgan Freeman</star>
<star>Bob Gunton</star>
</stars>
<plot>plot...</plot>
<keywords>
<keyword>Reviews</keyword>
<keyword>Showtimes</keyword>
</keywords>
</movie>
Note the lowercase tag names and attributes (<movieNum = 0> doesn't make sense). You will also want an XML declaration (like <?xml version="1.0" encoding="UTF-8" ?>) at the top. You can validate your XML at XML Validation, or using xmllint, for example.
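A quick well-formedness check can also be done from Python, since `xml.etree.ElementTree` raises `ParseError` on malformed input. This is just a sketch: `is_well_formed` is a hypothetical helper name, and it checks only that the document parses, not that it conforms to any schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        print('not well-formed:', err)
        return False
    return True


# Unquoted attribute value, as in the question: rejected by the parser
print(is_well_formed('<movie id = 0><year>1994</year></movie>'))
# Quoted attribute value: accepted
print(is_well_formed('<movie id="0"><year>1994</year></movie>'))
```

The first call returns `False` and the second returns `True`.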
Once you have valid XML, you can parse it and iterate over it using iterparse(), or parse it and then iterate over the constructed element tree.
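As a rough illustration of the iterparse() route, here is a sketch in Python 3 that works on XML shaped like the example above. The `<movies>` root element and the in-memory `io.BytesIO` source are assumptions made to keep the example self-contained; with a real file you would pass the file name to `iterparse()` instead.

```python
import io
import xml.etree.ElementTree as ET

xml_text = '''<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie id="0">
    <name>The Shawshank Redemption</name>
    <year>1994</year>
    <stars>
      <star>Tim Robbins</star>
      <star>Morgan Freeman</star>
      <star>Bob Gunton</star>
    </stars>
  </movie>
  <movie id="1">
    <name>Inglourious Basterds</name>
    <year>2009</year>
    <stars>
      <star>Brad Pitt</star>
      <star>Mélanie Laurent</star>
      <star>Christoph Waltz</star>
    </stars>
  </movie>
</movies>
'''

movies = []
source = io.BytesIO(xml_text.encode('utf-8'))
for event, elem in ET.iterparse(source, events=('end',)):
    # An 'end' event fires once an element and all its children are complete
    if elem.tag != 'movie':
        continue
    movies.append({
        'id': elem.get('id'),
        'name': elem.findtext('name'),
        'year': elem.findtext('year'),
        'stars': [star.text for star in elem.findall('stars/star')],
    })
    elem.clear()  # release the children of a movie that is already processed

for movie in movies:
    print(movie['id'], movie['name'], movie['year'], ', '.join(movie['stars']))
```

Handling each movie on its `end` event and then clearing it keeps memory use low for large files; for a small file, `ET.parse()` followed by a plain loop over the root works just as well.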
Upvotes: 3