Reputation: 41
The link below gives the list of ingredients in recipelist. I would like to extract the names of the ingredients and save them to another file using Python. http://stream.massey.ac.nz/file.php/6087/Eva_Material/Tutorials/recipebook.xml
So far I have tried the following code, but it prints the complete recipe rather than just the names of the ingredients:
from xml.sax.handler import ContentHandler
import xml.sax
import sys

def recipeBook():
    path = "C:\Users\user\Desktop"
    basename = "recipebook.xml"
    filename = path+"\\"+basename
    file=open(filename,"rt")
    # find contents
    contents = file.read()

    class textHandler(ContentHandler):
        def characters(self, ch):
            sys.stdout.write(ch.encode("Latin-1"))

    parser = xml.sax.make_parser()
    handler = textHandler( )
    parser.setContentHandler(handler)
    parser.parse("C:\Users\user\Desktop\\recipebook.xml")
    file.close()
How do I extract the name of each ingredient and save them to another file?
Upvotes: 2
Views: 21821
Reputation: 41
@Neha
I guess you have solved this by now, but here is a small snippet I put together using the tutorial at http://lxml.de/tutorial.html. The XML file is saved as 'rough_data.xml'.
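import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

# Parse the file and get its root element
xmlDocTree = etree.parse('rough_data.xml').getroot()
# iter() visits every <ingredient> element at any depth
for ingredient in xmlDocTree.iter('ingredient'):
    # ingredient[0] is the first child element, which holds the name here
    print(ingredient[0].text)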
To all experienced Python programmers reading this: kindly improve this "newbie" code.
Note: The lxml package looks very good; it is definitely worth using. Thanks.
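Since the question also asks how to save the names to another file, the loop above could be extended along these lines. This is only a minimal sketch: the sample XML, the `name` child element, the function name `save_ingredient_names` and the output filename are all assumptions, because the real recipebook.xml is not shown in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real recipebook.xml may be structured differently.
SAMPLE_XML = """<recipebook>
  <recipe>
    <ingredient><name>flour</name><quantity>2 cups</quantity></ingredient>
    <ingredient><name>sugar</name><quantity>1 cup</quantity></ingredient>
  </recipe>
</recipebook>"""


def save_ingredient_names(xml_text, out_path):
    """Write one ingredient name per line to out_path and return the names."""
    root = ET.fromstring(xml_text)
    names = []
    for ingredient in root.iter("ingredient"):
        name = ingredient.find("name")  # assumed child element holding the name
        if name is not None and name.text:
            names.append(name.text.strip())
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(names) + "\n")
    return names


if __name__ == "__main__":
    print(save_ingredient_names(SAMPLE_XML, "ingredient_names.txt"))
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.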
Upvotes: 3
Reputation: 1508
Some time ago I made a series of screencasts that explain how to collect data from sites. The code is in Python and there are 2 videos about XML parsing with the lxml library. All the videos are posted here: http://railean.net/index.php/2012/01/27/fortune-cowsay-python-video-tutorial
The ones you want are:
You'll learn how to write and test XPath queries, as well as how to run such queries in Python. The examples are straightforward, and I hope you'll find them helpful.
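As a rough illustration of what running an XPath-style query from Python looks like, here is a minimal sketch. It is not taken from the videos: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, instead of lxml, and the sample document with its element and attribute names is made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up document used only to exercise the queries below.
SAMPLE_XML = """<recipebook>
  <recipe title="Pancakes">
    <ingredient><name>flour</name></ingredient>
    <ingredient><name>milk</name></ingredient>
  </recipe>
  <recipe title="Omelette">
    <ingredient><name>eggs</name></ingredient>
  </recipe>
</recipebook>"""

root = ET.fromstring(SAMPLE_XML)

# ".//name" matches every <name> element anywhere below the root.
all_names = [el.text for el in root.findall(".//name")]

# A predicate in square brackets filters on an attribute value.
pancake_names = [
    el.text
    for el in root.findall("./recipe[@title='Pancakes']/ingredient/name")
]

print(all_names)
print(pancake_names)
```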
Upvotes: 0
Reputation: 4770
Please post the relevant XML text in the question in order to receive a proper answer. Also, please consider using lxml for anything XML-specific (including HTML).
Try this:
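from lxml import etree

tree = etree.parse("your xml here")
# The element names below are guesses; adjust them to the actual document.
# An absolute path is used because relative paths on a tree start at the root element.
all_recipes = tree.xpath('/recipebook/recipe')
# xpath() returns a list per recipe, so each of these is a list of lists
recipe_names = [x.xpath('recipe_name/text()') for x in all_recipes]
ingredients = [x.xpath('ingredient_list/ingredients') for x in all_recipes]
ingredient_names = [x.xpath('ingredient_list/ingredients/ingredient_name/text()') for x in all_recipes]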
This is only a starting point, but I think you get the idea: from each ingredient_name, get the parent element and read the ingredients and quantities from there, and so on. I don't think any other kind of search is really possible, given the structured nature of the document.
You can read more at www.lxml.de.
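To make the "and so on" concrete, here is a minimal sketch that collects each recipe's ingredient names and quantities together. It reuses the element names guessed in the snippet above (recipe_name, ingredient_list, ingredients, ingredient_name) and adds a hypothetical `quantity` element. It uses the standard library rather than lxml so that it runs without extra installs; the names will need adjusting to the real document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document following the element names guessed above.
SAMPLE_XML = """<recipebook>
  <recipe>
    <recipe_name>Pancakes</recipe_name>
    <ingredient_list>
      <ingredients><ingredient_name>flour</ingredient_name><quantity>2 cups</quantity></ingredients>
      <ingredients><ingredient_name>milk</ingredient_name><quantity>1 cup</quantity></ingredients>
    </ingredient_list>
  </recipe>
</recipebook>"""


def ingredients_by_recipe(xml_text):
    """Map each recipe name to a list of (ingredient name, quantity) pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for recipe in root.iter("recipe"):
        title = recipe.findtext("recipe_name", default="").strip()
        pairs = []
        for item in recipe.findall("ingredient_list/ingredients"):
            name = item.findtext("ingredient_name", default="").strip()
            quantity = item.findtext("quantity", default="").strip()  # assumed element
            pairs.append((name, quantity))
        result[title] = pairs
    return result


if __name__ == "__main__":
    print(ingredients_by_recipe(SAMPLE_XML))
```

Walking down from each recipe element keeps every ingredient tied to its recipe, so no parent lookup is needed afterwards.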
Upvotes: 1