Reputation: 173
I am working with my Evernote data - extracted to an xml file. I have parsed the data using BeautifulSoup and here is a sampling of my xml data.
<note>
<title>
Audio and camera roll from STUDY DAY! in San Francisco
</title>
<content>
<![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note><div><en-media type="image/jpeg" hash="e3a84de41c9886b93a6921413b8482d5" width="1080" style="" height="1920"/><en-media type="image/jpeg" hash="b907b22a9f2db379aec3739d65ce62db" width="1123" style="" height="1600"/><en-media type="audio/wav" hash="d3fdcd5a487531dc156a8c5ef6000764" style=""/><br/></div>
</en-note>
]]>
</content>
<created>
20130217T153800Z
</created>
<updated>
20130217T154311Z
</updated>
<note-attributes>
<latitude>
37.78670730072799
</latitude>
<longitude>
-122.4171893858559
</longitude>
<altitude>
42
</altitude>
<source>
mobile.iphone
</source>
<reminder-order>
0
</reminder-order>
</note-attributes>
<resource>
<data encoding="base64">
There are two avenues I would like to explore here: 1. Finding and removing Specific tags (in this case ) 2. locating a group/list of tags to extract to another document
This is my current code which parses the xml prettifies it and outputs to a text file.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myNotes.xml','r'))
with open("file.txt", "w") as f:
f.write(soup.prettify().encode('utf8'))
Upvotes: 0
Views: 2597
Reputation: 13720
If you're using BeautifulSoup
, you could use the getText()
method to strip out the tags in the child elements and get one consolidated text
source.getText()
Upvotes: 1
Reputation: 329
You can search nodes by name
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('myNotes.xml', 'r'))
source = soup.source
print source
#<source>
# mobile.iphone
#</source>
source = soup.source
print source.string
# mobile.iphone
Another way to do it, findAll method:
for tag in soup.findAll('source'):
print tag.string
if you want to print every node stripping tags, this should do the job:
for tag in soup.findAll():
print tag.string
Hope it helps.
EDIT:________
BeautifulSoup asumes you know the structure, although by definition xml is a structured data storage. So you need to give a guideline to BS to parse your xml.
row = []
title = soup.note.title.string
created = soup.note.created.string
row.append(title)
row.append(created)
Now you only have to iterate over xml.
Upvotes: 2