Reputation: 2355
This is the sample XML document:
<bookstore>
<book category="COOKING">
<title lang="english">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>300.00</price>
</book>
<book category="CHILDREN">
<title lang="english">Harry Potter</title>
<author>J K. Rowling </author>
<year>2005</year>
<price>625.00</price>
</book>
</bookstore>
I want to extract the text without specifying the element names. How can I do this? I have 10 such documents, and they all have different structures. The user enters a word that I don't know in advance, and it has to be searched for in the text portions of all 10 XML documents. For this to work, I need to find where the text lies without knowing anything about the elements.
Please Help!!
Upvotes: 0
Views: 3589
Reputation: 3000
If you want to call grep from inside Python, see the discussion here, especially this post.
If you want to search through all the files in a directory you could try something like this using the glob module:
import glob
import re

# Capture the text that sits between a '>' and the next '<' on the same line.
p = re.compile(r'>([^<>\n]+)<')
for path in glob.glob("*.xml"):
    with open(path) as f:
        content = f.read()
    texts = p.findall(content)
    print(texts)
    print()
This iterates through all the XML files in the current directory, opens each one, and extracts the text matching the regexp.
Output:
['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']
EDIT: Updated code to extract only the text elements from the xml.
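To get from extracting text to the actual goal in the question (finding which files contain a word the user typed), the same idea can be wrapped in a small function. This is only a sketch: `find_word_in_xml_files` and its parameters are hypothetical names, it is written for Python 3, it assumes UTF-8 files, and it assumes simple XML without comments or CDATA sections, since a regular expression is not a real XML parser.

```python
import glob
import os
import re

# Matches any tag, e.g. <title lang="english"> or </title>.
TAG = re.compile(r'<[^>]*>')


def find_word_in_xml_files(word, directory="."):
    """Return the sorted names of *.xml files in `directory` whose text contains `word`.

    Hypothetical helper: tags are stripped with a regex, so attribute
    values are ignored and only element text is searched.
    """
    matches = []
    for path in glob.glob(os.path.join(directory, "*.xml")):
        # Assumption: the files are UTF-8 encoded.
        with open(path, encoding="utf-8") as f:
            text_only = TAG.sub(" ", f.read())
        # Case-insensitive substring search on the tag-free text.
        if word.lower() in text_only.lower():
            matches.append(os.path.basename(path))
    return sorted(matches)
```

Calling `find_word_in_xml_files("harry")` in a directory holding the sample document would return that file's name, because the function never looks at element names.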
Upvotes: 0
Reputation: 174622
You could simply strip out any tags:
>>> import re
>>> txt = """<bookstore>
... <book category="COOKING">
... <title lang="english">Everyday Italian</title>
... <author>Giada De Laurentiis</author>
... <year>2005</year>
... <price>300.00</price>
... </book>
...
... <book category="CHILDREN">
... <title lang="english">Harry Potter</title>
... <author>J K. Rowling </author>
... <year>2005</year>
... <price>625.00</price>
... </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n Giada De Laurentiis\n 2005\n 300.00\n \n\n \n Harry Potter\n J K. Rowling \n 2005\n 625.00'
But if you just want to search files for some text in Linux, you can use grep:
burhan@sandbox:~$ grep "Harry Potter" file.xml
<title lang="english">Harry Potter</title>
If you want to search in a file, use the grep command above, or open the file and search for it in Python:
>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists
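Stripping tags with a regex can trip over comments, CDATA sections, and entities such as `&amp;`. As an alternative to the approach above, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, which walks every element regardless of its name. `extract_text` is a hypothetical helper name, and the sample string is a shortened, made-up variant of the question's document with an entity added for illustration.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_string):
    """Return every non-blank text chunk in the document, in document order."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text of all descendants without naming any element.
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]


# Made-up sample: the '&amp;' entity is decoded by the parser.
sample = """<bookstore>
  <book category="CHILDREN">
    <title lang="english">Harry Potter &amp; Friends</title>
    <author>J K. Rowling </author>
  </book>
</bookstore>"""

print(extract_text(sample))
# ['Harry Potter & Friends', 'J K. Rowling']
```

Because the parser does the work, the result is plain text that can be searched with an ordinary `in` check.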
Upvotes: -1
Reputation: 142176
Using the lxml library with an xpath query is possible:
xml="""<bookstore>
<book category="COOKING">
<title lang="english">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>300.00</price>
</book>
<book category="CHILDREN">
<title lang="english">Harry Potter</title>
<author>J K. Rowling </author>
<year>2005</year>
<price>625.00</price>
</book>
</bookstore>
"""
from lxml import etree
# fromstring() already returns the root element, so no getroot() call is needed.
root = etree.fromstring(xml)
root.xpath('/bookstore/book/*/text()')
# ['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']
Note that this does not return the category attribute.
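If attribute values such as `category` should be searchable too, one option is to walk the tree and collect both text and attributes. The sketch below uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra installs; `iter_searchable_strings` and `contains_word` are hypothetical names, not part of either library.

```python
import xml.etree.ElementTree as ET


def iter_searchable_strings(root):
    """Yield attribute values and non-blank text for every element under root."""
    for element in root.iter():  # root itself plus all descendants
        for value in element.attrib.values():
            yield value
        if element.text and element.text.strip():
            yield element.text.strip()
        if element.tail and element.tail.strip():
            yield element.tail.strip()


def contains_word(xml_string, word):
    """Case-insensitive check of `word` against text and attribute values."""
    root = ET.fromstring(xml_string)
    needle = word.lower()
    return any(needle in s.lower() for s in iter_searchable_strings(root))
```

With lxml itself, the XPath expressions `//text()` and `//@*` should give a similar element-agnostic result, returning all text nodes and all attribute values respectively.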
Upvotes: 2