moon

Reputation: 11

python search from tag

I need help with Python programming. I need a command that can find all the words between tags in a text file. For example, the text file contains <concept> food </concept>. I need to find all the words between <concept> and </concept> and display them. Can anybody help, please?

Upvotes: 1

Views: 3033

Answers (3)

nkrkv

Reputation: 7098

There is a great library for traversing HTML/XML named BeautifulSoup. With it:

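# Updated for BeautifulSoup 4 (package bs4) and Python 3; the original
# answer used BeautifulStoneSoup from BeautifulSoup 3 on Python 2.
from bs4 import BeautifulSoup
with open('myfile.xml') as f:
    soup = BeautifulSoup(f, 'html.parser')
for t in soup.find_all('concept'):
    print(t.string)  # text inside each <concept>...</concept>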
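
If installing a third-party library is not an option, a similar traversal can be sketched with the standard library's html.parser module. Note that this is a swapped-in technique, not BeautifulSoup itself. The names ConceptCollector and extract_concepts and the sample string are made up for illustration, and nested <concept> tags are not handled.

```python
from html.parser import HTMLParser


class ConceptCollector(HTMLParser):
    """Hypothetical helper: collects the text between <concept> and </concept>."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while we are between the tags
        self.buffer = []      # text pieces of the current concept
        self.concepts = []    # finished results

    def handle_starttag(self, tag, attrs):
        if tag == 'concept':
            self.inside = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == 'concept' and self.inside:
            self.concepts.append(''.join(self.buffer).strip())
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.buffer.append(data)


def extract_concepts(text):
    """Feed text to the parser and return the collected concept strings."""
    parser = ConceptCollector()
    parser.feed(text)
    parser.close()
    return parser.concepts


sample = "I like <concept> food </concept> and <concept>music</concept>."
print(extract_concepts(sample))
```

For a real file, the string passed to extract_concepts would come from reading the file first, for example with open(...).read().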

Upvotes: 3

phimuemue

Reputation: 35983

Have a look at regular expressions. http://docs.python.org/library/re.html

For example, to extract the contents of the tag <i>, try

import re
text = "text to search. <i>this</i> is the word and also <i>that</i> end"
re.findall("<i>(.*?)</i>", text)  # returns ['this', 'that']

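Here's a short explanation of how findall works: it searches the given string for every non-overlapping match of the regular expression and, because the pattern contains one group, returns the text captured by that group for each match. The regular expression is <i>(.*?)</i>: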

  • <i> matches the literal opening tag <i>
  • (.*?) creates a group and matches as few characters as possible (non-greedy), stopping at the first
  • </i>, which closes the tag

Note that the above solution does not match something like

<i> here's a line
break </i>

because by default . does not match a newline. That is fine if you just want to extract words.

However, it is possible to match across line breaks by passing re.DOTALL:

re.findall("<i>(.*?)</i>",text,re.DOTALL)
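
Applied to the <concept> tag from the question, the same idea might look like the sketch below. The names CONCEPT_RE and find_concepts and the sample string are assumptions for illustration; stripping the surrounding whitespace is an added convenience, not part of the pattern above.

```python
import re

# Non-greedy group; re.DOTALL lets "." match newlines as well.
CONCEPT_RE = re.compile(r"<concept>(.*?)</concept>", re.DOTALL)


def find_concepts(text):
    """Return the stripped text of every <concept>...</concept> pair."""
    return [match.strip() for match in CONCEPT_RE.findall(text)]


sample = "a <concept> food </concept> b <concept>hot\ndrink</concept> c"
print(find_concepts(sample))
```

Regular expressions like this are fine for simple, non-nested tags; for nested or malformed markup a real parser is usually the safer choice.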

Upvotes: 1

Aaron Digulla

Reputation: 328574

  1. Load the text file into a string.
  2. Search the string for the first occurrence of <concept> using pos1 = s.find('<concept>')
  3. Search for </concept> using pos2 = s.find('</concept>', pos1)

The words you seek are then s[pos1+len('<concept>'):pos2]
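
The steps above find only the first occurrence. Since the question asks for all of them, the search can be repeated in a loop, as in this sketch. The function name find_between and the sample string are made up for illustration. str.find returns -1 when nothing is found, which is used here to stop the loop.

```python
def find_between(s, start_tag='<concept>', end_tag='</concept>'):
    """Return every substring of s enclosed by start_tag and end_tag."""
    results = []
    pos = 0
    while True:
        pos1 = s.find(start_tag, pos)
        if pos1 == -1:              # no further opening tag
            break
        pos2 = s.find(end_tag, pos1 + len(start_tag))
        if pos2 == -1:              # opening tag without a closing tag
            break
        results.append(s[pos1 + len(start_tag):pos2].strip())
        pos = pos2 + len(end_tag)   # continue after this closing tag
    return results


sample = "x <concept> food </concept> y <concept>drink</concept> z"
print(find_between(sample))
```

Step 1 (loading the file into a string) is left out of the sketch; the string s would typically come from open(...).read().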

Upvotes: 3
