Reputation: 51361
str1 = 'abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>'
We need the contents inside the h1 and h2 tags.
What is the best way to do that?
Thanks for the help!
Upvotes: 1
Views: 256
Reputation: 17520
First bit of advice: DON'T USE REGULAR EXPRESSIONS FOR HTML/XML PARSING!
Now that we've cleared that up, I'd suggest you look at Beautiful Soup. There are other SGML/XML/HTML parsers available for Python, but this one is the favorite for dealing with the sloppy "tag soup" that most of us find out in the real world. It doesn't require the input to be standards-conformant or even well-formed. If your browser can manage to render it, then Beautiful Soup can probably manage to parse it for you.
(Still tempted to use regular expressions for this task? Thinking "it can't be that bad, I just want to extract what's in the <h1>...</h1> and <h2>...</h2> containers" and "I'll never need to handle any other corner cases"? That way lies madness. The code you write based on that line of reasoning will be fragile. It'll work just well enough to pass your tests, and then it will get worse and worse every time you need to fix "just one more thing." Seriously, import a real parser and use it.)
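If you can't install a third-party package, here is a rough sketch of the "use a real parser" idea with only the standard library's html.parser module (Python 3). Note this is a swap-in for Beautiful Soup, not Beautiful Soup itself, and it is less forgiving of badly broken markup. The HeadingExtractor class and its attribute names are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the text inside each <h1> and <h2> element (hypothetical helper)."""

    WANTED = {"h1", "h2"}

    def __init__(self):
        super().__init__()
        self.headings = []    # (tag, text) pairs in document order
        self._current = None  # tag currently being collected, or None
        self._chunks = []     # text pieces seen inside the current tag

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        if tag in self.WANTED:
            self._current = tag
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears while we are inside a wanted tag.
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append((tag, "".join(self._chunks)))
            self._current = None


str1 = "abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>"
parser = HeadingExtractor()
parser.feed(str1)
parser.close()
print(parser.headings)
```

The stray text outside the headings ("abdk3", "aaaaabbb") is simply ignored, because handle_data only records text while a wanted tag is open.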
Upvotes: 2
Reputation: 90892
If it needs to scale at all, the best way would be to use something like BeautifulSoup.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
>>> soup.h1
<h1>The content we need</h1>
>>> soup.h1.text
u'The content we need'
>>> soup.h2
<h2>The content we need2</h2>
>>> soup.h2.text
u'The content we need2'
It could be done with a regular expression too, but this is probably closer to what you want. A larger example of your input would help; without knowing exactly what you want to parse, it's hard to give proper advice.
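For completeness, the regular-expression route might look something like the sketch below. It assumes the headings are never nested, never contain other tags you care about, and always have a matching close tag. It handles the sample string from the question and not much more. The name HEADING_RE is just for illustration.

```python
import re

str1 = "abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>"

# Match <h1> or <h2> (optionally with attributes), a non-greedy body,
# and then the close tag with the same name via the \1 backreference.
HEADING_RE = re.compile(r"<(h[12])\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

# Build (tag, text) pairs in the order the headings appear.
headings = [(m.group(1).lower(), m.group(2)) for m in HEADING_RE.finditer(str1)]
print(headings)
```

As soon as the input stops looking like the sample (comments, unclosed tags, headings inside other markup), a pattern like this starts missing or mangling matches, which is why a parser is the safer default.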
Upvotes: 6