user469652

Reputation: 51361

Python regular expression

str1 = 'abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>'

We need the text inside the h1 and h2 tags.

What is the best way to do that?

Thanks for the help!

Upvotes: 1

Views: 256

Answers (2)

Jim Dennis

Reputation: 17520

First bit of advice: DON'T USE REGULAR EXPRESSIONS FOR HTML/XML PARSING!

Now that we've cleared that up, I'd suggest you look at Beautiful Soup. There are other SGML/XML/HTML parsers available for Python. However, this one is the favorite for dealing with the sloppy "tag soup" that most of us find out in the real world. It doesn't require that the input be standards-conformant or well-formed. If your browser can manage to render it, then Beautiful Soup can probably manage to parse it for you.

(Still tempted to use regular expressions for this task? Thinking "it can't be that bad, I just want to extract what's in the <h1>...</h1> and <h2>...</h2> containers" and "I'll never need to handle any other corner cases"? That way lies madness. The code you write based on that line of reasoning will be fragile. It'll work just well enough to pass your tests, and then it will get worse and worse every time you need to fix "just one more thing." Seriously, import a real parser and use it.)
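If installing a third-party package isn't an option, the standard library's html.parser module is also a real parser. The following is only a minimal sketch of that route, not a Beautiful Soup replacement: the class name HeadingCollector and its attributes are made up for illustration, and it assumes the headings are not nested inside each other.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collects the text inside <h1> and <h2> elements."""

    def __init__(self):
        super().__init__()
        self.current = None   # heading tag we are currently inside, or None
        self.headings = []    # list of (tag, text) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.current = tag
            self.headings.append((tag, ""))

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Only keep text that appears while a heading is open.
        if self.current is not None:
            tag, text = self.headings[-1]
            self.headings[-1] = (tag, text + data)


str1 = 'abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>'
collector = HeadingCollector()
collector.feed(str1)
collector.close()
print(collector.headings)
# [('h1', 'The content we need'), ('h2', 'The content we need2')]
```

Unlike a regular expression, this keeps working when the tags carry attributes or are written in upper case, because the parser normalizes tag names before calling the handlers.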

Upvotes: 2

Chris Morgan

Reputation: 90892

The best way, if it needs to scale at all, is to use something like BeautifulSoup.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>')
>>> soup.h1
<h1>The content we need</h1>
>>> soup.h1.text
u'The content we need'
>>> soup.h2
<h2>The content we need2</h2>
>>> soup.h2.text
u'The content we need2'

It could be done with a regular expression too, but this is probably closer to what you want. A larger example of your input would help; without knowing what you actually need to parse, it's hard to give better advice.
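For completeness, here is roughly what the regular-expression route looks like for the exact string in the question. Treat it as a sketch that works only under narrow assumptions: lower-case tags, no attributes, and no nested headings. The names pattern and matches are just illustrative.

```python
import re

str1 = 'abdk3<h1>The content we need</h1>aaaaabbb<h2>The content we need2</h2>'

# Group 1 captures the tag name (h1 or h2); \1 requires the same name in the
# closing tag. The non-greedy .*? stops at the first matching closing tag.
pattern = re.compile(r'<(h[12])>(.*?)</\1>', re.DOTALL)

matches = pattern.findall(str1)
print(matches)
# [('h1', 'The content we need'), ('h2', 'The content we need2')]
```

As soon as the input contains something like <h1 class="title"> or <H1>, this pattern silently returns nothing for that heading, which is the fragility the parser-based approach avoids.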

Upvotes: 6
