Mp0int

Reputation: 18727

Regex how to use 'or' for string matching

I want to parse a web page and find specific patterns using regex in Python.

My example page has:

<input type="checkbox" name="some name....">
<input type="text", name="somemore name...">
<input type="radio" name="other name...">

And I want to find all matching name values of the radio and checkbox inputs.

<input type="checkbox" name="(.*?)".*?>
<input type="radio" name="(.*?)".*?>

But I cannot figure out how to combine these two regexes into a single one.

EDIT: This question might go in other directions, but it is better for me to explain what I want to do and ask whether regex is really a suitable choice for it...

I must query a subscriber and get some basic info about the subscriber, plus a list of the subscriber's available loans and charges. The related module has many scripts that do that kind of job with regex. I also use SGMLParser for some parts of my code, but I sometimes see SGMLParser fail to parse HTML (I did not dig into why it fails, but the basic reason is unexpected char type errors). So I must be sure that I can either handle all types of HTML code, or keep on doing this with regex.

CONCLUSION: The best choice is to use HTMLParser, and using regex is simply a very bad idea... That is what I get from this question... But since the question itself is more about regex matching than regex usage in HTML, I decided to accept the answer about regex...

Upvotes: 0

Views: 465

Answers (3)

Joe

Reputation: 47629

<input type="(checkbox|radio)" name="(?P<name>.*?)".*?>

I've also put a capture group name in there for ease of extraction.
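As a rough sketch of how the named group can be read back in Python, the snippet below runs the pattern over the sample markup from the question. The variable names (`html`, `pattern`, `names`) are only illustrative and not part of the original answer.

```python
import re

# Sample markup copied from the question.
html = '''<input type="checkbox" name="some name....">
<input type="text", name="somemore name...">
<input type="radio" name="other name...">'''

# (checkbox|radio) is the "or"; (?P<name>...) captures the attribute value by name.
pattern = re.compile(r'<input type="(checkbox|radio)" name="(?P<name>.*?)".*?>')

# finditer yields match objects, so the group can be looked up by its name.
names = [m.group("name") for m in pattern.finditer(html)]
print(names)  # ['some name....', 'other name...']
```

The text input is skipped because its type is neither `checkbox` nor `radio`.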

But the old rule applies: don't use regex for parsing HTML. It's very fragile. What if the code you are parsing changed to be <input class="aha" type="checkbox" name="some name...."> overnight? Use the HTMLParser class or BeautifulSoup.

http://docs.python.org/library/htmlparser.html

http://www.crummy.com/software/BeautifulSoup/
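For comparison, here is a minimal sketch of the HTMLParser route. It assumes Python 3, where the class lives in `html.parser` (in Python 2 it was the `HTMLParser` module). The class name `InputNameCollector` and the sample markup are made up for illustration; the first tag is the reordered one mentioned above.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collects the name attribute of checkbox and radio inputs (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs; attribute order does not matter here.
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") in ("checkbox", "radio"):
            name = attributes.get("name")
            if name is not None:
                self.names.append(name)


html = '''<input class="aha" type="checkbox" name="some name....">
<input type="text" name="somemore name...">
<input type="radio" name="other name...">'''

collector = InputNameCollector()
collector.feed(html)
collector.close()
print(collector.names)  # ['some name....', 'other name...']
```

Because the parser hands over attributes as pairs, an extra `class` attribute or a different attribute order should not affect the result.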

Upvotes: 4

npinti

Reputation: 52185

You should never process HTML with Regex... there are plenty of threads here showing you why. Maybe you can check out this previous SO thread in which various HTML parsers for Python are discussed.

Upvotes: 2

FailedDev

Reputation: 26930

This?

<input type="(?:checkbox|radio)" name="(.*?)".*?>

While this works... It is not very robust...
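A small sketch of both points, assuming Python's `re` module: with the non-capturing `(?:...)` group, `findall` returns just the name values, and a tag with an extra leading attribute is silently missed. The `original` and `reordered` strings are illustrative inputs, not data from the page being scraped.

```python
import re

# (?:...) groups the alternatives without capturing, so the only
# captured group is the name value and findall returns plain strings.
pattern = re.compile(r'<input type="(?:checkbox|radio)" name="(.*?)".*?>')

original = '<input type="checkbox" name="some name....">'
# Same input, but with an extra attribute before type.
reordered = '<input class="aha" type="checkbox" name="some name....">'

print(pattern.findall(original))   # ['some name....']
print(pattern.findall(reordered))  # []
```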

Upvotes: 2
