Reputation: 785

How to use regular expressions to pull a substring? (screen scraping)

Hey guys, I'm trying to understand regular expressions while scraping a site. I've used them enough in my code to pull the following, but I'm stuck here. I need to grab this:

http://www.example.com/online/store/TitleDetail?detail&sku=123456789

from this:

('<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t            \tcheck store inventory\r\n\t\t\t            </a>', 1)

This is where I got confused. Any ideas?

Edit: the SKU number changes per product, and that is what's causing me trouble.

Upvotes: 1

Answers (5)

Zach

Reputation: 30321

http://txt2re.com/ might help you

Upvotes: 0

themissinglint

Reputation: 73

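If there are always 9 digits:

http://www\.example\.com/online/store/TitleDetail\?detail&sku=[0-9]{9}

If there are an arbitrary number of digits:

http://www\.example\.com/online/store/TitleDetail\?detail&sku=[0-9]*

More general:

http.*?sku=[0-9]*

(The ? in .*? makes the match lazy: it takes the shortest match it can, so it is less likely to find a match that spans multiple URLs. The literal ? and . in the URL are escaped with a backslash; left unescaped, the ? would make the preceding "l" optional instead of matching a question mark.)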

edit: [0-9]. not [1-9]
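
To see what the lazy quantifier changes in practice, here is a minimal sketch using Python's `re` module. The two-link `text` string is made up for illustration and is not from the scraped page.

```python
import re

# Hypothetical input: two links in one string, invented for illustration.
text = (
    "http://www.example.com/a?sku=111 and "
    "http://www.example.com/b?sku=222"
)

# Greedy .* runs to the last "sku=", so one match spans both URLs.
greedy = re.findall(r"http.*sku=[0-9]*", text)

# Lazy .*? stops at the first "sku=", so each URL is matched separately.
lazy = re.findall(r"http.*?sku=[0-9]*", text)

print(greedy)
print(lazy)
```

With the greedy form you get a single match covering the whole string; with the lazy form you get one match per URL.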

Upvotes: 0

cryo

Reputation: 14515

You don't need regular expressions for that, just use string methods:

result = html[0].split("window.location='")[1].split("'")[0]
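
If the marker might be missing on some pages, chained `split` calls raise an `IndexError`. A slightly more defensive sketch of the same idea uses `str.partition`, which returns empty strings instead of raising. The function name `extract_url` and the shortened sample tuple are assumptions for illustration.

```python
def extract_url(fragment, marker="window.location='"):
    """Return the text between marker and the next single quote, or None."""
    # partition splits at the first occurrence; found is "" if marker is absent.
    _, found, rest = fragment.partition(marker)
    if not found:
        return None
    url, quote, _ = rest.partition("'")
    return url if quote else None


# Shortened stand-in for the scraped tuple in the question.
html = (
    '<a href="javascript:if(handleDoubleClick(this.id))'
    "{window.location='http://www.example.com/online/store/"
    "TitleDetail?detail&sku=123456789';}\" id=\"getTitleDetails_123456789\">"
    "check store inventory</a>",
    1,
)

print(extract_url(html[0]))
```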

Upvotes: 0

Matthew Flaschen

Reputation: 284927

import re

# Assumes the haystack holds a literal backslash before each quote (raw string below).
pattern = re.compile(r"window\.location=\\'([^\\]*)")
haystack = r"""<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t\tcheck store inventory\r\n\t\t\t</a>"""
url = re.search(pattern, haystack).group(1)
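
The pattern above expects a literal backslash before each quote, which fits the raw-string haystack. If the backslashes in the question are only escaping from Python's repr of the tuple, the real string contains plain single quotes. A sketch for that case, assuming a shortened stand-in for the scraped tag:

```python
import re

# Assumption: the scraped string contains plain single quotes, no backslashes.
pattern = re.compile(r"window\.location='([^']*)'")

# Shortened stand-in for the anchor tag in the question.
haystack = (
    "<a href=\"javascript:if(handleDoubleClick(this.id))"
    "{window.location='http://www.example.com/online/store/"
    "TitleDetail?detail&sku=123456789';}\">check store inventory</a>"
)

# search returns None when nothing matches, so guard before calling group.
match = pattern.search(haystack)
url = match.group(1) if match else None
print(url)
```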

Upvotes: 0

arthurprs

Reputation: 4627

http://www\.example\.com/online/store/TitleDetail\?detail&sku=\d+

Use the \d character class with a greedy + quantifier to match any integer value in the sku field. The dots and the question mark are escaped so they match literally.
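
A short sketch of how this pattern might be applied in Python. The name `SKU_URL`, the SKU value, and the shortened `html` string are assumptions for illustration.

```python
import re

# Dots and the question mark are escaped so they match literally.
SKU_URL = re.compile(
    r"http://www\.example\.com/online/store/TitleDetail\?detail&sku=\d+"
)

# Shortened stand-in for the scraped anchor tag, with a different SKU.
html = (
    "<a href=\"javascript:if(handleDoubleClick(this.id))"
    "{window.location='http://www.example.com/online/store/"
    "TitleDetail?detail&sku=987654321';}\">check store inventory</a>"
)

# \d+ consumes digits only, so the match ends before the closing quote.
match = SKU_URL.search(html)
url = match.group(0) if match else None
print(url)
```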

Upvotes: 1
