Reputation: 785
Hey guys, I'm trying to understand regular expressions while scraping a site. I've used them enough in my code to pull out the following, but I'm stuck here. I need to grab this:
http://www.example.com/online/store/TitleDetail?detail&sku=123456789
from this:
('<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t \tcheck store inventory\r\n\t\t\t </a>', 1)
This is where I got confused. Any ideas?
Edit: the SKU number changes per product, which is where the trouble lies for me.
Upvotes: 1
Views: 359
Reputation: 73
If there are always 9 digits:
http://www\.example\.com/online/store/TitleDetail\?detail&sku=[0-9]{9}
If there is an arbitrary number of digits (at least one):
http://www\.example\.com/online/store/TitleDetail\?detail&sku=[0-9]+
More general:
http.*?sku=[0-9]+
(The ? in .*? makes the match lazy, so it finds the shortest match first and is less likely to span multiple URLs. The literal . and ? in the URL are escaped because both are special characters in a regex.)
Edit: the digit class is [0-9], not [1-9].
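To see what the lazy quantifier buys you, here is a small Python sketch. The two-link sample string is made up for illustration, and the general pattern is written as http.*?sku=[0-9]+ (with the dot that a wildcard needs).

```python
import re

# Hypothetical sample: two product links in one string.
text = (
    "window.location='http://www.example.com/online/store/TitleDetail?detail&sku=123456789';"
    " window.location='http://www.example.com/online/store/TitleDetail?detail&sku=987654321';"
)

# Lazy: .*? stops at the first "sku=" it can, so each URL is matched separately.
lazy = re.findall(r"http.*?sku=[0-9]+", text)

# Greedy: .* runs to the last "sku=", so one match swallows both URLs.
greedy = re.findall(r"http.*sku=[0-9]+", text)

print(len(lazy))    # 2
print(len(greedy))  # 1
```

With only one link in the input both versions return the same thing, so the lazy form mostly matters once you run the pattern over a whole page.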
Upvotes: 0
Reputation: 14515
You don't need regular expressions for that, just use string methods:
result = html[0].split("window.location='")[1].split("'")[0]
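A slightly more defensive take on the same idea, as a rough sketch. It assumes the scraped result is a (markup, count) tuple like the one in the question; the helper name extract_url is made up here, and it uses str.partition instead of chained split calls so a missing marker returns None rather than raising IndexError.

```python
# Assumed shape of the scraped result: a (markup, count) tuple, as in the question.
html = (
    '<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t \tcheck store inventory\r\n\t\t\t </a>',
    1,
)


def extract_url(markup, marker="window.location='"):
    """Return the text between marker and the next single quote, or None."""
    _, found, rest = markup.partition(marker)
    if not found:
        # Marker is absent, so there is nothing to extract.
        return None
    return rest.split("'", 1)[0]


url = extract_url(html[0])
print(url)
```

Note that the \' in the question's output is just how Python displays a single quote inside a single-quoted string; the actual string contains no backslash.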
Upvotes: 0
Reputation: 284927
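import re
# \\? tolerates the literal backslash present in the raw haystack below;
# the actual scraped string has a plain ' there, which this also matches.
pattern = re.compile(r"window\.location=\\?'([^'\\]*)")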
haystack = r"""<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t\tcheck store inventory\r\n\t\t\t</a>"""
url = re.search(pattern, haystack).group(1)
Upvotes: 0
Reputation: 4627
http://www\.example\.com/online/store/TitleDetail\?detail&sku=\d+
Use the \d character class with a greedy + quantifier to match any run of one or more digits in the sku field.
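For what it's worth, here is roughly how that pattern could be used from Python. This is only a sketch: the names SKU_URL and markup are made up, the sample string is a shortened version of the one in the question, and the parentheses around \d+ are an addition so the SKU can be pulled out on its own.

```python
import re

# Escaped URL pattern; the capture group around \d+ is added for illustration.
SKU_URL = re.compile(
    r"http://www\.example\.com/online/store/TitleDetail\?detail&sku=(\d+)"
)

# Shortened, hypothetical version of the markup from the question.
markup = (
    "<a href=\"javascript:if(handleDoubleClick(this.id)){window.location="
    "'http://www.example.com/online/store/TitleDetail?detail&sku=123456789';}\">"
    "check store inventory</a>"
)

match = SKU_URL.search(markup)
if match:
    url = match.group(0)  # the whole URL
    sku = match.group(1)  # just the digits
    print(url)
    print(sku)
```

Since + is greedy, \d+ takes every consecutive digit, so it keeps working if the SKU length changes between products.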
Upvotes: 1