Reputation: 289
I want to grab someone's website from their Instagram bio. Instagram embeds this URL inside a text/javascript script tag, so I can't grab it the way I normally would with an anchor tag in BeautifulSoup. Here is a fragment of the page source that contains what I'm trying to capture:
...,"country_block":false,"external_url":"https://www.brittanyannecohen.com/pattern-control","blocked_by_viewer":false,...
I noticed that the link I want to grab is always the value of the external_url key in a dictionary.
I attempted to grab this URL using a regex, but it's not working. See the code below:
url=re.findall("[\"external_url\":]['https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+']",soup)
but I get this error:
bad character range [-\w at position 31
Upvotes: 3
Views: 106
Reputation: 425268
You have a square bracket where you should have a parenthesis:
url=re.findall("[\"external_url\":]['https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+']",soup)
url=re.findall("[\"external_url\":]('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')",soup)
Change the [ before 'https?:// to ( and the matching ] at the end of the pattern to ).
The clue was in the error message bad character range [-\w, which meant the character class had started earlier than that expression. Looking earlier we find ['https?:..., which doesn't make sense either, and that's where the problem was.
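To see the diagnosis in action, here is a minimal sketch that compiles the pattern from the question and inspects the resulting exception. The variable names are made up for illustration, and the pattern is written as a raw triple-quoted string only so that its quotes and backslashes need no extra escaping.

```python
import re

# The pattern from the question, unchanged apart from the string quoting.
broken_pattern = r"""["external_url":]['https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+']"""

try:
    re.compile(broken_pattern)
    error = None
except re.error as exc:
    error = exc

# The reported position points at the inner "[", which the regex engine
# treats as a literal character inside the class opened by "['".
print(error)
print(broken_pattern[error.pos])
```

The character at the reported position is the inner [, which shows that the parser was already inside a character class when it reached [-\w.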
I don't know if your regex actually works - it's too complicated to check, especially when there's a simpler way to do it.
Use this regex instead:
(?<="external_url":")[^"]+
The entire match will be your target URL.
See live regex demo.
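As a rough sketch of how the lookbehind regex could be used from Python, the snippet below runs it against the fragment quoted in the question. The names page_source and external_url are illustrative; with a BeautifulSoup object you would presumably pass str(soup) instead of a hard-coded string.

```python
import re

# Fragment of the page source quoted in the question.
page_source = (
    '...,"country_block":false,'
    '"external_url":"https://www.brittanyannecohen.com/pattern-control",'
    '"blocked_by_viewer":false,...'
)

# The lookbehind anchors the match right after "external_url":" and
# [^"]+ consumes everything up to the closing double quote.
match = re.search(r'(?<="external_url":")[^"]+', page_source)

# re.search returns None when the key is absent or its value is not a quoted string.
external_url = match.group(0) if match else None
print(external_url)
```

Using re.search rather than re.findall returns only the first occurrence, which seems sufficient if the key appears once per profile; re.findall with the same pattern would return every occurrence as a list.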
Upvotes: 1