Bob

Reputation: 289

Using regex or Beautiful Soup to grab someone's website from Instagram

I want to grab someone's website from their Instagram bio. Instagram hides this website inside a text/javascript script tag, so I can't grab the URL the way I normally would with an anchor in BeautifulSoup. Here is a fragment of the page source that contains what I'm trying to capture:

...,"country_block":false,"external_url":"https://www.brittanyannecohen.com/pattern-control","blocked_by_viewer":false,...

I noticed that the link I want to grab is always attached to an external_url key in a dictionary.

I attempted to grab this URL using regex, but it's not working. See the code below:

url=re.findall("[\"external_url\":]['https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+']",soup)

but I get this error:

bad character range [-\w at position 31

Upvotes: 3

Views: 106

Answers (1)

Bohemian

Reputation: 425268

You have square brackets where you should have parentheses:

url=re.findall("[\"external_url\":]['https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+']",soup)
url=re.findall("[\"external_url\":]('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')",soup)
                                   ^--- change [ to (                        ^--- and change ] to )

The clue was in the error message bad character range [-\w, which meant the character class had started earlier than that expression. Looking earlier, we find ['https?:..., which doesn't make sense either, and that's where the problem was.
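To see how the position in the error message leads back to the stray bracket, here is a minimal sketch that compiles the asker's pattern and inspects the exception. The variable names are only illustrative, and the pattern is written as a raw string so it is what the regex engine receives after Python's string escapes are processed.

```python
import re

# The asker's pattern as the regex engine sees it (string escapes removed).
pattern = r"""["external_url":]['https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+']"""

try:
    re.compile(pattern)
    error = None
except re.error as exc:
    error = exc

if error is not None:
    # Full message, including the position of the bad range.
    print(error)
    # Everything the parser consumed before the reported position.
    print(pattern[:error.pos])
    # Index of the nearest "[" before that position: the class that was
    # still open when the parser reached "[-\w".
    print(pattern.rfind("[", 0, error.pos))
```

The last value printed is the index of the bracket in front of 'https, which is the one that should be a parenthesis.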

I don't know if your regex actually works - it's too complicated to check, especially when there's a simpler way to do it.

Use this regex

(?<="external_url":")[^"]+

The entire match will be your target URL.
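As a rough sketch of how that regex could be used from Python, the snippet below runs it against the fragment quoted in the question. The names html and find_external_url are hypothetical, and it assumes the page source is already a plain string (for a BeautifulSoup object, str(soup) would give one, since re functions expect a string rather than a soup object).

```python
import re

# Fragment of page source copied from the question.
html = (
    '...,"country_block":false,'
    '"external_url":"https://www.brittanyannecohen.com/pattern-control",'
    '"blocked_by_viewer":false,...'
)

# The lookbehind anchors the match right after the key and opening quote;
# [^"]+ then consumes everything up to the closing quote.
EXTERNAL_URL = re.compile(r'(?<="external_url":")[^"]+')


def find_external_url(page_source):
    """Return the first external_url value, or None if there is none."""
    match = EXTERNAL_URL.search(page_source)
    return match.group(0) if match else None


print(find_external_url(html))
```

Using search rather than findall returns a single value and makes the "not found" case explicit, for example when a profile has no website and the value is not a quoted string.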

See live regex demo.

Upvotes: 1
