Reputation: 20302
I'm trying to print the href attributes of the links on the page at the URL below.
Here's my first attempt.
# the Python 3 version:
from bs4 import BeautifulSoup
import urllib.request
resp = urllib.request.urlopen("https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER")
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
    print(link['href'])
When I run that, I get this.
/feed/
/feed/
/feed/
/mynetwork/
/jobs/
/messaging/
/notifications/
#
Here's my second attempt.
# and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="https"]') ]
print(links)
When I run that, I get this.
['https://static-exp1.licdn.com/sc/h/n7m1fekt1d9hawp3s7wats11', 'https://static-exp1.licdn.com/sc/h/al2o9zrvru7aqj8e1x2rzsrca', 'https://static-exp1.licdn.com/sc/h/2if24wp7oqlodqdlgei1n1520', 'https://static-exp1.licdn.com/sc/h/eahiplrwoq61f4uan012ia17i', 'https://static-exp1.licdn.com/sc/h/2if24wp7oqlodqdlgei1n1520', 'https://static-exp1.licdn.com/sc/h/eahiplrwoq61f4uan012ia17i', 'https://static-exp1.licdn.com/sc/h/c7y7qgvm2uh1zn8pgl84l3rty', 'https://static-exp1.licdn.com/sc/h/auhsc2hi2zkvt7nbqep2ejauv', 'https://static-exp1.licdn.com/sc/h/9vf4mi871c6wolrcm3pgqywes', 'https://static-exp1.licdn.com/sc/h/7z1536jzhgep1sw5uk19e8ec7', 'https://static-exp1.licdn.com/sc/h/a0on5mxqtufmy9y66neg9mdgy', 'https://static-exp1.licdn.com/sc/h/1edhu1lemiqjsbgubat2dejxr', 'https://static-exp1.licdn.com/sc/h/2gdon0pq1074su3zwdop1y2g1']
I was expecting to see something like this:
https://www.linkedin.com/in/timlmorgan/
https://www.linkedin.com/in/timmorgan3/
https://www.linkedin.com/in/tim-morgan-19543731/
etc., etc., etc.
I guess LinkedIn must be doing something special that I'm not aware of. When I run the same code against 'https://www.nytimes.com/', I get the results I would expect. This is just a learning exercise; I'm curious to know what's going on here. I'm not interested in actually scraping LinkedIn for data.
Upvotes: 0
Views: 1419
Reputation: 8078
LinkedIn loads its data asynchronously. If you view the page source (Ctrl + U on Windows) for the URL you're fetching, you won't find your expected results, because JavaScript loads them after the page has already rendered with just the base information.
BeautifulSoup won't execute the JavaScript that fetches that data.
To solve this, you would have to figure out which API endpoints the page calls behind the scenes and have your script call those directly, adjusting your request so it passes the CSRF check. Or you could use their official API instead.
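For illustration only, here is a rough sketch of that first approach with requests. Everything LinkedIn-specific below (the endpoint path, the cookie names, the csrf-token header) is an assumption based on what a browser's network tab typically shows, not a documented API; without a real authenticated session the request will simply be rejected.
# a minimal sketch, NOT a working scraper: the endpoint, cookie names and
# CSRF header below are assumptions about LinkedIn's undocumented internals
import requests

session = requests.Session()

# an authenticated session is required; these cookie values are placeholders
# you would have to copy from a logged-in browser session
session.cookies.update({"li_at": "YOUR_SESSION_COOKIE",
                        "JSESSIONID": '"ajax:0123456789"'})

# the CSRF check generally expects the JSESSIONID value echoed back in a header
headers = {"csrf-token": session.cookies.get("JSESSIONID").strip('"')}

# hypothetical JSON endpoint, found by watching the network tab during a search
resp = session.get("https://www.linkedin.com/voyager/api/search/blended",
                   params={"keywords": "tim morgan"},
                   headers=headers)
print(resp.status_code)
print(resp.json() if resp.ok else resp.text[:200])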
Upvotes: 4
Reputation: 20302
I tested some Selenium code which seems to do the trick.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# point Selenium at a local geckodriver binary and allow pages up to 30 seconds to load
driver = webdriver.Firefox(executable_path=r'C:\files\geckodriver.exe')
driver.set_page_load_timeout(30)

driver.get("https://www.google.com/")
driver.get("https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER")

# grab the first anchor, then collect every element that has an href attribute
continue_link = driver.find_element_by_tag_name('a')
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
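One refinement, not in the original snippet: since the profile links are injected by JavaScript after navigation, it can be more reliable to wait explicitly for anchors to appear rather than relying only on the page-load timeout. A minimal sketch using Selenium's WebDriverWait, assuming the same driver and the same (now-deprecated) find_elements_by_xpath API as above; the 10-second timeout is an arbitrary choice.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one anchor with an href to be present
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[@href]")))

for elem in driver.find_elements_by_xpath("//a[@href]"):
    print(elem.get_attribute("href"))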
Upvotes: 3