Reputation: 20302
I'm trying to print the href attributes of the links on the page at the URL below.
Here's my first attempt.
# the Python 3 version:
from bs4 import BeautifulSoup
import urllib.request
resp = urllib.request.urlopen("https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER")
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
    print(link['href'])
When I run that, I get this.
/feed/
/feed/
/feed/
/mynetwork/
/jobs/
/messaging/
/notifications/
#
Here's my second attempt.
# and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="https"]') ]
print(links)
When I run that, I get this.
['https://static-exp1.licdn.com/sc/h/n7m1fekt1d9hawp3s7wats11', 'https://static-exp1.licdn.com/sc/h/al2o9zrvru7aqj8e1x2rzsrca', 'https://static-exp1.licdn.com/sc/h/2if24wp7oqlodqdlgei1n1520', 'https://static-exp1.licdn.com/sc/h/eahiplrwoq61f4uan012ia17i', 'https://static-exp1.licdn.com/sc/h/2if24wp7oqlodqdlgei1n1520', 'https://static-exp1.licdn.com/sc/h/eahiplrwoq61f4uan012ia17i', 'https://static-exp1.licdn.com/sc/h/c7y7qgvm2uh1zn8pgl84l3rty', 'https://static-exp1.licdn.com/sc/h/auhsc2hi2zkvt7nbqep2ejauv', 'https://static-exp1.licdn.com/sc/h/9vf4mi871c6wolrcm3pgqywes', 'https://static-exp1.licdn.com/sc/h/7z1536jzhgep1sw5uk19e8ec7', 'https://static-exp1.licdn.com/sc/h/a0on5mxqtufmy9y66neg9mdgy', 'https://static-exp1.licdn.com/sc/h/1edhu1lemiqjsbgubat2dejxr', 'https://static-exp1.licdn.com/sc/h/2gdon0pq1074su3zwdop1y2g1']
I was expecting to see something like this:
https://www.linkedin.com/in/timlmorgan/
https://www.linkedin.com/in/timmorgan3/
https://www.linkedin.com/in/tim-morgan-19543731/
etc., etc., etc.
I guess LinkedIn must be doing something special that I'm not aware of. When I run the same code against 'https://www.nytimes.com/', I get the results I would expect. This is just a learning exercise; I'm curious to know what's going on here. I'm not interested in actually scraping LinkedIn for data.
Upvotes: 0
Views: 1419
Reputation: 8078
LinkedIn loads its data asynchronously. If you view the page source (Ctrl + U on Windows) for the URL you're fetching, you won't find your expected results, because JavaScript loads them after the page has already rendered with just the base information.
BeautifulSoup won't execute the JavaScript that fetches that data.
To solve this, you would have to figure out which API endpoints the page calls behind the scenes and have your script call those directly, adjusting your request so it passes the CSRF check. Or you could use their official API instead.
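For illustration only, here is a rough sketch of that first approach with requests. Everything LinkedIn-specific below (the endpoint path, the cookie names, the csrf-token header) is an assumption based on what a browser's network tab typically shows, not a documented API; without a real authenticated session the request will simply be rejected.
# a minimal sketch, NOT a working scraper: the endpoint, cookie names and
# CSRF header below are assumptions about LinkedIn's undocumented internals
import requests

session = requests.Session()

# an authenticated session is required; these cookie values are placeholders
# you would have to copy from a logged-in browser session
session.cookies.update({"li_at": "YOUR_SESSION_COOKIE",
                        "JSESSIONID": '"ajax:0123456789"'})

# the CSRF check generally expects the JSESSIONID value echoed back in a header
headers = {"csrf-token": session.cookies.get("JSESSIONID").strip('"')}

# hypothetical JSON endpoint, found by watching the network tab during a search
resp = session.get("https://www.linkedin.com/voyager/api/search/blended",
                   params={"keywords": "tim morgan"},
                   headers=headers)
print(resp.status_code)
print(resp.json() if resp.ok else resp.text[:200])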
Upvotes: 4
Reputation: 20302
I tested some Selenium code which seems to do the trick.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# point Selenium at a local geckodriver binary and allow pages up to 30 seconds to load
driver = webdriver.Firefox(executable_path=r'C:\files\geckodriver.exe')
driver.set_page_load_timeout(30)

driver.get("https://www.google.com/")
driver.get("https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER")

# grab the first anchor, then collect every element that has an href attribute
continue_link = driver.find_element_by_tag_name('a')
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
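One refinement, not in the original snippet: since the profile links are injected by JavaScript after navigation, it can be more reliable to wait explicitly for anchors to appear rather than relying only on the page-load timeout. A minimal sketch using Selenium's WebDriverWait, assuming the same driver and the same (now-deprecated) find_elements_by_xpath API as above; the 10-second timeout is an arbitrary choice.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one anchor with an href to be present
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[@href]")))

for elem in driver.find_elements_by_xpath("//a[@href]"):
    print(elem.get_attribute("href"))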
Upvotes: 3