Reputation: 462
So I am trying to webscrape LinkedIn's about page to get the 'specialties' for certain companies. When trying to scrape LinkedIn with beautiful soup it gives me an access denied error so I am using a header to fake my browser. However, it gives this output instead of the corresponding HTML:
window.onload = function() {
   // Parse the tracking code from cookies.
   var trk = "bf";
   var trkInfo = "bf";
   var cookies = document.cookie.split("; ");
   for (var i = 0; i < cookies.length; ++i) {
     if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
       trk = cookies[i].substring(8);
     }
     else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
       trkInfo = cookies[i].substring(8);
     }
   }

   if (window.location.protocol == "http:") {
     // If "sl" cookie is set, redirect to https.
     for (var i = 0; i < cookies.length; ++i) {
       if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
         window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
         return;
       }
     }
   }

   // Get the new domain. For international domains such as
   // fr.linkedin.com, we convert it to www.linkedin.com
   var domain = "www.linkedin.com";
   if (domain != location.host) {
     var subdomainIndex = location.host.indexOf(".linkedin");
     if (subdomainIndex != -1) {
       domain = "www" + location.host.substring(subdomainIndex);
     }
   }

   window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
     "&originalReferer=" + document.referrer.substr(0, 200) +
     "&sessionRedirect=" + encodeURIComponent(window.location.href);
}
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.linkedin.com/company/biotech/'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
response = requests.get(url, headers=headers)
print(response.content)
What am I doing wrong? I think it is trying to check for cookies. Is there a way I could handle that in my code?
Upvotes: 5
Views: 2967
Reputation: 8270
You can use Selenium to get the page along with its dynamically generated JS content. You also have to log in, since the page you want to retrieve requires authentication. So:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
EMAIL = ''
PASSWORD = ''

driver = webdriver.Chrome()
driver.get('https://www.linkedin.com/company/biotech/')

# Open the login form on the auth wall and fill in the credentials.
el = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'form-toggle')))
driver.execute_script("arguments[0].click();", el)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'login-email'))).send_keys(EMAIL)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'login-password'))).send_keys(PASSWORD)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'login-submit'))).click()

# Wait for the company page to render, then read the Specialties entry.
text = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="ember71"]/dl/dd[4]'))).text
print(text)
Output:
Distributing medical products
Upvotes: 0
Reputation: 3229
LinkedIn is actually performing some interesting Cookie setting and subsequent redirects, which prevents your code from working as is. This is clear from examining the JavaScript that is returned upon your initial request. Basically, HTTP Cookies are set by the web server for tracking information, and those cookies are parsed by the JavaScript you encounter, before the final redirection occurs. If you reverse engineer the JavaScript, you'll find that the final redirect is something like this (at least for me based on my location and tracking info):
url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
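That redirect logic can be mirrored in Python to see how the authwall URL is assembled (a sketch; `build_authwall_url` is a hypothetical helper, and `trk`/`trkInfo` fall back to `"bf"` when the tracking cookies are absent, just as in the JavaScript above):

```python
from urllib.parse import quote

def build_authwall_url(original_url, referrer="", trk="bf", trk_info="bf"):
    # Mirrors the JavaScript: the referrer is truncated to 200 characters
    # and the original URL is percent-encoded into sessionRedirect.
    return ("https://www.linkedin.com/authwall?trk=" + trk +
            "&trkInfo=" + trk_info +
            "&originalReferer=" + referrer[:200] +
            "&sessionRedirect=" + quote(original_url, safe=""))

print(build_authwall_url("https://www.linkedin.com/company/biotech/"))
# https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F
```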
Also, you can use Python's requests module to maintain a session for you, which will automatically manage cookies across requests so you don't have to worry about them. The following should give you the HTML source you are looking for. I'll leave it to you to implement BeautifulSoup and parse what you desire.
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
with requests.Session() as s:
    response = s.get(url)
    print(response.content)
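As a starting point for that parsing step, here is a minimal sketch of pulling a "Specialties" entry out of a definition list with BeautifulSoup. The `<dl>`/`<dt>`/`<dd>` markup below is a simplified stand-in for `response.content`; the real page's structure may differ:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the HTML returned by the request above.
html = """
<dl>
  <dt>Website</dt><dd>example.com</dd>
  <dt>Specialties</dt><dd>Distributing medical products</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")
# Locate the entry by its <dt> label rather than by position or by an
# auto-generated id, which can change between page loads.
label = soup.find("dt", string="Specialties")
specialties = label.find_next_sibling("dd").get_text(strip=True)
print(specialties)  # Distributing medical products
```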
Upvotes: 4
Reputation: 463
You need to parse the response with BeautifulSoup first.
# Use the html parser to parse the response content and store it in a variable.
page_content = BeautifulSoup(response.content, "html.parser")

textContent = []
for i in range(0, 20):
    paragraphs = page_content.find_all("p")[i].text
    textContent.append(paragraphs)
# In my use case, I want to store the speech data I mentioned earlier. So in this
# example, I loop through the paragraphs and push them into a list so that I can
# manipulate and do fun stuff with the data.
Not my example, but it can be found here: https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486
Upvotes: -1