Reputation: 486
I am using Beautiful Soup to parse some JSON out of an HTML file. Basically I am using to get all employee profiles out of a LinkedIn search result. However, for some reason it does not work with companies that have more than 10 employees for some reason. Here is my code
import requests, json
from bs4 import BeautifulSoup
s = requests.session()
def get_csrf_tokens():
url = "https://www.linkedin.com/"
req = s.get(url).text
csrf_token = req.split('name="csrfToken" value=')[1].split('" id="')[0]
login_csrf_token = req.split('name="loginCsrfParam" value="')[1].split('" id="')[0]
return csrf_token, login_csrf_token
def login(username, password):
url = "https://www.linkedin.com/uas/login-submit"
csrfToken, loginCsrfParam = get_csrf_tokens()
data = {
'session_key': username,
'session_password': password,
'csrfToken': csrfToken,
'loginCsrfParam': loginCsrfParam
}
req = s.post(url, data=data)
print "success"
login(USERNAME PASSWORD)
def get_all_json(company_link):
r=s.get(company_link)
html= r.content
soup=BeautifulSoup(html)
html_file= open("html_file.html", 'w')
html_file.write(html)
html_file.close()
Json_stuff=soup.find('code', id="voltron_srp_main-content")
print Json_stuff
return remove_tags(Json_stuff)
def remove_tags(p):
p=str(p)
return p[62: -10]
def list_of_employes():
jsons=get_all_json('https://www.linkedin.com/vsearch/p?f_CC=2409087')
print jsons
loaded_json=json.loads(jsons.replace(r'\u002d', '-'))
employes=loaded_json['content']['page']['voltron_unified_search_json']['search']['results']
return employes
def get_employee_link(employes):
profiles=[]
for employee in employes:
print employee['person']['link_nprofile_view_3']
profiles.append(employee['person']['link_nprofile_view_3'])
return profiles , len(profiles)
print get_employee_link(list_of_employes())
It will not work for the link that is in place; however it will work for this company search: https://www.linkedin.com/vsearch/p?f_CC=3003796
EDIT: I am pretty sure that this is an error with the get_all_json() function. If you take a look, it does not correctly fetch the JSON for companies with more than 10 employees.
Upvotes: 1
Views: 314
Reputation: 486
Turns out it was a problem with the default BeautifulSoup parser. I changed it to html5lib by doing this:
Install in the console
pip install html5lib
And change the type of parser you choose when first creating the soup object.
soup = BeautifulSoup(html, 'html5lib')
This is documented in the BeautifulSoup docs here
Upvotes: 0
Reputation: 473893
This is because the results are paginated. You need get over all pages defined inside the json data at:
data['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
pages
is a list, for the company 2409087
it is:
[{u'isCurrentPage': True, u'pageNum': 1, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=1'},
{u'isCurrentPage': False, u'pageNum': 2, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=2', u'page_number_i18n': u'Page 2'},
{u'isCurrentPage': False, u'pageNum': 3, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=3', u'page_number_i18n': u'Page 3'}]
This is basically a list of URLs you need to get over and get the data.
Here's what you need to do (ommiting the code for login):
def get_results(json_code):
return json_code['content']['page']['voltron_unified_search_json']['search']['results']
url = "https://www.linkedin.com/vsearch/p?f_CC=2409087"
soup = BeautifulSoup(s.get(url).text)
code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
json_code = json.loads(code)
results = get_results(json_code)
pages = json_code['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
for page in pages[1:]:
soup = BeautifulSoup(s.get(page['pageURL']).text)
code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
json_code = json.loads(code)
results += get_results(json_code)
print len(results)
It prints 25
for https://www.linkedin.com/vsearch/p?f_CC=2409087 - exactly how much you see in browser.
Upvotes: 1