Beautiful Soup parses in some cases but not in others. Why?

Question

I am using Beautiful Soup to parse some JSON out of an HTML file. Basically, I am using it to get all employee profiles out of a LinkedIn search result. However, it does not work for companies that have more than 10 employees. Here is my code:

import requests, json
from bs4 import BeautifulSoup
s = requests.session()

def get_csrf_tokens():
    url = "https://www.linkedin.com/"
    req = s.get(url).text

    csrf_token = req.split('name="csrfToken" value="')[1].split('" id="')[0]
    login_csrf_token = req.split('name="loginCsrfParam" value="')[1].split('" id="')[0]

    return csrf_token, login_csrf_token


def login(username, password):
    url = "https://www.linkedin.com/uas/login-submit"
    csrfToken, loginCsrfParam = get_csrf_tokens()

    data = {
        'session_key': username,
        'session_password': password,
        'csrfToken': csrfToken,
        'loginCsrfParam': loginCsrfParam
    }

    req = s.post(url, data=data)
    print "success"

login(USERNAME, PASSWORD)
def get_all_json(company_link):
    r=s.get(company_link)
    html= r.content
    soup=BeautifulSoup(html)
    html_file= open("html_file.html", 'w')
    html_file.write(html)
    html_file.close()
    Json_stuff=soup.find('code', id="voltron_srp_main-content")
    print Json_stuff
    return remove_tags(Json_stuff)
def remove_tags(p):
    p=str(p)
    return p[62: -10]

def list_of_employes():
    jsons=get_all_json('https://www.linkedin.com/vsearch/p?f_CC=2409087')
    print jsons
    loaded_json=json.loads(jsons.replace(r'\u002d', '-'))
    employes=loaded_json['content']['page']['voltron_unified_search_json']['search']['results']
    return employes
def get_employee_link(employes):
    profiles=[]
    for employee in employes:
        print employee['person']['link_nprofile_view_3']
        profiles.append(employee['person']['link_nprofile_view_3'])
    return profiles , len(profiles)

print get_employee_link(list_of_employes())

It does not work for the link used in the code above; however, it does work for this company search: https://www.linkedin.com/vsearch/p?f_CC=3003796

EDIT: I am pretty sure that this is an error with the get_all_json() function. If you take a look, it does not correctly fetch the JSON for companies with more than 10 employees.

alecxe · Accepted Answer

This is because the results are paginated. You need to iterate over all the pages defined inside the JSON data at:

data['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']

pages is a list; for company 2409087 it is:

[{u'isCurrentPage': True, u'pageNum': 1, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=1'}, 
 {u'isCurrentPage': False, u'pageNum': 2, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=2', u'page_number_i18n': u'Page 2'}, 
 {u'isCurrentPage': False, u'pageNum': 3, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=3', u'page_number_i18n': u'Page 3'}]

This is basically a list of URLs that you need to request one by one, collecting the results from each.
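As a minimal sketch of that step, the URLs still to be fetched can be picked out of the pages list like this. The helper name remaining_page_urls is hypothetical, and the sample list is a trimmed copy of the output shown above, not live data.

```python
def remaining_page_urls(pages):
    """Return the URL of every page except the one already loaded."""
    return [page['pageURL'] for page in pages if not page['isCurrentPage']]


# Sample data shaped like the 'pages' list above (extra keys omitted).
pages = [
    {'isCurrentPage': True, 'pageNum': 1,
     'pageURL': 'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=1'},
    {'isCurrentPage': False, 'pageNum': 2,
     'pageURL': 'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=2'},
    {'isCurrentPage': False, 'pageNum': 3,
     'pageURL': 'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=3'},
]

urls = remaining_page_urls(pages)
```

Filtering on isCurrentPage rather than slicing with pages[1:] is a design choice for this sketch: it does not rely on the current page being first in the list.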

Here's what you need to do (omitting the code for login):

def get_results(json_code):
    return json_code['content']['page']['voltron_unified_search_json']['search']['results']

url = "https://www.linkedin.com/vsearch/p?f_CC=2409087"
soup = BeautifulSoup(s.get(url).text)

code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
json_code = json.loads(code)

results = get_results(json_code)

pages = json_code['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
for page in pages[1:]:
    soup = BeautifulSoup(s.get(page['pageURL']).text)
    code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
    json_code = json.loads(code)
    results += get_results(json_code)

print len(results)

It prints 25 for https://www.linkedin.com/vsearch/p?f_CC=2409087, which is exactly how many results you see in the browser.
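If you want to reuse the loop above, one possible way to package it is sketched below. The names dig, collect_all_results, and fetch_json are hypothetical; fetch_json stands in for "request the URL, find the code element, and json.loads its contents", and it is replaced by a dictionary lookup here so the sketch runs without network access or a LinkedIn login.

```python
# Key path to the search block, taken from the JSON structure used above.
SEARCH_PATH = ('content', 'page', 'voltron_unified_search_json', 'search')


def dig(data, keys):
    """Follow a sequence of dictionary keys and return the nested value."""
    for key in keys:
        data = data[key]
    return data


def collect_all_results(first_url, fetch_json):
    """Gather results from the first page and every other listed page.

    fetch_json(url) is assumed to return the parsed JSON dict for that URL.
    """
    search = dig(fetch_json(first_url), SEARCH_PATH)
    results = list(search['results'])
    for page in search['baseData']['resultPagination']['pages']:
        if page['isCurrentPage']:
            continue  # already collected from the first request
        results.extend(dig(fetch_json(page['pageURL']), SEARCH_PATH)['results'])
    return results


def make_page(results, pages):
    """Build a fake response with the same nesting as the real JSON."""
    return {'content': {'page': {'voltron_unified_search_json': {'search': {
        'results': results,
        'baseData': {'resultPagination': {'pages': pages}}}}}}}


# Made-up data: two pages with three results in total.
fake_pages = [
    {'isCurrentPage': True, 'pageNum': 1, 'pageURL': 'page1'},
    {'isCurrentPage': False, 'pageNum': 2, 'pageURL': 'page2'},
]
fake_site = {
    'page1': make_page(['a', 'b'], fake_pages),
    'page2': make_page(['c'], fake_pages),
}

all_results = collect_all_results('page1', fake_site.__getitem__)
```

Passing the fetching step in as a function keeps the pagination logic separate from the scraping details, so it can be checked against made-up data like the above before pointing it at the real session.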
