jped

Reputation: 486

Beautiful Soup parses in some cases but not in others. Why?

I am using Beautiful Soup to parse some JSON out of an HTML file. Basically, I am using it to get all employee profiles out of a LinkedIn search result. However, for some reason it does not work for companies that have more than 10 employees. Here is my code:

import requests, json
from bs4 import BeautifulSoup
s = requests.session()

def get_csrf_tokens():
    url = "https://www.linkedin.com/"
    req = s.get(url).text

    csrf_token = req.split('name="csrfToken" value="')[1].split('" id="')[0]
    login_csrf_token = req.split('name="loginCsrfParam" value="')[1].split('" id="')[0]

    return csrf_token, login_csrf_token


def login(username, password):
    url = "https://www.linkedin.com/uas/login-submit"
    csrfToken, loginCsrfParam = get_csrf_tokens()

    data = {
        'session_key': username,
        'session_password': password,
        'csrfToken': csrfToken,
        'loginCsrfParam': loginCsrfParam
    }

    req = s.post(url, data=data)
    print "success"

login(USERNAME, PASSWORD)


def get_all_json(company_link):
    r = s.get(company_link)
    html = r.content
    soup = BeautifulSoup(html)
    # dump the raw HTML so it can be inspected later
    html_file = open("html_file.html", 'w')
    html_file.write(html)
    html_file.close()
    json_stuff = soup.find('code', id="voltron_srp_main-content")
    print json_stuff
    return remove_tags(json_stuff)


def remove_tags(p):
    # strip the surrounding <code ...> / </code> markup, leaving the raw JSON
    p = str(p)
    return p[62:-10]

def list_of_employes():
    jsons = get_all_json('https://www.linkedin.com/vsearch/p?f_CC=2409087')
    print jsons
    loaded_json = json.loads(jsons.replace(r'\u002d', '-'))
    employes = loaded_json['content']['page']['voltron_unified_search_json']['search']['results']
    return employes


def get_employee_link(employes):
    profiles = []
    for employee in employes:
        print employee['person']['link_nprofile_view_3']
        profiles.append(employee['person']['link_nprofile_view_3'])
    return profiles, len(profiles)

print get_employee_link(list_of_employes())

It will not work for the URL that is currently in the code; however, it does work for this company search: https://www.linkedin.com/vsearch/p?f_CC=3003796

EDIT: I am pretty sure that this is an error in the get_all_json() function. If you take a look, it does not correctly fetch the JSON for companies with more than 10 employees.
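
For what it's worth, a minimal check (just a sketch, assuming login() has already been called on the session s) shows whether the <code> element is found at all, and how much of it comes back, for the two searches:

for company_id in ('3003796', '2409087'):
    r = s.get('https://www.linkedin.com/vsearch/p?f_CC=' + company_id)
    soup = BeautifulSoup(r.content)
    node = soup.find('code', id="voltron_srp_main-content")
    # prints the company id, whether the element was found, and its length
    print company_id, node is not None, len(str(node)) if node else 0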

Upvotes: 1

Views: 314

Answers (2)

jped

Reputation: 486

Turns out it was a problem with the default BeautifulSoup parser. I changed it to html5lib by doing this:

Install it from the console:

pip install html5lib

Then specify that parser when first creating the soup object:

soup = BeautifulSoup(html, 'html5lib')

This is documented in the BeautifulSoup docs.
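
For context, here is a sketch of the question's get_all_json() with only the parser argument changed (everything else is assumed to stay the same):

def get_all_json(company_link):
    r = s.get(company_link)
    html = r.content
    # html5lib is more tolerant of the large, messy markup,
    # so the <code> element is no longer cut off
    soup = BeautifulSoup(html, 'html5lib')
    json_stuff = soup.find('code', id="voltron_srp_main-content")
    return remove_tags(json_stuff)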

Upvotes: 0

alecxe

Reputation: 473893

This is because the results are paginated. You need to iterate over all pages defined inside the JSON data at:

data['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']

pages is a list; for company 2409087 it is:

[{u'isCurrentPage': True, u'pageNum': 1, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=1'}, 
 {u'isCurrentPage': False, u'pageNum': 2, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=2', u'page_number_i18n': u'Page 2'}, 
 {u'isCurrentPage': False, u'pageNum': 3, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=3', u'page_number_i18n': u'Page 3'}]

This is basically a list of URLs you need to request in order to collect all of the data.

Here's what you need to do (omitting the login code):

def get_results(json_code):
    return json_code['content']['page']['voltron_unified_search_json']['search']['results']

url = "https://www.linkedin.com/vsearch/p?f_CC=2409087"
soup = BeautifulSoup(s.get(url).text)

# first page: parse the embedded JSON and collect its results
code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
json_code = json.loads(code)

results = get_results(json_code)

# the remaining pages are listed in the pagination block of the first page
pages = json_code['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
for page in pages[1:]:
    soup = BeautifulSoup(s.get(page['pageURL']).text)
    code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
    json_code = json.loads(code)
    results += get_results(json_code)

print len(results)

It prints 25 for https://www.linkedin.com/vsearch/p?f_CC=2409087 - exactly how many results you see in the browser.
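
If you also want the profile links rather than just the count, you can run the accumulated results through the same loop as in the question (a sketch; link_nprofile_view_3 is the key taken from the question's code):

profiles = []
for employee in results:
    # each result carries the profile URL under person -> link_nprofile_view_3
    profiles.append(employee['person']['link_nprofile_view_3'])

print profiles
print len(profiles)  # 25 for company 2409087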

Upvotes: 1
