oba2311

Reputation: 381

Scraping a table from a page using beautifulsoup, table is not found

I've been trying to scrape the table from here but it seems to me that BeautifulSoup doesn't find any table.

I wrote:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv

url = "http://www.payscale.com/college-salary-report/bachelors?page=65" 
r=requests.get(url)
data=r.text

soup=BeautifulSoup(data,'xml')
table=soup.find_all('table')
print table   #prints nothing..

Based on other similar questions, I assume the HTML is broken in some way, but I'm not an expert. I couldn't find an answer in these: (Beautiful soup missing some html table tags), (Extracting a table from a website), (Scraping a table using BeautifulSoup), or even (Python+BeautifulSoup: scraping a particular table from a webpage).

Thanks a bunch!

Upvotes: 3

Views: 3661

Answers (3)

MD. Khairul Basar

Reputation: 5110

You are parsing HTML but you used an XML parser.
You should use soup = BeautifulSoup(data, "html.parser").
Your necessary data is inside a script tag; in fact there is no table tag at all, so you need to search the text inside the scripts.
N.B.: If you are using Python 2.x, use "HTMLParser" instead of "html.parser".
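The difference matters because bs4's 'xml' parser is strict, while real-world HTML is usually not well-formed XML. The same strict-vs-lenient behaviour can be sketched with the standard library alone (a toy fragment, not the PayScale page):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Valid HTML, but not well-formed XML: <br> is never closed.
fragment = "<div>hello<br>world</div>"

# A strict XML parser rejects it outright.
try:
    ET.fromstring(fragment)
    xml_ok = True
except ET.ParseError:
    xml_ok = False
print("XML parser accepted it:", xml_ok)  # False

# A lenient HTML parser walks the same fragment without complaint.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

p = TagCollector()
p.feed(fragment)
print("HTML parser saw tags:", p.tags)  # ['div', 'br']
```

A strict parser that gives up mid-document is one reason find_all('table') can come back empty even when the markup looks fine in a browser.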

Here is the code.

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, "html.parser")
scripts = soup.find_all("script")

file_name = open("table.csv", "w", newline="")
writer = csv.writer(file_name)
list_to_write = []

list_to_write.append(["Rank", "School Name", "School Type",
                      "Early Career Median Pay", "Mid-Career Median Pay",
                      "% High Job Meaning", "% STEM"])

for script in scripts:
    text = script.text
    start = 0
    end = 0
    # The script that embeds the data is far larger than the others.
    if len(text) > 10000:
        while start > -1:
            # Each record begins with a "School Name" key; stop when none remain.
            start = text.find('"School Name":"', start)
            if start == -1:
                break
            start += len('"School Name":"')
            end = text.find('"', start)
            school_name = text[start:end]

            start = text.find('"Early Career Median Pay":"', start)
            start += len('"Early Career Median Pay":"')
            end = text.find('"', start)
            early_pay = text[start:end]

            start = text.find('"Mid-Career Median Pay":"', start)
            start += len('"Mid-Career Median Pay":"')
            end = text.find('"', start)
            mid_pay = text[start:end]

            start = text.find('"Rank":"', start)
            start += len('"Rank":"')
            end = text.find('"', start)
            rank = text[start:end]

            start = text.find('"% High Job Meaning":"', start)
            start += len('"% High Job Meaning":"')
            end = text.find('"', start)
            high_job = text[start:end]

            start = text.find('"School Type":"', start)
            start += len('"School Type":"')
            end = text.find('"', start)
            school_type = text[start:end]

            start = text.find('"% STEM":"', start)
            start += len('"% STEM":"')
            end = text.find('"', start)
            stem = text[start:end]

            list_to_write.append([rank, school_name, school_type,
                                  early_pay, mid_pay, high_job, stem])

writer.writerows(list_to_write)
file_name.close()

This will generate your necessary table as a CSV file. Don't forget to close the file when you are done.

Upvotes: 2

宏杰李

Reputation: 12158

The data is located in a JavaScript variable, so you should find the JS text and use a regex to extract it. Once you have it, the data is a JSON list containing 900+ school dicts, which you can load into a Python list with the json module.

import requests, bs4, re, json
from pprint import pprint

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text
soup = bs4.BeautifulSoup(data, 'lxml')
# Find the script text that defines the JS variable collegeSalaryReportData.
var = soup.find(text=re.compile('collegeSalaryReportData'))
table_text = re.search(r'collegeSalaryReportData = (\[.+\]);\n    var', var, re.DOTALL).group(1)
table_data = json.loads(table_text)
pprint(table_data)
print('The number of school', len(table_data))

out:

 {'% Female': '0.57',
  '% High Job Meaning': 'N/A',
  '% Male': '0.43',
  '% Pell': 'N/A',
  '% STEM': '0.1',
  '% who Recommend School': 'N/A',
  'Division 1 Basketball Classifications': 'Not Division 1 Basketball',
  'Division 1 Football Classifications': 'Not Division 1 Football',
  'Early Career Median Pay': '36200',
  'IPEDS ID': '199643',
  'ImageUrl': '/content/school_logos/Shaw University_50px.png',
  'Mid-Career Median Pay': '45600',
  'Rank': '963',
  'School Name': 'Shaw University',
  'School Sector': 'Private not-for-profit',
  'School Type': 'Private School, Religious',
  'State': 'North Carolina',
  'Undergraduate Enrollment': '1664',
  'Url': '/research/US/School=Shaw_University/Salary',
  'Zip Code': '27601'}]
The number of school 963
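Once the JSON is loaded into a list of dicts like the one above, the standard library's csv.DictWriter can write selected fields to CSV without any manual string slicing. A minimal sketch, assuming table_data has the shape shown (a single stand-in record is used here so the snippet runs on its own):

```python
import csv
import io
import json

# Stand-in for the parsed collegeSalaryReportData list; the real one has 900+ records.
table_data = json.loads("""[
  {"Rank": "963", "School Name": "Shaw University",
   "School Type": "Private School, Religious",
   "Early Career Median Pay": "36200", "Mid-Career Median Pay": "45600"}
]""")

fields = ["Rank", "School Name", "School Type",
          "Early Career Median Pay", "Mid-Career Median Pay"]

buf = io.StringIO()  # swap in open("table.csv", "w", newline="") to write a real file
writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
writer.writeheader()
writer.writerows(table_data)  # keys not listed in fields are silently dropped
print(buf.getvalue())
```

extrasaction="ignore" is what lets you keep only the columns you care about from dicts that carry many more keys.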

Upvotes: 2

metame

Reputation: 2640

While this won't find the table (it isn't in r.text at all), you are also asking BeautifulSoup to use the XML parser instead of html.parser, so I would recommend changing that line to:

soup=BeautifulSoup(data,'html.parser')

One of the issues you will run into with web scraping is "client-rendered" versus server-rendered websites. Basically, this means the page you get from a basic HTTP request (through the requests module, or through curl, for example) is not the same content that would be rendered in a web browser, because the browser runs JavaScript that builds the page after it loads. Two common frameworks for this are React and Angular. If you examine the source of the page you want to scrape, it has data-react-id attributes on several of its HTML elements. A common tell for Angular pages is similar element attributes with the ng prefix, e.g. ng-if or ng-bind. You can view a page's source in Chrome or Firefox through their respective dev tools, launched with Ctrl+Shift+I in either browser. It's worth noting that not all React and Angular pages are exclusively client-rendered.
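The distinction can be sketched with plain string checks (hypothetical HTML stand-ins for what the server returns versus what the browser ends up displaying):

```python
# Hypothetical stand-ins: raw_html is what requests/curl receive from the server;
# rendered_html is what the browser's DOM looks like after JavaScript has run.
raw_html = '<div id="root" data-react-id="1"></div><script>var data = [];</script>'
rendered_html = '<div id="root"><table><tr><td>Shaw University</td></tr></table></div>'

# The table tag only exists after client-side rendering, so a scraper
# working from raw_html will never find it.
print('<table' in raw_html)       # False
print('<table' in rendered_html)  # True
```

This is exactly why find_all('table') returns an empty list here no matter which parser you pick.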

In order to get this sort of content, you would need a browser-automation tool like Selenium, which drives a real (optionally headless) browser so the JavaScript actually runs. There are many resources on web scraping with Selenium and Python.

Upvotes: 2
