Reputation: 65
I've used Selenium to scrape a dynamic Javascript table of Federal employee position and salary info from http://www.fedsdatacenter.com/federal-pay-rates/index.php?n=&l=&a=SECURITIES+AND+EXCHANGE+COMMISSION&o=&y=all. (Note: It's all public domain data, so no worries re: personal information).
I'm trying to get it into a Pandas DF for analysis. My problem is that my Selenium input data is a list that prints as:
[u'DOE,JON'], [u'14'], [u'SK'], [u'$176,571.00'], [u'$2,000.00'], [u'SECURITIES AND EXCHANGE COMMISSION'], [u'WASHINGTON'], [u'GENERAL ATTORNEY'], [u'2012']], ...
What I want to get to is a DF that handles an arbitrary number of records as:
NAME GRADE SCALE SALARY BONUS AGENCY LOCATION POSITION YEAR
Doe, Jon 14 SK $176,571.00 $2,000.00 SEC DC ATTY 2012
.
.
.
I've tried converting this list to a dictionary, using the zip() function with the col names as a tuple and the data as a list, etc., all to no avail, though it's been a good tour of Python's features. What should the next step be after getting the data or should I be reading the data in a different way?
Currently, the scraper code is:
from selenium import webdriver
path_to_chromedriver = '/Users/xxx/Documents/webdriver/chromedriver' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
url = 'http://www.fedsdatacenter.com/federal-pay-rates/index.php'
browser.get(url)
inputAgency = browser.find_element_by_id('a')
inputYear = browser.find_element_by_id('y')
# Send data
inputAgency.send_keys('SECURITIES AND EXCHANGE COMMISSION')
inputYear.send_keys('All')
# Select 'All' from Years element
browser.find_element_by_css_selector('input[type=\"submit\"]').click()
browser.find_element_by_xpath('//*[@id="example_length"]/label/select/option[4]').click()
SMRtable = browser.find_element_by_id('example')
scrapedData = []
for td in SMRtable.find_elements_by_xpath('.//td'):
scrapedData.append([td.get_attribute('innerHTML')])
print td.get_attribute('innerHTML')
Upvotes: 1
Views: 346
Reputation: 862511
You can use pandas
only.
So first you can check view page source of webpage:
Check lines no. 14807 - 14826:
// data table initialization
$(document).ready(function() {
$('#example').dataTable( {
"bPaginate": true,
"bFilter": false,
"bProcessing": true,
"bServerSide": true,
"aoColumns": [
null,
null,
null,
{ "sType": 'currency' }, // set currency columns to allow sorting
{ "sType": 'currency' }, // set second column to currency to allow sorting
null,
null,
null,
null
],
"sAjaxSource": "output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all"
} );
} );
It means this page use dataTables and data are loaded from ajax source as JSON.
So instead of scrapping html, you can get clean nice json:
output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all
And final link is (instead space
use %20
):
JSON:
{"sEcho":0,"iTotalRecords":"7072900","iTotalDisplayRecords":"19919","aaData":[
["ZUVER,SHAHEEN H","14","SK","$170,960.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"],
["ZUR,MIA C.","14","SK","$164,875.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"],
["ZUNDEL,JENNET LEONG","14","SK","$204,638.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","SAN FRANCISCO","ACCOUNTING","2014"],
["ZUKOWSKI,DAVID W","04","SK","$38,382.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","BOSTON","ADMIN AND OFFICE SUPPORT STUDENT TRAINEE","2014"],
...
So you can parse this json by pandas with read_json
:
import pandas as pd
df = pd.read_json("http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all")
print df.head()
aaData iTotalDisplayRecords \
0 [ZUVER,SHAHEEN H, 14, SK, $170,960.00, $0.00, ... 19919
1 [ZUR,MIA C., 14, SK, $164,875.00, $0.00, SECUR... 19919
2 [ZUNDEL,JENNET LEONG, 14, SK, $204,638.00, $0.... 19919
3 [ZUKOWSKI,DAVID W, 04, SK, $38,382.00, $0.00, ... 19919
4 [ZOU,FAN, 14, SK, $166,650.00, $0.00, SECURITI... 19919
iTotalRecords sEcho
0 7072900 0
1 7072900 0
2 7072900 0
3 7072900 0
4 7072900 0
Then you get new DataFrame from column aaData
- use list comprehension:
df1 = pd.DataFrame([ x for x in df['aaData'] ])
Set columns names:
df1.columns = ['NAME','GRADE','SCALE','SALARY','BONUS','AGENCY','LOCATION','POSITION','YEAR']
print df1.head()
NAME GRADE SCALE SALARY BONUS \
0 ZUVER,SHAHEEN H 14 SK $170,960.00 $0.00
1 ZUR,MIA C. 14 SK $164,875.00 $0.00
2 ZUNDEL,JENNET LEONG 14 SK $204,638.00 $0.00
3 ZUKOWSKI,DAVID W 04 SK $38,382.00 $0.00
4 ZOU,FAN 14 SK $166,650.00 $0.00
AGENCY LOCATION \
0 SECURITIES AND EXCHANGE COMMISSION WASHINGTON
1 SECURITIES AND EXCHANGE COMMISSION WASHINGTON
2 SECURITIES AND EXCHANGE COMMISSION SAN FRANCISCO
3 SECURITIES AND EXCHANGE COMMISSION BOSTON
4 SECURITIES AND EXCHANGE COMMISSION WASHINGTON
POSITION YEAR
0 GENERAL ATTORNEY 2014
1 GENERAL ATTORNEY 2014
2 ACCOUNTING 2014
3 ADMIN AND OFFICE SUPPORT STUDENT TRAINEE 2014
4 INFORMATION TECHNOLOGY MANAGEMENT 2014
Upvotes: 1