Reputation: 606
This website contains an 'Export Data' link, which downloads the contents of the page into a CSV file. The link does not point to the CSV file itself; instead it runs a JavaScript routine that triggers the download. I want to open the CSV file directly with pandas rather than downloading it, figuring out the download folder, and then opening it from there. Is this possible?
My existing code uses Selenium to click the button, although if there is a better way to do that, I'd love to hear it.
from selenium import webdriver

# assign chrome driver path to variable
chrome_path = chromepath
# create browser object
driver = webdriver.Chrome(chrome_path)
# assign url variable
url = 'http://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc&team=0&lg=all&players=0&sort=24%2cd'
# navigate to web page
driver.get(url)
# click export data button
driver.find_element_by_link_text("Export Data").click()
# close driver
driver.quit()
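One workaround, if redirecting the download is acceptable, is to point Chrome at a known folder and read the file from there with pandas, reusing chrome_path and url from above. This is only a sketch: the prefs key is a standard ChromeOptions setting, but the exported file name below is a placeholder and the fixed sleep is a crude stand-in for a proper download wait.
import os
import time
import pandas as pd
from selenium import webdriver

download_dir = r"C:\fangraphs"   # any folder you control (must already exist)
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {"download.default_directory": download_dir})

driver = webdriver.Chrome(chrome_path, chrome_options=options)
driver.get(url)
driver.find_element_by_link_text("Export Data").click()
time.sleep(5)   # crude wait for the download to finish

# placeholder file name - check the folder for the name the site actually uses
df = pd.read_csv(os.path.join(download_dir, "FanGraphs Leaderboard.csv"))
driver.quit()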
Upvotes: 1
Views: 2124
Reputation: 394
Just happened to come across this and have a script that should work if you change the URL. Instead of using Selenium to download the CSV, BeautifulSoup is used to scrape the table from the page and pandas is used to build the DataFrame(s) for CSV export.
Just make sure the URL has "page=1_100000" at the end so that all rows are returned on one page. Let me know if you have any questions.
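As a quicker alternative, pandas can parse the HTML tables in a single call with read_html (this needs lxml or html5lib installed). A minimal sketch, assuming the site accepts a simple browser-like User-Agent; the [11] index mirrors the one used in the full script below, but it should still be verified by inspection since read_html may not find exactly the same set of tables:
import requests
import pandas as pd

url = "https://www.fangraphs.com/leaders.aspx?pos=np&stats=bat&lg=all&qual=0&type=c,4,6,5,23,9,10,11,13,12,21,22,60,18,35,34,50,40,206,207,208,44,43,46,45,24,26,25,47,41,28,110,191,192,193,194,195,196,197,200&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_100000"
# fetch the page with a browser-like User-Agent, then let pandas parse
# every <table> element it can find in the returned HTML
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
tables = pd.read_html(html)
print(len(tables))                        # locate the stats grid by inspection
tables[11].to_csv("HittingGA2018.csv")    # verify the index before relying on it
The full script below does the same scrape by hand, which also makes it possible to pull the player IDs out of the links: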
import requests
from random import choice
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import parse_qs
desktop_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0']
url = "https://www.fangraphs.com/leaders.aspx?pos=np&stats=bat&lg=all&qual=0&type=c,4,6,5,23,9,10,11,13,12,21,22,60,18,35,34,50,40,206,207,208,44,43,46,45,24,26,25,47,41,28,110,191,192,193,194,195,196,197,200&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_100000"
def random_headers():
    # pick a random desktop User-Agent so the request looks like a normal browser
    return {'User-Agent': choice(desktop_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
df3 = pd.DataFrame()
# get the url with a randomized browser-like header
page_request = requests.get(url, headers=random_headers())
soup = BeautifulSoup(page_request.text, "lxml")
# the stats grid is the twelfth <table> on the page
table = soup.find_all('table')[11]
data = []
# pulls headings from the fangraphs table
column_headers = []
headingrows = table.find_all('th')
for row in headingrows:
    column_headers.append(row.text.strip())
data.append(column_headers)
# pulls each player row, dropping the leading rank cell
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols[1:])
# collect the player IDs from the links to each player's stats page;
# parse_qs on an href like 'statss.aspx?playerid=12345&position=SS'
# returns {'statss.aspx?playerid': ['12345'], 'position': ['SS']}
ID = []
for tag in soup.select('a[href^="statss.aspx?playerid="]'):
    link = tag['href']
    query = parse_qs(link)
    ID.append(query)
# the first scraped row becomes the column names
df1 = pd.DataFrame(data)
df1 = df1.rename(columns=df1.iloc[0])
df1 = df1.loc[1:].reset_index(drop=True)
# keep only the playerid values and drop the position parameter
df2 = pd.DataFrame(ID)
df2.drop(['position'], axis=1, inplace=True, errors='ignore')
df2['statss.aspx?playerid'] = df2['statss.aspx?playerid'].str[0]
# stitch the stats and the IDs together and write the CSV
df3 = pd.concat([df1, df2], axis=1)
df3.to_csv("HittingGA2018.csv")
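From there you can keep working with df3 directly, or read the exported file back into pandas at any time:
# index_col=0 drops the unnamed index column that to_csv wrote out
df = pd.read_csv("HittingGA2018.csv", index_col=0)
print(df.head())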
Upvotes: 1