user3725021

Reputation: 606

Open a CSV file from a website directly in pandas without downloading it to a folder

This website contains an 'Export Data' link, which downloads the contents of the page into a CSV file. The button does not link to the CSV file directly; instead it runs a JavaScript procedure. I want to open the CSV file directly with pandas, rather than downloading it, figuring out the download folder, and then opening it from there. Is this possible?
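For reference, pandas can read a CSV straight from a URL when a direct link exists, so if the export endpoint could be discovered (e.g. in the browser's network tab when clicking the button), this would be a one-liner. A sketch with a made-up URL:

import pandas as pd

# hypothetical direct CSV endpoint -- the real one, if any, would have
# to be found by watching the network requests the button triggers
df = pd.read_csv('http://www.fangraphs.com/export.csv')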

My existing code uses selenium to click the button, although if there is a better way to do that, I'd love to hear it.

from selenium import webdriver

# assign chrome driver path to variable
# (chromepath is a placeholder for the path to the chromedriver executable)
chrome_path = chromepath

# create browser object
driver = webdriver.Chrome(chrome_path)

# assign url variable
url = 'http://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc&team=0&lg=all&players=0&sort=24%2cd'

# navigate to web page
driver.get(url)

# click export data button
driver.find_element_by_link_text("Export Data").click()

# close driver
driver.quit()

Upvotes: 1

Views: 2124

Answers (1)

Nick

Reputation: 394

Just happened to come across this and have a script that should work if you change the URL. Instead of using selenium to download the CSV, BeautifulSoup scrapes the stats table from the page and pandas builds the DataFrame for CSV export.

Just make sure the URL keeps "page=1_100000" at the end so all rows come back on a single page. Let me know if you have any questions.

import requests
from random import choice
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import parse_qs

desktop_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0']

url = "https://www.fangraphs.com/leaders.aspx?pos=np&stats=bat&lg=all&qual=0&type=c,4,6,5,23,9,10,11,13,12,21,22,60,18,35,34,50,40,206,207,208,44,43,46,45,24,26,25,47,41,28,110,191,192,193,194,195,196,197,200&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_100000"

def random_headers():
    return {'User-Agent': choice(desktop_agents),'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

# request the page with a randomized User-Agent and parse it
page_request = requests.get(url, headers=random_headers())
soup = BeautifulSoup(page_request.text, "lxml")

# the stats grid is the 12th <table> on the page (index is site-specific)
table = soup.find_all('table')[11]
data = []
# pull column headings from the fangraphs table
column_headers = []
headingrows = table.find_all('th')
for row in headingrows:
    column_headers.append(row.text.strip())

data.append(column_headers)
table_body = table.find('tbody')
rows = table_body.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols[1:])  # drop the leading rank cell

ID = []

# collect the playerid query string from each player profile link
# (the attribute value must be quoted for current soupsieve versions)
for tag in soup.select('a[href^="statss.aspx?playerid="]'):
    link = tag['href']
    query = parse_qs(link)  # parses the whole href, hence the odd key below
    ID.append(query)

# promote the first scraped row to column headers
df1 = pd.DataFrame(data)
df1 = df1.rename(columns=df1.iloc[0])
df1 = df1.loc[1:].reset_index(drop=True)

df2 = pd.DataFrame(ID)
df2.drop(['position'], axis=1, inplace=True, errors='ignore')
# parse_qs wraps each value in a list; unwrap the playerid string
df2['statss.aspx?playerid'] = df2['statss.aspx?playerid'].str[0]

# stats and player IDs side by side, then export
df3 = pd.concat([df1, df2], axis=1)

df3.to_csv("HittingGA2018.csv")
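Incidentally, since df3 already holds the data in memory, the final to_csv call is optional if the goal is just a DataFrame. And pandas can parse HTML tables on its own, which would shrink most of the scraping above. A sketch reusing page_request from the script, where the table index mirrors the one used above and may shift if the page layout changes:

tables = pd.read_html(page_request.text)  # one DataFrame per <table> on the page
df_alt = tables[11]  # same site-specific index as the manual scrape above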

Upvotes: 1
