Download a file with selenium on Heroku

Question

I am trying to download a file from a link, parse it, and save specific data to my Heroku database. I have set up my Selenium Chrome webdriver and can log in. Normally, when I get the URL, the download starts automatically. I have created a new directory on Heroku for the file to be saved to, but the file does not appear there or anywhere else.

I have tried different ways of setting the download directory and other ways of logging in to the website. The whole flow works locally, but not in production on Heroku.

# importing libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
import datetime
from datetime import timedelta
import os
import json
import csv 

# temporary credentials to later be stored
# as env vars
user = "user"
psw = "pasw"
account = 'account'

# this is the directory to download the file
file_directory = os.path.abspath('files')

# making this directory the default chrome web driver directory
options = webdriver.ChromeOptions()
prefs = {
    "download.default_directory": file_directory
}
options.add_experimental_option('prefs', prefs)
# setting up web driver
# (the chrome_options keyword is deprecated in favour of options)
driver = webdriver.Chrome(options=options)

# logging in to pinterest
url_login = 'https://www.pinterest.com/login/?referrer=home_page'
driver.get(url_login)

# find_element(By.ID, ...) replaces the deprecated find_element_by_id
username = driver.find_element(By.ID, "email")
username.send_keys(user)
password = driver.find_element(By.ID, "password")
password.send_keys(psw)
password.send_keys(Keys.ENTER)

# sleep 20 sec so page loads fully
time.sleep(20)

# collect metrics for yesterday
yesterday = datetime.date.today() - datetime.timedelta(days=1)
yesterday = str(yesterday)

# download link for metrics 
url = "https://analytics.pinterest.com/analytics/profile/" + account + "/export/?application=all&tab=impressions&end_date=" + yesterday + '&start_date=' + yesterday
driver.get(url)

# setting up file identification for pinterest CSV file
date = datetime.date.today() - datetime.timedelta(days=2)
date = str(date)[:10]
file_location = os.path.join(file_directory,'profile-'+account+'-impressions-all-'+date+'.csv')

# opening up file
test_list = []
with open(file_location, newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        test_list.append(row)

# gathering relevant metrics for yesterday
this_list = test_list[1:3]

# re-organizing metrics
this_dict = {}
i=0
while(i



I expect that driver.get(url), with the export URL built above, will download the CSV to the directory I have specified. It does not. I have used heroku run bash and searched the filesystem for the file, but I cannot find it.

UPDATE: I do NOT need to store the file permanently. I need to store it temporarily and parse it. I understand that it will all be lost on a dyno restart.

UPDATE: I have solved this with another method. I passed the cookies and headers to a requests session, using the 'User-Agent' of a Chrome browser on Linux. I then assigned the response to a variable (csv_file = s.get(url)) and split it into an array of lines. I used an empty string and the .join() method to combine the lines into one large string, then parsed that string on the delimiters that normally separate lines and fields in a CSV. I now have the relevant metrics.
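
For anyone following the same route, here is a minimal sketch of that approach, not the exact code I ran. The helper names build_session and parse_csv_text are made up for illustration, and the sketch assumes the cookies come from driver.get_cookies(). Instead of joining and splitting the text by hand, it passes the response text to the standard csv module through io.StringIO, which also handles quoted fields.

```python
import csv
import io


def build_session(selenium_cookies, user_agent):
    """Copy cookies from driver.get_cookies() into a requests session.

    Hypothetical helper: selenium_cookies is a list of dicts with at
    least "name" and "value" keys, as Selenium returns them.
    """
    # Third-party dependency, imported here so that parse_csv_text can be
    # used on its own without requests installed.
    import requests

    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    for cookie in selenium_cookies:
        session.cookies.set(
            cookie["name"], cookie["value"], domain=cookie.get("domain", "")
        )
    return session


def parse_csv_text(text):
    """Parse CSV text held in memory into a list of rows (lists of strings)."""
    # newline="" leaves line endings untouched so csv.reader can handle them.
    return list(csv.reader(io.StringIO(text, newline="")))


# Usage, assuming `driver` is logged in and `url` is the export link:
#   session = build_session(driver.get_cookies(), "<a Chrome-on-Linux UA string>")
#   rows = parse_csv_text(session.get(url).text)
```

Nothing is written to disk this way, so the dyno's filesystem is not involved at all.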

Daniel Roseman · Accepted Answer

What you're missing is that heroku run bash starts a different dyno, which has no access to the filesystem of the dyno that downloaded the file.

It's fine to use the Heroku filesystem as temporary storage for actions within the same process. But if you need to access stored files from a separate process, you should use something else, e.g. S3.
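
As a rough illustration of keeping everything in one process, the sketch below creates a temporary download directory and polls it until a finished CSV appears, so the same script that triggers the download can open the file straight away. The names make_download_dir and wait_for_csv are hypothetical, and the sketch assumes Chrome keeps in-progress downloads under a .crdownload name until they complete.

```python
import glob
import os
import tempfile
import time


def make_download_dir():
    """Create a fresh temporary directory to use as Chrome's download folder."""
    return tempfile.mkdtemp(prefix="downloads-")


def wait_for_csv(directory, timeout=30.0, poll=0.5):
    """Return the path of the first finished .csv in directory, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        # In-progress downloads end in .crdownload, so they do not match
        # the *.csv pattern; a match means the file is complete.
        matches = sorted(glob.glob(os.path.join(directory, "*.csv")))
        if matches:
            return matches[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Usage inside the same script that drives Selenium (names from the question):
#   file_directory = make_download_dir()
#   prefs = {"download.default_directory": file_directory}
#   ... create the driver, log in, driver.get(url) ...
#   file_location = wait_for_csv(file_directory)
#   if file_location is not None:
#       with open(file_location, newline="", encoding="utf-8") as f:
#           ...
```

Polling like this replaces a fixed sleep and avoids opening the file before the download has finished.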
