Tiger Strom

Reputation: 83

Python Selenium to Download File to Memory

I have a number of scripts that scrape the web, grab files, then read them using pandas. This procedure must be deployed under a new architecture in which downloading files to disk is not acceptable; instead, the file should be saved in memory and read with pandas from there.

The website doesn't provide a direct link to the file; instead, it provides a button that uses form submission to download it.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


chrome_options = webdriver.ChromeOptions()
prefs = {'download': {'default_directory': OUTPUT_FOLDER}}  # OUTPUT_FOLDER should ideally be a location in memory
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options,service=Service(ChromeDriverManager().install()))

driver.get("https://www.speedchex.com/")
driver = login(driver) 

# Wait until the CSV button is clickable, then click it to start the download.
WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#CSVButton"))).click()
download_wait()  # a function to check if download is finished or not
driver.quit()

download_wait is just a function that checks the directory for any .crdownload files.

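import os
import time

def download_wait():
    """Poll OUTPUT_FOLDER until no .crdownload files remain (at most 200 s)."""
    path_to_downloads = OUTPUT_FOLDER
    seconds = 0
    dl_wait = True
    while dl_wait and seconds < 200:
        time.sleep(1)
        # Chrome keeps a .crdownload file while a download is still in progress
        dl_wait = any(fname.endswith('.crdownload') for fname in os.listdir(path_to_downloads))
        seconds += 1
    return seconds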

The input tag that downloads the file is as follows.

<input name="CSVButton" type="button" id="CSVButton"  onclick="javascript: this.form.OutputType.value = 'CSV'; this.form.submit(); this.form.OutputType.value = 'HTML'; " value="CSV">

Upvotes: 1

Views: 927

Answers (1)

jsbueno

Reputation: 110291

Selenium just passes commands down to the browser, which runs in a different process from your Python program, so the usual approach of creating an object that emulates a file (io.BytesIO) can't work in this case.

Your only approach is to create an in-memory filesystem and set the browser download directory to target it. How to create an in-memory filesystem and where it is located will vary with your operating system, but on Linux it is as easy as sudo mount -t tmpfs -o size=1024m myramdisk <mountpoint> (use subprocess or plain os.system to issue that command). You can even use "/home/user/Downloads" as the mountpoint, and then you won't need to worry about changing any config in the browser.
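
The mount step could be issued from Python roughly as follows. This is a minimal sketch, assuming Linux with sudo available and an already existing mount point; the names build_mount_command and mount_ramdisk are hypothetical helpers, not part of any library.

```python
import subprocess


def build_mount_command(mountpoint, size_mb=1024):
    """Return the tmpfs mount command as an argument list for subprocess."""
    return ["sudo", "mount", "-t", "tmpfs", "-o", f"size={size_mb}m",
            "myramdisk", mountpoint]


def mount_ramdisk(mountpoint, size_mb=1024):
    """Mount an in-memory filesystem; raises CalledProcessError on failure."""
    subprocess.run(build_mount_command(mountpoint, size_mb), check=True)


# Usage (hypothetical mount point; the directory must already exist):
# mount_ramdisk("/home/user/Downloads")
```

Passing the command as a list avoids shell quoting issues if the mount point contains spaces.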

It will work as a normal filesystem for both your browser and your Selenium script: the normal file operation calls will work on it. You just have to arrange to unmount the filesystem upon program exit.
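
Because the mounted directory behaves like any other, the finished download can be read straight into a buffer and handed to pandas. A minimal sketch, assuming the download lands as a .csv file in that directory; load_latest_download is a hypothetical helper name.

```python
import io
import os


def load_latest_download(directory, suffix=".csv"):
    """Read the most recently modified matching file into a BytesIO buffer."""
    candidates = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(suffix)
    ]
    if not candidates:
        raise FileNotFoundError(f"no {suffix} file found in {directory}")
    newest = max(candidates, key=os.path.getmtime)
    with open(newest, "rb") as fh:
        return io.BytesIO(fh.read())


# Usage (assumes pandas is installed and OUTPUT_FOLDER is the ramdisk path):
# df = pandas.read_csv(load_latest_download(OUTPUT_FOLDER))
```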

For that, Python's atexit handler can be useful: https://docs.python.org/3/library/atexit.html
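
Registering the cleanup might look like the sketch below. It assumes the filesystem was mounted with sudo on Linux; unmount_ramdisk and register_unmount are hypothetical helpers, and the run parameter exists only so the command can be swapped out when testing.

```python
import atexit
import subprocess


def unmount_ramdisk(mountpoint, run=subprocess.run):
    """Unmount the in-memory filesystem; a non-zero exit status is not raised."""
    run(["sudo", "umount", mountpoint], check=False)


def register_unmount(mountpoint):
    """Arrange for the ramdisk to be unmounted when the interpreter exits."""
    atexit.register(unmount_ramdisk, mountpoint)


# Usage, right after mounting (hypothetical mount point):
# register_unmount("/home/user/Downloads")
```

Note that atexit handlers do not run if the program is killed by a signal Python does not handle, so the mount can outlive the script in that case.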

Upvotes: 1
