Reputation: 83
I have a number of scripts that scrape the web, download files, and then read them using pandas. This procedure must be deployed under a new architecture in which saving files to disk is not acceptable; instead, the file should be kept in memory and read with pandas from there.
The website doesn't provide a direct link to the file; instead, it provides a button that uses form submission to download it.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
chrome_options = webdriver.ChromeOptions()
# Placeholder: this should point to a location in memory rather than on disk.
prefs = {'download': {'default_directory': ...}}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options, service=Service(ChromeDriverManager().install()))
driver.get("https://www.speedchex.com/")
driver = login(driver)
WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#CSVButton")))
driver.find_element(By.CSS_SELECTOR, '#CSVButton').click()  # This button downloads the file.
download_wait()  # a function that checks whether the download has finished
driver.quit()
download_wait is just a function that checks the directory for any .crdownload files.
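import os
import time

def download_wait():
    # Poll the download folder until no partial (.crdownload) files remain,
    # giving up after 200 seconds.
    path_to_downloads = OUTPUT_FOLDER
    seconds = 0
    dl_wait = True
    while dl_wait and seconds < 200:
        time.sleep(1)
        dl_wait = False
        for fname in os.listdir(path_to_downloads):
            if fname.endswith('.crdownload'):
                dl_wait = True
        seconds += 1
    return seconds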
The input tag that downloads the file is as follows.
<input name="CSVButton" type="button" id="CSVButton" onclick="javascript: this.form.OutputType.value = 'CSV'; this.form.submit(); this.form.OutputType.value = 'HTML'; " value="CSV">
Upvotes: 1
Views: 927
Reputation: 110291
Selenium just passes commands to the browser, which runs in a different process from your Python program, so the usual approach of creating an object that emulates a file (io.BytesIO) can't work in this case.
Your only option is to create an in-memory filesystem and set the browser's download directory to point at it. How to create an in-memory filesystem, and where it is located, varies with your operating system, but on Linux it is as easy as sudo mount -t tmpfs -o size=1024m myramdisk <mountpoint> (use subprocess or plain os.system to issue that command). You can even use "/home/user/Downloads" as the mountpoint, and then you won't need to worry about changing any config in the browser.
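As a rough sketch, the mount command could be issued from Python with subprocess along these lines. The helper names and the mount point /mnt/myramdisk are hypothetical, and the sketch assumes Linux, an existing mount point directory, and a user who is allowed to run sudo mount without an interactive prompt.

```python
import subprocess


def build_mount_command(mountpoint, size_mb=1024):
    """Return the argv list that mounts a tmpfs of size_mb MiB at mountpoint."""
    return [
        "sudo", "mount", "-t", "tmpfs",
        "-o", f"size={size_mb}m",
        "myramdisk", mountpoint,
    ]


def mount_ramdisk(mountpoint, size_mb=1024):
    """Mount the tmpfs; raises CalledProcessError if the mount command fails."""
    subprocess.run(build_mount_command(mountpoint, size_mb), check=True)


# Hypothetical usage (the directory must already exist):
# mount_ramdisk("/mnt/myramdisk")
```

Passing the command as a list avoids shell quoting problems if the mount point contains spaces.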
It will work as a normal filesystem for both your browser and your Selenium script, so the normal file operation calls will work on it. You just have to arrange to unmount the filesystem when the program exits.
For that, Python's atexit handler can be useful: https://docs.python.org/3/library/atexit.html
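A minimal sketch of the cleanup and of reading the downloaded file back with normal file operations might look like the following. The function names and the ".csv" suffix are assumptions made for illustration, and the unmount step assumes the same sudo setup that was used to mount the filesystem.

```python
import atexit
import io
import os
import subprocess


def register_unmount(mountpoint):
    """Arrange for the tmpfs at mountpoint to be unmounted at interpreter exit."""
    command = ["sudo", "umount", mountpoint]
    # check=False: a failed unmount should not raise during shutdown.
    atexit.register(subprocess.run, command, check=False)
    return command


def read_download(directory, suffix=".csv"):
    """Return the most recently modified file ending in suffix as a BytesIO."""
    candidates = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(suffix)
    ]
    if not candidates:
        raise FileNotFoundError(f"no {suffix} file found in {directory}")
    newest = max(candidates, key=os.path.getmtime)
    with open(newest, "rb") as handle:
        return io.BytesIO(handle.read())
```

The buffer returned by read_download can be passed straight to pandas, for example pd.read_csv(read_download("/mnt/myramdisk")), since read_csv accepts file-like objects. Reading the file directly from the mounted path with pd.read_csv works just as well.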
Upvotes: 1