Download a PDF file using Python Selenium and Firefox Driver in AWS EC2 Service

I'm working on a Python project where part of the functionality involves automatically downloading PDF files from a website using Selenium and Firefox, and then uploading these files to a specific bucket. Selenium saves the PDFs temporarily to the /tmp directory before they are processed further. My development environment runs inside a Docker container, where this setup works perfectly: I can download and find the PDFs in the /tmp directory without any issues using Selenium versions 3.141.0, 3.8.0, or 4.18.1, along with Firefox as the browser and its corresponding driver.

However, when I deploy this application to an AWS EC2 instance, the behavior changes. The application interacts with the website as expected, and neither Selenium nor the application itself throws any errors, but the PDF file that should appear in the /tmp directory is nowhere to be found.

I expected the PDF file to be downloaded to the /tmp directory on the EC2 instance just as it is locally in the Docker container, so that my script could then upload it to the specified bucket. Instead, the file never appears in the directory, even though the application reports a successful click on the download button and no errors are logged for file writing or for Selenium's interaction with Firefox.

This is my current logic:

            login_url = 'https://my-web.com/login/'
            dashboard_url = 'https://my-web.com/pdf-button-download-view'
            unique_dir = os.path.join("/tmp/pdfs", str(uuid.uuid4()))
            os.makedirs(unique_dir, exist_ok=True)

            os.chmod(unique_dir, stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO)

            options = Options()
            options.headless = True
            log_path = "/tmp/geckodriver.log"

            # Firefox Profile for specifying download behavior
            profile = webdriver.FirefoxProfile()
            profile.set_preference("browser.download.folderList", 2)
            profile.set_preference("browser.download.manager.showWhenStarting", False)
            profile.set_preference("browser.download.dir", unique_dir)
            profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")

            driver = webdriver.Firefox(options=options, firefox_profile=profile, service_log_path=log_path)

            driver.get(login_url)
            time.sleep(10)

            # Login
            driver.find_element(By.ID, "username").send_keys("admin")
            driver.find_element(By.ID, "password").send_keys("admin")
            driver.find_element(By.CSS_SELECTOR, "form").submit()
            time.sleep(10)

            # Navigate to the Dashboard
            driver.get(dashboard_url)
            time.sleep(10)

            logging.info(f"Dashboard URL loaded: {driver.current_url}")

            dropdown_trigger = driver.find_element(By.XPATH, "//button[@aria-label='Menu actions trigger']")
            dropdown_trigger.click()

            logging.info("Dropdown trigger clicked.")

            action = ActionChains(driver)

            logging.info("Attempting to find the dropdown item for download.")

            dropdown_item = driver.find_element(By.XPATH, "//div[@title='Download']")
            action.move_to_element(dropdown_item).perform()

            logging.info("The dropdown item was found.")

            logging.info("Attempting to click the 'Export to PDF' button.")

            export_to_pdf_button = WebDriverWait(driver, 3).until(
                EC.element_to_be_clickable((By.XPATH, "//div[@role='button'][contains(text(), 'Export to PDF')]"))
            )

            export_to_pdf_button.click()

            logging.info("Export to PDF button clicked, waiting for file to download.")

            start_time = time.time()
            while True:
                pdf_files = [f for f in os.listdir(unique_dir) if f.endswith(".pdf")]
                if pdf_files:
                    break
                elif time.time() - start_time > 60:  # Wait up to 60 seconds for the file
                    raise Exception("File download timed out.")
                time.sleep(1)

            driver.quit()

            logging.info("Driver quit.")

On the EC2 instance, the PDF never appears in the unique_dir directory. The machine has 21 GB of free disk space. The /tmp/pdfs directory is created successfully, but it is always empty.
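
To narrow this down, the polling loop above could be replaced by a small helper that reports what the directory actually contains when the wait times out. The following is only a sketch: the name wait_for_pdf and its parameters are hypothetical, and it assumes that Firefox keeps a "<name>.part" file next to the target while a download is still in progress.

```python
import os
import time


def wait_for_pdf(download_dir, timeout=60.0, poll=1.0):
    """Return the path of the first finished PDF in download_dir.

    Raises TimeoutError (listing the directory contents) if no finished
    PDF shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(download_dir)
        # Assumption: a ".part" file means Firefox is still writing.
        in_progress = any(name.endswith(".part") for name in names)
        pdfs = sorted(name for name in names if name.lower().endswith(".pdf"))
        if pdfs and not in_progress:
            return os.path.join(download_dir, pdfs[0])
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"No finished PDF in {download_dir!r}; directory contains: {names}"
            )
        time.sleep(poll)
```

Calling wait_for_pdf(unique_dir) after the click would then either return the file path or fail with a message showing whether the directory stayed completely empty or only held a partial download.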

Upvotes: 0

Views: 135

Answers (1)

nameloCmaS

Reputation: 46

Add the following additional preference so that Firefox uses the directory given in browser.download.dir:

profile.set_preference("browser.download.useDownloadDir", True)
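
For context, here is a minimal sketch of how the full set of download preferences might be collected in one place and applied. The helper names firefox_download_prefs and apply_prefs are hypothetical. The "pdfjs.disabled" entry is not part of the question's code; it is an assumption added here because Firefox's built-in PDF viewer can otherwise open the file instead of saving it.

```python
def firefox_download_prefs(download_dir):
    """Build the Firefox preferences that control where downloads are saved."""
    return {
        # 2 = use the custom directory given in browser.download.dir
        "browser.download.folderList": 2,
        # Save to the configured directory instead of asking each time
        "browser.download.useDownloadDir": True,
        "browser.download.dir": download_dir,
        "browser.download.manager.showWhenStarting": False,
        "browser.helperApps.neverAsk.saveToDisk": "application/pdf",
        # Assumption (not in the question): skip the built-in PDF viewer
        "pdfjs.disabled": True,
    }


def apply_prefs(target, prefs):
    """Apply each preference to any object exposing set_preference(name, value)."""
    for name, value in prefs.items():
        target.set_preference(name, value)
    return target
```

The target can be the FirefoxProfile from the question, for example apply_prefs(profile, firefox_download_prefs(unique_dir)). In Selenium 4 the Firefox Options object also has set_preference, so the same call should work with options and the driver can then be created with webdriver.Firefox(options=options).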

Upvotes: 0
