tessa

Palantir Foundry sidecar transform timing out

I am trying to build a proof of concept for web scraping with Selenium in Foundry. The requirement is to scrape potentially hundreds to thousands of websites. I realize this is probably not ideal, but I was given this requirement and intend to make it work unless someone from Palantir can tell me why it won't and that this is a terrible idea. Beautiful Soup works fine in regular transforms, but we need something that can scrape dynamically generated content. Because Selenium needs a browser, I don't see how to run it other than in a container via a sidecar transform. If there is a better way, please let me know.

My sidecar transform build is timing out, and I don't know what the issue is or how to begin troubleshooting it. The sidecar transform runs for 20-30 minutes, then fails with the following error:

[module version: 1.1132.0]

Spark module 'ri.spark-module-manager.main.spark-module.e3afe96c-4d51-44a4-a687-1174dfba2fb4' died while job 'ri.foundry.main.job.f00598ef-e11e-43c4-82bf-ac63d057294a' was using it. (ExitReason: MODULE_UNREACHABLE)

Module exit details: Module became unreachable after registration. This likely indicates the module has died. Module became unreachable for an unknown reason.

Here is the sidecar transform, mostly just copied from the documentation with an egress policy added for the one test website we're scraping:

from transforms.api import transform, Input, Output, configure
from transforms.sidecar import sidecar, Volume
from myproject.datasets.utils import copy_files_to_shared_directory, copy_output_files
from myproject.datasets.utils import copy_start_flag, wait_for_done_flag, copy_close_flag, launch_udf_once
from transforms.external.systems import use_external_systems, EgressPolicy, Credential


@use_external_systems(
    egress=EgressPolicy('{POLICY RID}')
)
@configure([
    "NUM_EXECUTORS_64",
    "EXECUTOR_MEMORY_LARGE", "EXECUTOR_MEMORY_OVERHEAD_LARGE",
    "DRIVER_MEMORY_EXTRA_EXTRA_LARGE", "DRIVER_MEMORY_OVERHEAD_EXTRA_LARGE",
])
@sidecar(image='{PACKAGE_NAME}', tag='0.3', volumes=[Volume("shared")])
@transform(
    output=Output("OUTPUT"),
)
def compute(ctx, output, egress):

    def user_defined_function(row):
        # Copy files from source to shared directory.
        # copy_files_to_shared_directory(source)
        # Send the start flag so the container knows it has all the input files
        copy_start_flag()
        # Iterate till the stop flag is written or we hit the max time limit
        wait_for_done_flag()
        # Copy out output files from the container to an output dataset
        output_fnames = [
            "start_flag",
            # "outfile.csv",
            "logfile",
            "done_flag",
        ]
        copy_output_files(output, output_fnames)
        # Write the close flag so the container knows you have extracted the data
        copy_close_flag()
        # The user defined function must return something
        return (row.ExecutionID, "success")
    # This spawns one task, which maps to one executor, and launches one "sidecar container"
    launch_udf_once(ctx, user_defined_function)
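
The flag helpers imported from myproject.datasets.utils are not shown above. For context, here is a minimal sketch of what a timed wait such as wait_for_done_flag might look like. The name wait_for_flag, its parameters, and the default timeout are assumptions for illustration, not the implementation from the documentation:

```python
import os
import time


def wait_for_flag(path, timeout_s=600.0, poll_s=1.0):
    """Poll until `path` exists. Return True if it appeared, False on timeout.

    Hypothetical helper: the name, parameters, and defaults are illustrative.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    # One last check so a flag written during the final sleep is not missed.
    return os.path.exists(path)
```

Returning False on timeout, rather than looping forever, lets the caller raise an error of its own instead of waiting for the module to be killed.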

Dockerfile:

FROM --platform=linux/amd64 python:3.9-buster

RUN mkdir /code

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1

# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

# please review all the latest versions here:
# https://googlechromelabs.github.io/chrome-for-testing/
ENV CHROMEDRIVER_VERSION=123.0.6312.122

### install chrome
# https://storage.googleapis.com/chrome-for-testing-public/123.0.6312.122/linux64/chrome-linux64.zip
RUN apt-get update && apt-get install -y wget && apt-get install -y zip
# RUN wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
# RUN apt-get install -y ./google-chrome-stable_current_amd64.deb
COPY google-chrome-stable_current_amd64.deb .
RUN apt-get install -y ./google-chrome-stable_current_amd64.deb

### install chromedriver
# RUN wget https://storage.googleapis.com/chrome-for-testing-public/123.0.6312.122/linux64/chromedriver-linux64.zip \
#   && unzip chromedriver-linux64.zip && rm -dfr chromedriver_linux64.zip \
#   && mv /chromedriver-linux64/chromedriver /usr/bin/chromedriver \
#   && chmod +x /usr/bin/chromedriver
COPY chromedriver-linux64.zip .
# The archive name uses a hyphen (chromedriver-linux64.zip), so remove that exact file
RUN unzip chromedriver-linux64.zip && rm -f chromedriver-linux64.zip \
  && mv /chromedriver-linux64/chromedriver /usr/bin/chromedriver \
  && chmod +x /usr/bin/chromedriver

# set display port to avoid crash
ENV DISPLAY=:99

# install selenium
RUN pip install selenium==4.3.0

ADD entrypoint.py /usr/bin/
ADD scraper.py /usr/bin/
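# Mark the scripts themselves as executable rather than the /usr/bin/ directory
RUN chmod +x /usr/bin/entrypoint.py /usr/bin/scraper.py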

RUN mkdir -p /opt/palantir/sidecars/shared-volumes/shared/
RUN chown 5001 /opt/palantir/sidecars/shared-volumes/shared/
ENV SHARED_DIR=/opt/palantir/sidecars/shared-volumes/shared

USER 5001

CMD ["/usr/bin/entrypoint.py"]
ENTRYPOINT ["python"]

entrypoint.py, also mostly copied from the documentation:

import os
import time
import subprocess
from datetime import datetime

def run_process():
    "Run scraper.py and return (returncode, combined stdout/stderr as text)"
    p = subprocess.Popen(["python", "/usr/bin/scraper.py"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    # stderr is merged into stdout above, so the second value from communicate() is always None
    out, _ = p.communicate()
    return (p.returncode, out)

# debug
'''
item = run_process()
my_string = f"{datetime.utcnow().isoformat()}: {item}"
print(my_string)
'''

start_flag_fname = "/opt/palantir/sidecars/shared-volumes/shared/start_flag"
done_flag_fname = "/opt/palantir/sidecars/shared-volumes/shared/done_flag"
close_flag_fname = "/opt/palantir/sidecars/shared-volumes/shared/close_flag"

# Wait for start flag
print(f"{datetime.utcnow().isoformat()}: waiting for start flag")
while not os.path.exists(start_flag_fname):
    time.sleep(1)
print(f"{datetime.utcnow().isoformat()}: start flag detected")

# Execute model, logging output to file
with open("/opt/palantir/sidecars/shared-volumes/shared/logfile", "w") as logfile:
    item = run_process()
    my_string = f"{datetime.utcnow().isoformat()}: {item}"
    print(my_string)
    logfile.write(my_string)
    logfile.flush()
print(f"{datetime.utcnow().isoformat()}: execution finished writing output file")

# Write out the done flag (close the handle right away so the empty file is flushed to the shared volume)
open(done_flag_fname, "w").close()
print(f"{datetime.utcnow().isoformat()}: done flag file written")

# Wait for close flag before allowing the script to finish
while not os.path.exists(close_flag_fname):
    time.sleep(1)
print(f"{datetime.utcnow().isoformat()}: close flag detected. shutting down")

scraper.py, basic Selenium run test:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Define options for running the chromedriver
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-dev-shm-usage")

# Initialize a new chrome driver instance
driver = webdriver.Chrome(options=chrome_options)

driver.get('{WEBSITE}')
print(driver.page_source)

driver.quit()


Answers (1)

tessa

This was user error: one of our forward deployed engineers did a quick code review and noticed I wasn't copying the scraper.py file into the container. I have updated the Dockerfile in the original question. The container now runs in Foundry, and I'm getting Selenium-related errors, which is a promising sign that this will work. I'm still very interested to know whether anyone else has a large-scale web scraping setup in Foundry, or to hear any thoughts on this from Palantir engineers. Thank you!
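
In hindsight, a small preflight check at the top of entrypoint.py might have surfaced this much sooner. The following is only a sketch: missing_files is a hypothetical helper of my own, not part of Foundry or the documentation example, and the paths you pass in would be whatever your Dockerfile is supposed to copy (here, /usr/bin/scraper.py and /usr/bin/chromedriver).

```python
import os


def missing_files(paths):
    """Return the entries of `paths` that do not exist as regular files.

    Hypothetical helper for illustration only.
    """
    return [p for p in paths if not os.path.isfile(p)]
```

Calling missing_files(["/usr/bin/scraper.py", "/usr/bin/chromedriver"]) before waiting for the start flag, and printing any result to the container log, would make a file missing from the image visible immediately instead of after a long wait.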
