Jihlavanka

Reputation: 380

Can this Python web scraper run on a dedicated server?

I run this web scraper on my notebook. It uses Firefox (through Selenium WebDriver) to get the data; it has to actually open Firefox because the data are generated by JavaScript. Can a dedicated server open Firefox and get the data too? I assume dedicated servers have no display, so it will not work. The full script is much longer (152 lines); I pasted only the parts that I think will not work. I believe importing the data into PostgreSQL will be no problem on a dedicated server.

    from selenium import webdriver
    import time
    from bs4 import BeautifulSoup
    import lxml
    import re
    import psycopg2
    import sys

    driver = webdriver.Firefox()
    driver.set_window_position(-9999, -9999)  # move the window off-screen
    driver.get("http://rodos.vsb.cz/Road.aspx?road=D2")

    time.sleep(20)  # wait for the JavaScript to render the data

    html_source = driver.page_source
    soup = BeautifulSoup(html_source, 'lxml')

    # Find the tags that hold speed information (km/h)
    list1 = []  # collected speed values
    for i in soup.find_all("tspan", {"id": re.compile(r"tspan_Label_\w*")}):
        if re.match("^[0-9]+$", i.getText()) is not None:
            if i.parent.get('fill') == '#5f5f5f':
                list1.append(i.getText())

Upvotes: 0

Views: 137

Answers (1)

matyas

Reputation: 2796

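I think what you are looking for is pyvirtualdisplay:

    pip install pyvirtualdisplay

pyvirtualdisplay is a Python wrapper around Xvfb (it can also use Xephyr or Xvnc). It does not emulate a browser: it starts a virtual X display in memory, and a real Firefox runs inside that display, so no physical screen is needed. Xvfb itself has to be installed on the server.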

    import re
    import time

    from bs4 import BeautifulSoup
    from pyvirtualdisplay import Display
    from selenium import webdriver

    # Start a virtual display at 1366 x 768, a common laptop resolution
    display = Display(visible=0, size=(1366, 768))
    display.start()

    # Firefox now runs inside the virtual display
    browser = webdriver.Firefox()

    # Set the width and height of the browser window
    browser.set_window_size(1366, 768)

    # Set the timeouts (in seconds) before loading the page
    browser.set_script_timeout(30)
    browser.set_page_load_timeout(30)

    # Open the URL
    browser.get('http://rodos.vsb.cz/Road.aspx?road=D2')

    time.sleep(20)  # wait for the JavaScript to render the data

    html_source = browser.page_source
    soup = BeautifulSoup(html_source, 'lxml')

    # Find the tags that hold speed information (km/h)
    list1 = []  # collected speed values
    for i in soup.find_all("tspan", {"id": re.compile(r"tspan_Label_\w*")}):
        if re.match("^[0-9]+$", i.getText()) is not None:
            if i.parent.get('fill') == '#5f5f5f':
                list1.append(i.getText())

    # Close the browser and shut down the virtual display
    browser.quit()
    display.stop()
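
If the same script should keep working on both the notebook and the server, one option is to start the virtual display only when the machine has no display of its own. The sketch below is only an illustration under some assumptions: it assumes Linux with X11, where the `DISPLAY` environment variable is normally unset in a headless session, and the helper names `needs_virtual_display` and `start_display_if_needed` are made up for this example, not part of pyvirtualdisplay.

```python
import os


def needs_virtual_display(environ=None):
    """Return True when no X display is configured (assumption: Linux/X11)."""
    env = os.environ if environ is None else environ
    # An unset or empty DISPLAY variable is treated as "no display available".
    return not env.get("DISPLAY")


def start_display_if_needed(size=(1366, 768)):
    """Start a virtual display only on a headless machine.

    Returns the started Display object, or None if a display already exists.
    """
    if not needs_virtual_display():
        return None
    # Imported lazily so the notebook does not need pyvirtualdisplay installed.
    from pyvirtualdisplay import Display

    display = Display(visible=0, size=size)
    display.start()
    return display
```

On the notebook `start_display_if_needed()` would return `None` and Firefox opens as usual; on the server it returns the started display, which should be stopped with `display.stop()` once the browser has been closed.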

Upvotes: 1
