Reputation: 11
My problem is that my current program scrapes 1000 games on Steam (including title, review, author, etc.), and this takes 19 minutes (1140 seconds) for 1000 reviews. However, for 100 reviews it takes only 11.5 seconds. My goal is to get 1000 reviews down to 115 seconds, so that each iteration takes the same amount of time (about 0.1 seconds per iteration). My current code is listed below.
for y in range(100):  # 200 best time is 32 seconds / 2000 is 19 min 7 sec
    container = browser.find_element(By.ID, "search_resultsRows")
    urls_needed = container.find_elements_by_xpath("./child::*")[y]
    # links.append(urls_needed[y])
    game_title = browser.find_elements_by_class_name("title")[y].text
    release_date = browser.find_elements_by_css_selector(
        "div.col.search_released.responsive_secondrow"
    )[y].text
    discount = browser.find_elements_by_css_selector(
        "div.col.search_discount.responsive_secondrow"
    )[y].text
    price = browser.find_elements_by_css_selector(
        "div.col.search_price.responsive_secondrow"
    )[y].text
    game_writer.writerow(
        {
            "Title": game_title,
            "Release Date": release_date,
            "Discount": discount,
            "Price": price,
            "URL": urls_needed.get_attribute("href"),
        }
    )
    if y < 100:
        browser.execute_script("window.scrollBy(0, 50);")
The problem is that I use find_elements so that it doesn't scrape the same game 1000 times. I need a way to use find_element inside the loop, so the find_elements list isn't rebuilt on every iteration, while still getting the second, third, and subsequent games in that list.
The link to the page I'm scraping is https://store.steampowered.com/search/?filter=topsellers
EDIT: Beautiful Soup does not work, to my knowledge, since I need to scroll down the page to load all of the content. The page loads about 50 games at a time and must be scrolled to the bottom to load more each time.
Upvotes: 0
Views: 82
Reputation: 2461
You're searching the DOM too much. You only need to do ONE find_elements call: after you're done loading content. Do all your scrolling first, and THEN do the find_elements.
Otherwise you re-search the DOM on every iteration (re-matching elements you've already scraped), so the cost of each search grows with the DOM and the total time blows up.
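A sketch of that ordering (scroll until everything is loaded, then one find_elements and per-row lookups), assuming the same page structure and CSV columns as the question; the selector names are taken from the question's code, so treat them as assumptions. The Selenium calls are kept under __main__:

```python
# Sketch: scroll first, then ONE find_elements, then read each field
# relative to its row (no re-scan of the whole DOM per game).
import csv

def make_record(title, released, discount, price, url):
    # Pure helper mirroring the question's CSV columns.
    return {"Title": title, "Release Date": released,
            "Discount": discount, "Price": price, "URL": url}

if __name__ == "__main__":
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()
    browser.get("https://store.steampowered.com/search/?filter=topsellers")

    # Phase 1: scroll until ~1000 rows exist (the page appends ~50 per load).
    while len(browser.find_elements(By.CSS_SELECTOR,
                                    "#search_resultsRows > a")) < 1000:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # crude wait for the infinite scroll to append rows

    # Phase 2: a single find_elements, then cheap per-row lookups.
    rows = browser.find_elements(By.CSS_SELECTOR, "#search_resultsRows > a")
    with open("games.csv", "w", newline="") as f:
        fields = ["Title", "Release Date", "Discount", "Price", "URL"]
        game_writer = csv.DictWriter(f, fieldnames=fields)
        game_writer.writeheader()
        for row in rows[:1000]:
            game_writer.writerow(make_record(
                row.find_element(By.CLASS_NAME, "title").text,
                row.find_element(By.CSS_SELECTOR, "div.search_released").text,
                row.find_element(By.CSS_SELECTOR, "div.search_discount").text,
                row.find_element(By.CSS_SELECTOR, "div.search_price").text,
                row.get_attribute("href"),
            ))
```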
But really you could just do this with requests and hit the paging URL directly: https://store.steampowered.com/search/results/?query&start=50&count=50&dynamic_data=&sort_by=_ASC&snr=1_7_7_7000_7&filter=topsellers&infinite=1
This response tells you how many total results there are and gives you HTML you can scrape with Beautiful Soup. That eliminates the UI/browser entirely and will be much faster.
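A minimal sketch of the requests approach, assuming the endpoint returns JSON with a "total_count" field and a "results_html" fragment, and assuming the Steam row classes (`a.search_result_row`, `span.title`, `div.search_released`, `div.search_discount`, `div.search_price`) — verify those against the live response before relying on them:

```python
# Fetch the paging endpoint in steps of 50 and parse each HTML fragment,
# mirroring the question's CSV columns. Selector names are assumptions.
import requests
from bs4 import BeautifulSoup

SEARCH_URL = ("https://store.steampowered.com/search/results/"
              "?query&start={start}&count=50&dynamic_data="
              "&sort_by=_ASC&snr=1_7_7_7000_7&filter=topsellers&infinite=1")

def _text(row, selector):
    # Text of the first match, or "" if the element is missing in this row.
    node = row.select_one(selector)
    return node.get_text(strip=True) if node else ""

def parse_results(results_html):
    """Extract one record per result row from a results_html fragment."""
    soup = BeautifulSoup(results_html, "html.parser")
    return [{
        "Title": _text(row, "span.title"),
        "Release Date": _text(row, "div.search_released"),
        "Discount": _text(row, "div.search_discount"),
        "Price": _text(row, "div.search_price"),
        "URL": row.get("href", ""),
    } for row in soup.select("a.search_result_row")]

def fetch_page(start):
    resp = requests.get(SEARCH_URL.format(start=start), timeout=10)
    resp.raise_for_status()
    return resp.json()  # assumed keys: success, results_html, total_count

if __name__ == "__main__":
    games = []
    for start in range(0, 1000, 50):  # 20 requests of 50 games each
        games.extend(parse_results(fetch_page(start)["results_html"]))
    print(len(games))
```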
Upvotes: 1
Reputation: 72
I don't know if this is really what you want, but if you want it faster, one thing you can do is use multithreading: you create several threads, and each thread searches a different range of titles:
import threading

def search_games(index_range):  # avoid shadowing the builtin "range"
    for y in index_range:
        game_title = browser.find_elements_by_class_name("title")[y].text
        release_date = browser.find_elements_by_css_selector(
            "div.col.search_released.responsive_secondrow"
        )[y].text
        discount = browser.find_elements_by_css_selector(
            "div.col.search_discount.responsive_secondrow"
        )[y].text
        price = browser.find_elements_by_css_selector(
            "div.col.search_price.responsive_secondrow"
        )[y].text

# you can create as many threads as you want
job_thread1 = threading.Thread(target=search_games, args=(range1,))
job_thread1.start()
Note: don't forget to pass args as an iterable — a one-element tuple here, hence the trailing comma in (range1,).
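The range-splitting and thread setup can be sketched as below, with a stub worker standing in for search_games (the helper names chunk_ranges and worker are mine, not from any library):

```python
# Split range(total) into contiguous per-thread ranges and run one thread
# per chunk; a lock guards the shared results list.
import threading

def chunk_ranges(total, n_threads):
    """Split range(total) into up to n_threads contiguous ranges."""
    size = (total + n_threads - 1) // n_threads  # ceiling division
    return [range(i, min(i + size, total)) for i in range(0, total, size)]

results = []
lock = threading.Lock()

def worker(index_range):
    local = [y * 2 for y in index_range]  # stand-in for per-index scraping
    with lock:                            # serialize writes to shared state
        results.extend(local)

threads = [threading.Thread(target=worker, args=(r,))
           for r in chunk_ranges(10, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```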
Upvotes: 0