Stuxen

Reputation: 738

Scraping dynamic tweets from twitter using Selenium

This may look like a duplicate question, but I have observed something new in how Twitter behaves.

I previously wrote a Twitter scraper that fetches a given number of tweets by scrolling and waiting for dynamic elements, but it no longer works. It never scrapes more than 10 tweets, and those are only the last 10 of all the tweets I loaded initially by scrolling.

This function is supposed to scrape at least n tweets. Roughly 10 tweets show up at the start, so I scroll the page ceil(n/10) - 1 times to load all n tweets. Then I scrape all the divs with a particular class name.

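import math
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from textblob import TextBlob

def get_n_tweets(n, search_str='Covid 19'):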
    driver = webdriver.Firefox(executable_path='geckodriver.exe')
    driver.get("http://twitter.com/search?q=" + search_str + "&src=typd")

    response = []
    for x in range(math.ceil(n / 10)-1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
    try:
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div[class='css-1dbjc4n r-1iusvr4 r-16y2uox r-1777fci r-5f2r5o r-1mi0q7o']"))
        )

        e_tweets = driver.find_elements(By.CSS_SELECTOR, "div[class='css-1dbjc4n r-1iusvr4 r-16y2uox r-1777fci r-5f2r5o r-1mi0q7o']")

        for e_tweet in e_tweets:
            e_fullname = e_tweet.find_element(By.CSS_SELECTOR, "div>span[class='css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0']")
            e_tweet_text = e_tweet.find_element(By.CSS_SELECTOR, "div[class='css-901oao r-hkyrab r-1qd0xha r-a023e6 r-16dba41 r-ad9z0x r-bcqeeo r-bnwqim r-qvutc0']")
            response.append({'by': e_fullname.text,
                             'tweet': e_tweet_text.text,
                             'score': TextBlob(e_tweet_text.text).sentiment.polarity})            
    finally:
        driver.quit()
    return response

What I tried: I loaded as many tweets as I needed by scrolling to the bottom of the page, scrolled back up to the top, and then fetched the required elements. This raises a StaleElementReferenceException.

I suspect the cause is this: when I scroll down far enough to load the specified number of tweets and then return to the top of the page, the tweets that had loaded previously disappear from the page.

I am looking for a simple, standard way to solve this problem. Any help would be greatly appreciated!

Upvotes: 2

Views: 1859

Answers (1)

emporerblk

Reputation: 1066

I've dealt with this behavior on websites before. Your best way forward would be to take advantage of Selenium's AbstractEventListener and EventFiringWebDriver classes.

You should first implement a TwitterListener class and define its before_execute_script and after_execute_script methods to extract the necessary information from the tweets.

from selenium.webdriver.support.events import AbstractEventListener

class TwitterListener(AbstractEventListener):

    def __init__(self):
        """Data structures to hold tweets go here"""

    def before_execute_script(self, script, driver):
        """Scan DOM for tweets and scrape"""

    def after_execute_script(self, script, driver):
        """Scan DOM for new tweets and scrape"""

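As a rough idea of what the data structure in __init__ could look like, here is a minimal sketch. It assumes each scraped tweet is a dict with 'by' and 'tweet' keys, as in the question's code; the TweetStore name and its methods are hypothetical, not part of Selenium. Because the two hooks will see many of the same tweets on consecutive scrolls, the store keeps each one only once.

```python
class TweetStore:
    """Keeps each tweet once, in the order it was first seen."""

    def __init__(self):
        # Plain dicts preserve insertion order (Python 3.7+).
        self._by_key = {}

    def add_all(self, records):
        """Add records shaped like {'by': ..., 'tweet': ...}.

        Returns how many of them were not already stored.
        """
        added = 0
        for record in records:
            # Assumption: author plus text is a good enough identity for a tweet.
            key = (record['by'], record['tweet'])
            if key not in self._by_key:
                self._by_key[key] = record
                added += 1
        return added

    def __len__(self):
        return len(self._by_key)

    def first(self, n):
        """Return the first n stored tweets."""
        return list(self._by_key.values())[:n]
```

The listener could create one TweetStore in __init__ and call add_all from both hooks with whatever tweets are currently in the DOM.
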
Then, to use this TwitterListener, wrap your driver in an EventFiringWebDriver. It exposes all the WebDriver methods you've come to expect, and the listener's hooks run automatically before and after every execute_script call.

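from selenium import webdriver
from selenium.webdriver.support.events import EventFiringWebDriver
from [separate file] import TwitterListener

driver = EventFiringWebDriver(webdriver.Firefox(executable_path='geckodriver.exe'), TwitterListener())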
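For comparison, the underlying idea (scrape after every scroll, before earlier tweets leave the DOM) can also be sketched as a plain loop without the listener. This is only a sketch under assumptions: collect_while_scrolling and the extract callable are hypothetical names, and extract(driver) is assumed to return the currently rendered tweets as dicts with 'by' and 'tweet' keys.

```python
import time


def collect_while_scrolling(driver, extract, n, max_scrolls=50, pause=2.0):
    """Scrape after every scroll so tweets are captured before they disappear.

    `extract(driver)` is a hypothetical callable returning a list of
    {'by': ..., 'tweet': ...} dicts for the tweets currently in the DOM.
    """
    seen = {}
    for _ in range(max_scrolls):
        # Scrape what is rendered right now; overlapping tweets are kept once.
        for record in extract(driver):
            seen.setdefault((record['by'], record['tweet']), record)
        if len(seen) >= n:
            break
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
    return list(seen.values())[:n]
```

The max_scrolls cap is there so the loop still ends if the page runs out of tweets before n are collected; in that case fewer than n records are returned.
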

Some things to consider for this approach:

  1. Any data processing such as your TextBlob().sentiment.polarity should happen outside of the tweet scraping loop. I'd recommend using some form of multiprocessing for that.

  2. You might want to move any sleep behavior to the TwitterListener class, to ensure you don't invalidate an element before you have scraped it.
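
Point 1 could look roughly like the following: collect the raw tweets first, then score them in a separate pass. This is a sketch, not a tuned implementation. score_tweets and its parameters are hypothetical names, and score_fn stands in for something like a small top-level function wrapping TextBlob(text).sentiment.polarity.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def score_tweets(records, score_fn, executor_cls=ProcessPoolExecutor, workers=2):
    """Return copies of records with a 'score' key computed from each 'tweet'.

    With ProcessPoolExecutor, score_fn must be a picklable top-level function.
    ThreadPoolExecutor can be passed instead to try the function without
    spawning processes.
    """
    texts = [record['tweet'] for record in records]
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order, so scores line up with records.
        scores = list(pool.map(score_fn, texts))
    # Build new dicts so the scraped records stay untouched.
    return [dict(record, score=score) for record, score in zip(records, scores)]
```

When using the default process pool in a script, the call should sit under an if __name__ == "__main__": guard, which is required on platforms that start worker processes by spawning (Windows, for example).
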

Hope this helps!

Upvotes: 1
