Python: Scrapy returning all html following element instead of just html of element

Question

I am having an issue where Scrapy is behaving unexpectedly.

I wrote a simple function months ago that returns a list of items at a given xpath.

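import time
from scrapy.selector import Selector

def get_html(response, path):
    # response must expose .page_source (e.g. a Selenium WebDriver)
    sel = Selector(text=response.page_source)
    time.sleep(.2)
    items = sel.xpath(path).getall()
    return items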

Usage Example:


    <div>Some Text</div>
    <div>Different Text</div>
    <a>Some link</a>

If I wanted to get all of the div elements, I would write this:

get_html(response,'//div')

I expect, and have previously received, this output

['<div>Some Text</div>',
 '<div>Different Text</div>']

However, now when I call this method, I receive this output

['<div>Some Text</div>\n<div>Different Text</div>\n<a>Some link</a>',
 '<div>Different Text</div>\n<a>Some link</a>']

The problem isn't due to a change in the webpage I was scraping: I saved the source code when I originally scraped it, and it is identical to the source code I see on the webpage today. The problem also occurs across multiple websites I've tried to scrape. I'm not sure what the cause is or how to fix it. I need to either fix the problem or replace the function with another one that behaves identically.

I understand there are ways I could split the strings and remove the unwanted data. However, I have used this function in 100+ modules and do not want to risk breaking those by hardcoding a workaround like that. I need to understand why the output of the function has changed when nothing about the source code has changed.

Edit:

Per the comments below, here is exactly what I enter into the console to produce this result. Please let me know how I can begin to diagnose why this is happening if it's not reproducible for others. I am using Spyder 4.2.5, Python 3.8.5, and Scrapy 2.4.1.

In[1]: from scrapy.selector import Selector

In[2]: text = """
        <div>Some Text</div>
        <div>Different Text</div>
        <a>Some link</a>
    """

In[3]: sel = Selector(text=text)

In[4]: items = sel.xpath('//div').getall()

In[5]: items
Out[5]: 
['<div>Some Text</div>\n        <div>Different Text</div>\n        <a>Some link</a>\n    \n',
 '<div>Different Text</div>\n        <a>Some link</a>\n    \n']
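To find out which of many modules are affected without changing get_html itself, one option is to check each returned string for content that follows its first element. The following is a minimal sketch using only the standard library; the helper name has_trailing_content is hypothetical, and it assumes each item is serialized HTML in which every non-void element has a closing tag.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _TrailingContentDetector(HTMLParser):
    """Flags any tag or non-whitespace text after the first element closes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.root_closed = False
        self.trailing = False

    def handle_starttag(self, tag, attrs):
        if self.root_closed:
            self.trailing = True          # a new element after the root ended
        elif tag in VOID_TAGS:
            if self.depth == 0:
                self.root_closed = True   # the root itself is a void element
        else:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.root_closed:
            return
        self.depth -= 1
        if self.depth == 0:
            self.root_closed = True

    def handle_data(self, data):
        if self.root_closed and data.strip():
            self.trailing = True          # stray text after the root ended


def has_trailing_content(fragment):
    """Return True if fragment holds more than one top-level element."""
    parser = _TrailingContentDetector()
    parser.feed(fragment)
    parser.close()
    return parser.trailing
```

For example, [item for item in get_html(response, '//div') if has_trailing_content(item)] should be empty when each item holds only its own element.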

Madison Ashbach · Accepted Answer

The problem appears to be fixed after a fresh install of Anaconda. I'm not sure what caused it in the first place; here's hoping it doesn't happen again.
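Since a reinstall made the problem disappear, it may help to record the installed package versions so a working and a broken environment can be compared later. This is only a sketch: the function name installed_versions is hypothetical, and the default package list assumes the relevant pieces are Scrapy, parsel (which Scrapy's selectors are built on) and lxml (which parsel uses for parsing).

```python
from importlib import metadata


def installed_versions(packages=("Scrapy", "parsel", "lxml")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Saving this output alongside the scraped source makes it possible to tell whether a later change in behavior coincides with a change in one of these packages.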
