cem akbulut

Reputation: 307

My Scrapy/Elasticsearch script returns an "Unable to serialize" error

So guys, for the past 18 hours I've desperately been trying to find a workaround for a bug in my code, and I think it's time for me to seek some help.

I'm building a web scraper whose goal is to download a page, grab the anchor texts, internal links, and referrer URL, and save the data to the DB. Here's the relevant part of my Scrapy code:

def parse_items(self, response):
    item = InternallinkItem()

    # Current URL
    item["current_url"] = response.url

    # get anchor text and clean it
    anchor = response.meta.get('link_text')
    item["anchor_text"] = " ".join(anchor.split())

    # get the referrer URL (Problem is here)
    referring_url = response.request.headers.get('Referer')
    item["referring_url"] = referring_url

    yield item

The technologies I use are Python, Scrapy, and Elasticsearch, all up to date, and my dev environment is Windows. When I run the code above, I'm faced with this error:

raise TypeError("Unable to serialize %r (type: %s)" % (data, type(data)))

TypeError: Unable to serialize b'https://example.com' (type: <class 'bytes'>)

So, after much trial and error, I was able to track down and pinpoint the issue: when I remove the part that grabs the referrer URL, everything works just fine. It gets the data I want and saves it to Elasticsearch successfully.
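For what it's worth, a quick sanity check inside parse_items confirms what the traceback says, since Scrapy hands header values back as raw bytes:

    # Sanity check: Scrapy stores header values as raw bytes, not str
    referrer = response.request.headers.get('Referer')
    print(type(referrer))  # <class 'bytes'>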

As someone who's fairly new to programming, I have no idea how to proceed.

I tried grabbing the referrer URL in other ways, but that didn't work.

I tried writing my own pipeline instead of using the scrapy-elasticsearch library, but got the same error. I also gave converting the type from bytes to str a shot; you guessed it right, that didn't work either.
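For context, the hand-rolled pipeline was roughly along these lines (a reconstructed sketch, the class and index names are just illustrative), and the client's JSON serializer raised the same TypeError on the bytes value:

    # Simplified sketch of a custom pipeline (names illustrative)
    from elasticsearch import Elasticsearch

    class SimpleElasticsearchPipeline:
        def open_spider(self, spider):
            self.es = Elasticsearch(['localhost'])

        def process_item(self, item, spider):
            # index() JSON-serializes the dict; the bytes Referer value
            # triggers the same "Unable to serialize" TypeError
            self.es.index(index='myindex', body=dict(item))
            return item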

Any help would be highly appreciated as I am really stuck here!

EDIT: My settings.py file:

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500
}

ELASTICSEARCH_SERVERS = ['localhost'] 
ELASTICSEARCH_INDEX = 'myindex'
ELASTICSEARCH_TYPE = 'internallink'
#ELASTICSEARCH_UNIQ_KEY = ['current_url']

Upvotes: 1

Views: 5411

Answers (2)

Martin Fredriksson

Reputation: 21

Make sure your installed Elasticsearch module is compatible with the Elasticsearch server.

scrapy-elasticsearch uses v7.x of the elasticsearch module, but your server might not be updated. I had the same problem; downgrading the module fixed it.

Docs:

For Elasticsearch 7.0 and later, use the major version 7 (7.x.y) of the library.

For Elasticsearch 6.0 and later, use the major version 6 (6.x.y) of the library.

For Elasticsearch 5.0 and later, use the major version 5 (5.x.y) of the library.

For Elasticsearch 2.0 and later, use the major version 2 (2.x.y) of the library, and so on.
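If you want to confirm the mismatch before downgrading, something like this should print both versions (assuming the server runs on localhost, as in the question's settings):

    # Compare the installed client version against the server's version
    from importlib.metadata import version  # Python 3.8+
    from elasticsearch import Elasticsearch

    print('client:', version('elasticsearch'))        # e.g. 7.x.y
    es = Elasticsearch(['localhost'])
    print('server:', es.info()['version']['number'])  # e.g. 6.8.x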

Upvotes: 0

cem akbulut

Reputation: 307

Okay, after consuming 9 cups of coffee and banging my head on the wall for 20 hours, I was able to fix the issue. It's so simple I'm almost ashamed to post it here, but here goes nothing:

When I first got the error yesterday, I tried decoding the referrer like this:

    referring_url = response.request.headers.get('Referer')
    item["referring_url"] = referring_url.decode('utf-8')

It didn't work... until I changed it to this:

    referring_url = response.request.headers.get('Referer').decode('utf-8')
    item["referring_url"] = referring_url

I don't know why or how, but it works.
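A note for anyone copying this: headers.get('Referer') returns None when no referrer was sent (on the start URLs, for example), and calling .decode() on None raises an AttributeError, so a guard like this is safer:

    # Guard against a missing Referer header (None on start URLs)
    referring_url = response.request.headers.get('Referer')
    item["referring_url"] = referring_url.decode('utf-8') if referring_url else None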

Huge thanks to @alecxe and @furas for pushing me in the right direction.

Upvotes: 2
