Reputation: 307
So guys, for the past 18 hours I've desperately been trying to find a workaround for a bug in my code, and I think it's time for me to seek some help.
I'm building a web scraper whose goal is to download a page, grab the anchor texts, internal links, and referrer URL, and save the data to a DB. Here's the relevant part of my Scrapy code:
def parse_items(self, response):
    item = InternallinkItem()
    # current URL
    item["current_url"] = response.url
    # get the anchor text and normalize its whitespace
    anchor = response.meta.get('link_text')
    item["anchor_text"] = " ".join(anchor.split())
    # get the referrer URL (the problem is here)
    referring_url = response.request.headers.get('Referer')
    item["referring_url"] = referring_url
    yield item
The technologies I use are Python, Scrapy, and Elasticsearch, all up to date; my dev environment is Windows. When I run the code above, I get this error:
raise TypeError("Unable to serialize %r (type: %s)" % (data, type(data)))
TypeError: Unable to serialize b'https://example.com' (type: <class 'bytes'>)
So, after much trial and error, I was able to track it down and pinpoint the issue: when I remove the part that grabs the referrer URL, everything works just fine. It gets the data I want and saves it to Elasticsearch successfully.
As someone who's fairly new to programming, I have no idea how to proceed.
I tried grabbing the referrer URL some other way; that didn't work. I tried writing my own pipeline instead of using the scrapy-elasticsearch library, but got the same error. I also gave changing the type from bytes to str a shot, and you guessed it, that didn't work either.
Any help would be highly appreciated as I am really stuck here!
EDIT: My settings.py file:
ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500
}
ELASTICSEARCH_SERVERS = ['localhost']
ELASTICSEARCH_INDEX = 'myindex'
ELASTICSEARCH_TYPE = 'internallink'
#ELASTICSEARCH_UNIQ_KEY = ['current_url']
Upvotes: 1
Views: 5411
Reputation: 21
Make sure your installed elasticsearch Python module is compatible with your Elasticsearch server. scrapy-elasticsearch pulls in the 7.x elasticsearch module, but your server might not be that recent. I had the same problem; downgrading the module fixed it.
Docs:
For Elasticsearch 7.0 and later, use the major version 7 (7.x.y) of the library.
For Elasticsearch 6.0 and later, use the major version 6 (6.x.y) of the library.
For Elasticsearch 5.0 and later, use the major version 5 (5.x.y) of the library.
For Elasticsearch 2.0 and later, use the major version 2 (2.x.y) of the library, and so on.
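To see which client major version you actually have installed, a minimal standard-library sketch like this works (the helper name client_major_version is just for illustration; "elasticsearch" is the official Python client package):

```python
from importlib.metadata import version, PackageNotFoundError

def client_major_version(package: str = "elasticsearch"):
    """Return the installed package's major version, or None if not installed."""
    try:
        # version() returns a string like "7.17.9"
        return int(version(package).split(".")[0])
    except PackageNotFoundError:
        return None

print(client_major_version())
```

Compare the printed major version against the table above for your server, and pin the client accordingly (e.g. pip install "elasticsearch>=6,<7" for a 6.x server).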
Upvotes: 0
Reputation: 307
Okay, after consuming 9 cups of coffee and banging my head against the wall for 20 hours, I was able to fix the issue. It's so simple I'm almost ashamed to post it here, but here goes nothing:
When I first got the error yesterday, I tried decoding the referrer like this:
referring_url = response.request.headers.get('Referer')
item["referring_url"] = referring_url.decode('utf-8')
It didn't work... until I changed it to this:
referring_url = response.request.headers.get('Referer').decode('utf-8')
item["referring_url"] = referring_url
I don't know why or how, but it works.
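For what it's worth, the likely reason: Scrapy stores request header values as bytes, and the Elasticsearch client serializes documents to JSON, which accepts str but not bytes. A minimal sketch of the difference (using the stdlib json module to stand in for the client's serializer):

```python
import json

referring_url = b'https://example.com'  # Scrapy header values are bytes

# bytes are not JSON-serializable, which triggers a TypeError
try:
    json.dumps({"referring_url": referring_url})
except TypeError as exc:
    print(exc)

# decoding to str first produces a serializable payload
print(json.dumps({"referring_url": referring_url.decode('utf-8')}))
```

So any variant that decodes the bytes before the item reaches the pipeline should work.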
Huge thanks to @alecxe and @furas for pushing me in the right direction.
Upvotes: 2