Scrapy: sanitize URL links

Question

I'm trying to scrape data from a website by following all of its links. The site is badly built: in certain parts of the pages the links contain spaces before and after the URL, so Scrapy follows them and the web server responds with a 301 redirect, creating loops.

I tried to filter the URLs of the links, but nothing works: the result always contains either the blank spaces or a + symbol.

Part of code

def cleanurl(link_text):
    print "original: ", link_text
    print "filter: ", link_text.strip("\s+\t\r\n '\"")
    return link_text.strip("\s+\t\r\n '\"")
    #return " ".join(link_text.strip("\t\r\n '\""))
    #return link_text.replace("\s", "").replace("\t","").replace("\r","").replace("\n","").replace("'","").replace("\"","")

rules = (
    Rule(LinkExtractor(allow=(), deny=(), process_value=cleanurl)),
)

Web code

ON SALE

Output cleanurl

original:  http://www.portshop.com/computers-networking-c_11257/                                ?on_sale=1

filter:  http://www.portshop.com/computers-networking-c_11257/                                ?on_sale=1

I tried regular expressions and other approaches, but I cannot sanitize the URL reliably: it works in some cases and not in others, and the %20 (white space) gets changed to +.

Thanks !

Anto · Accepted Answer

I have solved it. I added the following code to clean the URL and now it works properly. I hope this helps someone else who has the same problem.

def cleanurl(link_text):
    return ''.join(link_text.split())

Thanks everybody !
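
For anyone wondering why the strip() attempt in the question left the URL unchanged while the split/join version works, here is a minimal sketch. It assumes the URL from the question's output (with a shorter run of spaces), and the regex variant cleanurl_re is a hypothetical alternative, not something from the original post.

```python
import re

# Sample value modelled on the question's output (spaces in the middle of the URL)
raw = "http://www.portshop.com/computers-networking-c_11257/      ?on_sale=1"

# str.strip() only trims characters at the two ends of the string, and its
# argument is a set of literal characters, not a regular expression:
# "\\s" here means a backslash and the letter "s", not "any whitespace".
# The spaces in the middle of the URL are therefore left untouched.
stripped = (" " + raw + " ").strip("\\s+\t\r\n '\"")

def cleanurl(link_text):
    # split() with no argument breaks on every run of whitespace, so joining
    # the pieces with an empty string removes whitespace anywhere in the text
    return "".join(link_text.split())

def cleanurl_re(link_text):
    # Hypothetical regex variant: delete every run of whitespace characters
    return re.sub(r"\s+", "", link_text)

print(stripped == raw)   # only the outer padding was removed
print(cleanurl(raw))
print(cleanurl_re(raw))
```

Note that both variants remove every whitespace character, including any that might legitimately belong inside a URL. That seems acceptable here because such spaces should be percent-encoded anyway, but it is worth checking against your own site's links.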
