Anto

Reputation: 313

Scrapy sanitize url links

I'm trying to scrape a website by following all of its links. The site is badly built: in some parts of the pages the links contain whitespace before and after the URL, so Scrapy follows them and the web server answers with a 301 redirect, which creates loops.

I tried to filter the link URLs, but nothing works: the result always still contains the spaces or a + symbol.

Relevant part of the code

def cleanurl(link_text):
    print "original: ", link_text
    print "filter: ", link_text.strip("\s+\t\r\n '\"")
    return link_text.strip("\s+\t\r\n '\"")
    #return " ".join(link_text.strip("\t\r\n '\""))
    #return link_text.replace("\s", "").replace("\t","").replace("\r","").replace("\n","").replace("'","").replace("\"","")

rules = (
    Rule (LinkExtractor(allow=(), deny=(), process_value= cleanurl)),
)

HTML from the page

<a  href=
                            "                                ?on_sale=1
                            "
                       class="selectBox">ON SALE
                    </a>

Output of cleanurl

original:  http://www.portshop.com/computers-networking-c_11257/                                ?on_sale=1

filter:  http://www.portshop.com/computers-networking-c_11257/                                ?on_sale=1

I tried regular expressions and other approaches, but I cannot sanitize the URL reliably: it works in some cases and not in others, and the %20 (whitespace) gets changed to +.

Thanks !

Upvotes: 0

Views: 498

Answers (2)

Anto

Reputation: 313

I have solved it: I added the following code to clean the URL, and it is now working properly. I hope it helps someone else who has the same problem.

def cleanurl(link_text):
    return ''.join(link_text.split())
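For anyone wondering why the `strip` attempts in the question had no visible effect while this works: `str.strip` only removes characters from the two ends of the string, and its argument is a set of characters rather than a regular expression, so the run of spaces in the middle of the URL is never touched. A minimal sketch, using a URL shaped like the one in the question (the helper name `remove_whitespace` and the exact amount of padding are only for illustration):

```python
def remove_whitespace(link_text):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines); joining with '' drops all of it.
    return ''.join(link_text.split())


# URL as the link extractor sees it: padding before the query string
# and a newline plus more spaces at the end.
url = ("http://www.portshop.com/computers-networking-c_11257/"
       + " " * 32 + "?on_sale=1\n" + " " * 28)

# strip() trims only the ends, so the inner spaces survive.
stripped = url.strip()
cleaned = remove_whitespace(url)

# The argument of strip() is a character set: r"\s" means
# "backslash or the letter s", not "any whitespace".
trimmed_word = "sales".strip(r"\s")

print(repr(stripped))
print(cleaned)
print(trimmed_word)
```

Note that this removes every whitespace character, which is fine here only on the assumption that the site's URLs never contain a legitimate space.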

Thanks everybody !

Upvotes: 0

Done Data Solutions

Reputation: 2286

You mention "%20" and "+" as being part of the URLs, so I suspect these URLs are URL-encoded.

So before stripping any whitespace from them, you need to URL-decode them:

Using Python 3:

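import urllib.parse

def cleanurl(link_text):
    print("original: %s" % link_text)
    # Decode percent-escapes first, e.g. "%20" becomes a space.
    link_text = urllib.parse.unquote(link_text)
    # strip() takes a set of characters, not a regex, so "\s" is not whitespace here.
    cleaned = link_text.strip(" +\t\r\n'\"")
    print("filter: %s" % cleaned)
    return cleaned

If you are still using Python 2.7, use "import urllib" instead and replace the unquote line:

link_text = urllib.unquote(link_text)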
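One thing to keep in mind is that decoding can reintroduce whitespace ("%20" becomes a space, "%0A" a newline), and `strip` still only trims the ends. A possible Python 3 sketch that decodes first and then drops all whitespace is shown below; the function name `clean_encoded_url` and the `example.com` URL are made up for illustration, and the approach assumes the site's URLs never need a real space:

```python
from urllib.parse import unquote


def clean_encoded_url(link_text):
    # Turn percent-escapes such as %20 or %0A back into characters.
    decoded = unquote(link_text)
    # Remove every run of whitespace, wherever it appears in the URL.
    return ''.join(decoded.split())


result = clean_encoded_url("http://example.com/shop/%20%20?on_sale=1%0A")
print(result)
```

`unquote` leaves "+" untouched; `urllib.parse.unquote_plus` additionally turns "+" into a space, which may be what you want if the "+" signs come from form-style encoding. Decoding a whole URL can also change its meaning when it contains escaped reserved characters such as "%3F", so treat this as a starting point rather than a general-purpose sanitizer.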

Upvotes: 1
