Reputation: 313
I'm trying to scrape data from a web page, following all of its links. The site is badly built: in some parts of the pages the links contain spaces before and after the URL, so Scrapy follows them and the web server responds with 301 redirects, creating loops.
I tried to filter the link URLs, but it's impossible; they always come back with the spaces or a + symbol.
def cleanurl(link_text):
    print "original: ", link_text
    print "filter: ", link_text.strip("\s+\t\r\n '\"")
    return link_text.strip("\s+\t\r\n '\"")
    # return " ".join(link_text.strip("\t\r\n '\""))
    # return link_text.replace("\s", "").replace("\t", "").replace("\r", "").replace("\n", "").replace("'", "").replace("\"", "")
rules = (
    Rule(LinkExtractor(allow=(), deny=(), process_value=cleanurl)),
)
<a href="
 ?on_sale=1
" class="selectBox">ON SALE</a>
original: http://www.portshop.com/computers-networking-c_11257/ ?on_sale=1
filter: http://www.portshop.com/computers-networking-c_11257/ ?on_sale=1
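The output above shows why the attempt fails: `str.strip` treats its argument as a *set of individual characters* to remove from the ends of the string, not as a regex, so `"\s+"` does nothing useful and interior whitespace is never touched. A minimal sketch of the difference (the URL here is a made-up example):

```python
import re

s = "  http://example.com/page ?on_sale=1  "  # hypothetical link text with stray spaces

# strip() only removes leading/trailing characters from the given set;
# the space inside the URL survives
print(s.strip("\s+\t\r\n '\""))   # -> http://example.com/page ?on_sale=1

# a regex substitution removes every whitespace run, wherever it appears
print(re.sub(r"\s+", "", s))      # -> http://example.com/page?on_sale=1
```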
I tried regular expressions and other approaches, but I can't sanitize the URL; it works in some cases but not in others, changing the %20 (whitespace) to +.
Thanks!
Upvotes: 0
Views: 498
Reputation: 313
I have solved it: I added the following code to clean the URL, and now it is working properly. I hope it helps someone else with the same problem.
def cleanurl(link_text):
    return ''.join(link_text.split())
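As a quick check, this can be exercised against the example URL from the question; `split()` breaks the string on any whitespace run, and joining the pieces with an empty separator removes all of it, including spaces inside the URL:

```python
def cleanurl(link_text):
    # split on any whitespace (spaces, tabs, newlines), then rejoin with nothing
    return ''.join(link_text.split())

print(cleanurl(" http://www.portshop.com/computers-networking-c_11257/ ?on_sale=1 "))
# -> http://www.portshop.com/computers-networking-c_11257/?on_sale=1
```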
Thanks everybody !
Upvotes: 0
Reputation: 2286
You mention "%20" and "+" appearing in the URLs, which suggests the URLs are URL-encoded.
So before stripping any whitespace, you need to URL-decode them:
Using Python 3:
import urllib.parse

def cleanurl(link_text):
    print("original: ", link_text)
    link_text = urllib.parse.unquote(link_text)
    print("filter: ", link_text.strip("\t\r\n '\""))
    # note: str.strip() takes a set of characters, not a regex,
    # so "\s" and "+" in the set would strip literal characters
    return link_text.strip("\t\r\n '\"")
If you are still using Python 2.7, import `urllib` instead of `urllib.parse` and replace the unquote line with:

    link_text = urllib.unquote(link_text)
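One caveat worth knowing: `urllib.parse.unquote` decodes `%20` but leaves `+` untouched; `+` only means a space in form-encoded query strings, and `urllib.parse.unquote_plus` handles that case. A short demonstration with a made-up URL:

```python
from urllib.parse import unquote, unquote_plus

url = "http://example.com/page%20one?x=a+b"  # hypothetical encoded URL

print(unquote(url))       # -> http://example.com/page one?x=a+b   (%20 decoded, + kept)
print(unquote_plus(url))  # -> http://example.com/page one?x=a b   (%20 decoded, + becomes space)
```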
Upvotes: 1