Reputation: 879
I am using Nutch for crawling, but it fails on URLs that contain spaces. I have gone through this thread, http://lucene.472066.n3.nabble.com/URL-with-Space-td619127.html, but did not find a satisfactory answer.
It works for URLs in the seed.txt file but not for URLs found in the parsed content of a page.
I put a URL with spaces in the conf/seed.txt file; Nutch replaced the spaces with %20 and I was able to crawl the page. I have added the following to regex-normalize.xml:
<regex>
<pattern> </pattern>
<substitution>%20</substitution>
</regex>
I also added a reference to regex-normalize.xml in nutch-site.xml, but I am still facing the same problem.
Upvotes: 1
Views: 582
Reputation: 1491
I had the same problem and added this to my regex-normalize.xml:
<regex>
<pattern> </pattern>
<substitution>%20</substitution>
</regex>
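To check what this rule is meant to do without running a crawl, here is a minimal standalone sketch of the same substitution in plain Java. It does not use Nutch's normalizer classes; the class name SpaceRuleSketch and the method applyRule are made up for illustration, and the example URL is hypothetical.

```java
import java.util.regex.Pattern;

// Standalone illustration of the regex rule above: every literal space
// in the URL string is replaced with %20. Not Nutch code.
public class SpaceRuleSketch {
    // Same pattern as in the <pattern> element: a single space character.
    private static final Pattern SPACE = Pattern.compile(" ");

    // Hypothetical helper: applies the substitution to one URL string.
    static String applyRule(String url) {
        return SPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // prints http://example.com/my%20page.html
        System.out.println(applyRule("http://example.com/my page.html"));
    }
}
```

If the rule behaves like this in isolation but outlinks with spaces are still rejected, the rule is probably not being applied to outlinks, which is worth checking separately.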
Upvotes: 1
Reputation: 946
I had the same problem, but with more characters than just spaces, so I changed Fetcher.java. New URLs are added to the queue in the "feeding" section. You have to find this line:
nURL.set(url.toString());
and replace it with this:
nURL.set(URIUtil.encodeQuery(url.toString()));
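URIUtil here is presumably the class from Apache Commons HttpClient 3.x, whose encodeQuery can throw a checked exception, so the call may need a try/catch. To show the kind of escaping involved without that dependency, here is a rough sketch using only JDK classes. It is a substitute technique, not the method above: the class name UrlEncodeSketch and the method encode are made up, and the multi-argument java.net.URI constructor always quotes '%', so a URL that is already percent-encoded would be encoded twice.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// JDK-only sketch: percent-encode illegal characters (such as spaces)
// in the path and query of a URL. Not the Nutch or HttpClient API.
public class UrlEncodeSketch {
    // Hypothetical helper: splits the raw URL into components and lets
    // the multi-argument URI constructor quote illegal characters.
    static String encode(String raw) {
        try {
            URL u = new URL(raw); // lenient parser, accepts embedded spaces
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode URL: " + raw, e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/my%20page.html?q=a%20b
        System.out.println(encode("http://example.com/my page.html?q=a b"));
    }
}
```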
Upvotes: 1