abhijeet

Reputation: 879

How to crawl urls having space using Apache Nutch?

I am using Nutch for crawling, but it fails on URLs that contain spaces. I have gone through this thread, http://lucene.472066.n3.nabble.com/URL-with-Space-td619127.html, but did not find a satisfactory answer.

It works for URLs in the seed.txt file but won't work for URLs found in the parsed content of a page.

I put a URL containing spaces in the conf/seed.txt file; Nutch replaced each space with %20 and I was able to crawl the page. I have added the following to regex-normalize.xml:

<regex>
   <pattern> </pattern>
   <substitution>%20</substitution>
</regex>

I also added a reference to regex-normalize.xml in nutch-site.xml, but I am still facing the same problem.

Upvotes: 1

Views: 582

Answers (2)

Allan Macmillan

Reputation: 1491

I had the same problem and added this to my regex-normalize.xml:

<regex> 
   <pattern>&#x20;</pattern> 
   <substitution>%20</substitution> 
</regex> 
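To make the effect of this rule concrete, here is a minimal Java sketch of the same pattern and substitution applied with java.util.regex. It only illustrates the replacement, assuming the normalizer applies each rule as a plain regex replace-all; the class name SpaceRuleSketch and the example.com URL are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class SpaceRuleSketch {
    // Same pattern as the rule above: a single literal space.
    private static final Pattern SPACE = Pattern.compile(" ");

    // Replace every space in the URL with its percent-encoded form.
    static String normalize(String url) {
        return SPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only to show the substitution.
        System.out.println(normalize("http://example.com/my docs/file one.html"));
        // prints http://example.com/my%20docs/file%20one.html
    }
}
```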

Upvotes: 1

Mohsen ZareZardeyni

Reputation: 946

I had the same problem, but with more characters than just spaces, so I changed Fetcher.java. New URLs are added to the queue in the "feeding" section. Find this line:

nURL.set(url.toString());

and replace it with this:

nURL.set(URIUtil.encodeQuery(url.toString()));
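URIUtil here appears to be the helper class from Apache Commons HttpClient 3.x. If that dependency is not available, a rough JDK-only alternative is sketched below. It is not the same implementation as URIUtil.encodeQuery, and the class name EncodeOutlinkSketch and the example.com URL are made up for illustration. It relies on the multi-argument java.net.URI constructor, which quotes illegal characters such as spaces. Note that this constructor always quotes the percent character, so a URL that is already encoded would be encoded twice.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class EncodeOutlinkSketch {
    // Percent-encode illegal characters (spaces, for example) in a raw URL string.
    static String encode(String raw) {
        try {
            // java.net.URL splits the string into components without rejecting spaces.
            URL u = new URL(raw);
            // The multi-argument URI constructor quotes illegal characters per component.
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            // toASCIIString additionally encodes non-ASCII characters.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode URL: " + raw, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only to show the encoding.
        System.out.println(encode("http://example.com/my docs/a b.html?q=x y"));
        // prints http://example.com/my%20docs/a%20b.html?q=x%20y
    }
}
```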

Upvotes: 1
