Hadi Gol

Reputation: 15

Using Nutch 2.3, all my seed URLs are being rejected

I have 84 URLs in my dmoz/urls file. When I execute the command bin/nutch inject dmoz

I get the following output:

[ec2-user@ip-172-31-47-66 local]$ bin/nutch inject dmoz/
InjectorJob: starting at 2015-07-03 02:33:41
InjectorJob: Injecting urlDir: dmoz
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 84
InjectorJob: total number of urls injected after normalization and filtering: 0
Injector: finished at 2015-07-03 02:33:44, elapsed: 00:00:03
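
For reference, the inject step expects the argument to be a directory of plain-text seed files with one fully qualified URL per line. A minimal sketch of what a dmoz/urls file might contain (these URLs are placeholders, not my actual 84 seeds):

http://www.example.com/
http://www.example.org/catalog/index.html
http://en.wikipedia.org/wiki/Web_crawler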

All URLs are being rejected. Here is a snippet of my nutch/conf/regex-urlfilter.txt:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
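
Assuming the standard regex-urlfilter semantics (rules are evaluated top to bottom and the first matching rule decides), here is a rough sketch of how some hypothetical URLs would fare against the rules above; these are illustrative examples, not entries from my actual seed list:

ftp://ftp.example.com/pub/readme.txt     -> rejected by -^(file|ftp|mailto):
http://www.example.com/logo.png          -> rejected by the suffix rule (\.png$)
http://www.example.com/page?id=1         -> rejected by -[?*!@=]
http://www.example.com/about.html        -> accepted by the final +. rule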

Below is my hadoop.log output for this execution:

2015-07-03 02:33:41,095 INFO  crawl.InjectorJob - InjectorJob: starting at 2015-07-03 02:33:41
2015-07-03 02:33:41,096 INFO  crawl.InjectorJob - InjectorJob: Injecting urlDir: dmoz
2015-07-03 02:33:43,301 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
2015-07-03 02:33:43,329 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-07-03 02:33:43,389 WARN  snappy.LoadSnappy - Snappy native library not loaded
2015-07-03 02:33:44,278 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2015-07-03 02:33:44,430 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2015-07-03 02:33:44,768 INFO  crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 84
2015-07-03 02:33:44,768 INFO  crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 0
2015-07-03 02:33:44,769 INFO  crawl.InjectorJob - Injector: finished at 2015-07-03 02:33:44, elapsed: 00:00:03

I would highly appreciate it if someone could help me out with this. Basically, all my URLs are being rejected and I'm not sure why.

Thanks -Hadi

Upvotes: 0

Views: 1156

Answers (2)

aperfectpoint

Reputation: 51

If you are using the /local runtime environment, you do need to recompile for every change to a conf/ file in the source tree, unless you edit the runtime's own copy instead.

After you build Nutch's runtime (using "ant runtime"), the compilation creates the /local environment under $NUTCH_HOME/runtime/local. Under it there is a conf/ directory, which is essentially a copy of $NUTCH_HOME/conf. You can (and should) edit the files there in order to change the /local configuration.

Thus, if you want to change the name of your crawler, for example, edit $NUTCH_HOME/runtime/local/conf/nutch-site.xml and add/edit the property http.agent.name to whatever name you want.
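
For example (using a placeholder crawler name), the relevant block in $NUTCH_HOME/runtime/local/conf/nutch-site.xml would look roughly like this:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>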

Upvotes: 1

Hadi Gol

Reputation: 15

Well, after spending a lot of time trying to figure things out: since I had changed conf/regex-urlfilter.txt, I had to rebuild Nutch using "ant runtime", and then things worked. So my conclusion and lesson from the past two days is: always recompile Nutch after conf changes.
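
In other words, the workflow that ended up working was roughly this (paths assume the default $NUTCH_HOME layout):

# edit the filter in the source tree
vi $NUTCH_HOME/conf/regex-urlfilter.txt

# rebuild the local runtime so the change is picked up
cd $NUTCH_HOME
ant runtime

# re-run the inject from the rebuilt runtime
cd runtime/local
bin/nutch inject dmoz/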

Upvotes: 0
