tmsprgl

Reputation: 11

Can't run Nutch2 on Hadoop2 (Nutch 2.x + Hadoop 2.4.0 + HBase 0.94.18 + Gora 0.5 + Avro 1.7.6)

I need to install Nutch 2.3 on EMR with the configuration listed in the title.

Done on local computer:

  1. Nutch 2.x

1.1 Checked out the current 2.x version from svn

1.2 Edited the build scripts:

1.2.1 ivy:

    <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.4.0" />
    <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.4.0" />
    <dependency org="org.apache.gora" name="gora" rev="0.5" />
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.5" />

1.2.2 default.properties:

    hadoop.version=2.4.0
    version=2.3-SNAPSHOT

1.3 Added

    public int getFieldsCount() { return Field.values().length; }

to ProtocolStatus.java, ParseStatus.java, Host.java, and WebPage.java.
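To make step 1.3 concrete, here is a minimal sketch of what that method returns. `PageStub` and its three fields are hypothetical stand-ins, not the real Gora-generated classes; the only assumption carried over is that each generated class has a nested `Field` enum, as the added line implies.

```java
// Hypothetical stand-in for a generated bean such as WebPage.java.
// The real classes declare many more fields and extend a Gora base class.
class PageStub {
    // Stand-in for the nested Field enum that the added method relies on.
    enum Field {
        BASE_URL, STATUS, FETCH_TIME
    }

    // The method from step 1.3: the number of fields the bean declares.
    public int getFieldsCount() {
        return Field.values().length;
    }

    public static void main(String[] args) {
        System.out.println(new PageStub().getFieldsCount());
    }
}
```

Because the count is derived from the enum, it stays correct if the schema (and therefore the enum) is regenerated with more or fewer fields.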

  2. HBase

2.1 Checked out HBase 0.94.18 from svn

2.2 Adapted it for Protobuf 2.5.0, with thanks to Dobromyslov [ https://github.com/dobromyslov ]

2.3 Built hbase-0.94.18-hadoop-2.4.0.jar

  3. Gora 0.5 (also tested with versions 0.4, 0.6-SNAPSHOT, and 0.5.3 from com.argonio.gora)

  4. Avro 1.7.6 (also tried versions 1.7.4 and 1.7.7)

4.1 Checked out from svn

4.2 Applied the patch for AVRO-813

4.3 Applied the patch for AVRO-882, then rolled it back

4.4 Patched as in [1]: commented out the throwing of EOFException at

org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473),

etc.

After numerous exceptions, I made some changes to Nutch 2.x and Avro 1.7.6.

Nutch now runs after a fashion, but it is unstable and produces incorrect results.

The cycle (inject, generate, fetch, parse, updatedb) completes, but some functionality is broken or silently skipped.

It seems that I broke the normal data exchange between Nutch and HBase (via Gora and Avro). Some fields (and/or some data formats) are read and written incorrectly. For example, many markers are lost (I temporarily emulate them in code), data in the batchId field is lost, and scoring is broken as well.

Please help! I'm ready to publish all my diffs and exception traces.

[1] http://mail-archives.apache.org/mod_mbox/nutch-user/201409.mbox/%3cCAEmTxX9HrRM00SxerFAdRdZy=wVAd9xCchDTuLaxPQ=wi0QEsw@mail.gmail.com%3e

Upvotes: 1

Views: 946

Answers (1)

Sergey Weiss

Reputation: 5974

We solved the problem with EOFExceptions and instability by setting the io.serializations property in conf/nutch-site.xml back to its old (i.e., hadoop-1.2.0) value:

    <property>
      <name>io.serializations</name>
      <value>org.apache.hadoop.io.serializer.WritableSerialization</value>
      <description>A list of serialization classes that can be used for
      obtaining serializers and deserializers.</description>
    </property>

It also turned out that patching Avro is not needed.
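If you want to confirm that the override is actually present in the file you deploy, a small check along these lines may help. It is only a sketch using the JDK's XML parser: `SiteConfigCheck` and `propertyValue` are hypothetical names, and the check reads the XML text directly, so it does not reflect how Hadoop merges several configuration files at runtime.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: looks up a property in Hadoop-style configuration XML.
class SiteConfigCheck {

    // Returns the <value> of the first <property> whose <name> matches,
    // or null if no such property exists.
    static String propertyValue(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Inline sample mirroring the snippet above; in practice you would
        // read the contents of conf/nutch-site.xml instead.
        String xml = "<configuration><property>"
                + "<name>io.serializations</name>"
                + "<value>org.apache.hadoop.io.serializer.WritableSerialization</value>"
                + "</property></configuration>";
        System.out.println(propertyValue(xml, "io.serializations"));
    }
}
```

With the setting from this answer in place, the value should list only WritableSerialization.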

Upvotes: 1
