Reputation: 11
I need to install Nutch 2.3 on EMR in the configuration given in the title.
Done on local computer:
1.1 checked out the current Nutch 2.x version from svn
1.2 prepared the build scripts:
1.2.1 ivy:
<dependency org="org.apache.hadoop" name="hadoop-common" rev="2.4.0" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.4.0" />
<dependency org="org.apache.gora" name="gora" rev="0.5" />
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" />
1.2.2 default.properties:
hadoop.version=2.4.0
version=2.3-SNAPSHOT
1.3. added
public int getFieldsCount() { return Field.values().length; }
to ProtocolStatus.java, ParseStatus.java, Host.java, WebPage.java.
2.1 checked out HBase 0.94.18 from svn
2.2 adapted it for Protobuf 2.5.0, thanks to Dobromyslov [ https://github.com/dobromyslov ]
2.3 built hbase-0.94.18-hadoop-2.4.0.jar
3. Gora 0.5 (also tested with versions 0.4, 0.6-SNAPSHOT, and 0.5.3 from com.argonio.gora)
4. Avro 1.7.6 (also tried versions 1.7.4 and 1.7.7)
4.1 checked out from svn
4.2 patched for AVRO-813
4.3 patched for AVRO-882, then rolled back
4.4 patched as in [1]: commented out the throwing of EOFException at
org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473),
etc.
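For readers unfamiliar with step 1.3, here is a minimal sketch of what that added method looks like inside a class that declares a Field enum. DemoRecord and its field names are hypothetical stand-ins, not the real Nutch or Gora classes.

```java
// Hypothetical stand-in for a persisted class such as WebPage.
// The enum constants are illustrative only, not the real Nutch schema.
class DemoRecord {
    enum Field { BASE_URL, STATUS, FETCH_TIME }

    // Same shape as the method added in step 1.3: the count is simply
    // the number of constants declared in the Field enum.
    public int getFieldsCount() {
        return Field.values().length;
    }

    public static void main(String[] args) {
        System.out.println(new DemoRecord().getFieldsCount());
    }
}
```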
After numerous exceptions, I made some changes to Nutch 2.x and Avro 1.7.6.
Nutch now more or less runs, but it is unstable and produces incorrect results.
The full cycle (inject, generate, fetch, parse, updatedb) completes, but some functionality is broken or silently skipped.
It seems that I broke the normal data exchange between Nutch and HBase (via Gora and Avro). Some fields (and/or some data formats) are read and written incorrectly. For example, many markers are lost (temporarily emulated in code), data in the batchId field is lost, and scoring is broken as well.
Please help! I'm ready to publish all my diffs and exception traces.
Upvotes: 1
Views: 946
Reputation: 5974
We solved the problem with EOFExceptions and instability by setting the old (i.e., hadoop-1.2.0) value for the io.serializations
property in conf/nutch-site.xml:
<property>
<name>io.serializations</name>
<value>org.apache.hadoop.io.serializer.WritableSerialization</value>
<description>A list of serialization classes that can be used for
obtaining serializers and deserializers.</description>
</property>
And it turned out that patching Avro is not needed.
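To double-check that the override is actually present in the file, something like the following could be used. It is only a sketch using the JDK's XML parser: Hadoop resolves properties through its own Configuration class, and the SerializationCheck class, the propertyValue helper, and the embedded SAMPLE string are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch or Hadoop.
class SerializationCheck {
    // Sample content shaped like conf/nutch-site.xml (assumed layout).
    static final String SAMPLE =
        "<configuration>"
        + "<property>"
        + "<name>io.serializations</name>"
        + "<value>org.apache.hadoop.io.serializer.WritableSerialization</value>"
        + "</property>"
        + "</configuration>";

    // Returns the trimmed <value> of the first <property> whose <name>
    // matches, or null if no such property exists.
    static String propertyValue(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked handling.
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(propertyValue(SAMPLE, "io.serializations"));
    }
}
```

In practice one would read the real conf/nutch-site.xml into a string instead of using SAMPLE.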
Upvotes: 1