user3752941

Reputation: 71

Sqoop import to avrodatafile or Parquet files failing in dataproc clusters

When we run Sqoop import on GCP Dataproc clusters with either --as-avrodatafile or --as-parquetfile, it fails with the errors below. Import to a text file works, however. It feels like some additional JARs are required.

The required Sqoop jars are loaded from GCS.

Command used:

gcloud dataproc jobs submit hadoop \
            --cluster={cluster_name} \
            --region=us-central1 \
            --class=org.apache.sqoop.Sqoop --jars={sqoop_jars_gcs}/sqoop-1.4.7.jar,{sqoop_jars_gcs}/avro-1.8.2.jar,{sqoop_jars_gcs}/terajdbc4.jar,{sqoop_jars_gcs}/log4j-1.2.17.jar,{sqoop_jars_gcs}/sqoop-connector-teradata-1.2c5.jar,{sqoop_jars_gcs}/tdgssconfig.jar,{sqoop_jars_gcs}/avro-1.8.2.jar \
            -- import \
            -Dmapreduce.job.user.classpath.first=true \
            -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
            --connect={db_connection}DATABASE={source_db} \
            --username={userid} \
            --password-file {passfile}  \
            --driver com.teradata.jdbc.TeraDriver \
            -e "sql query AND \$CONDITIONS" \
            --target-dir=<dir> \
            --delete-target-dir \
            --as-<avrodatafile/parquetfile> \
            --split-by <column>

Error when running with --as-avrodatafile (avro-1.8.2.jar is on the classpath, but still no luck):

INFO - Error: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
INFO -      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:135)
INFO -      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
INFO -      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
INFO -      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
INFO -      at java.security.AccessController.doPrivileged(Native Method)
INFO -      at javax.security.auth.Subject.doAs(Subject.java:422)
INFO -      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
INFO -      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
INFO - Caused by: java.lang.reflect.InvocationTargetException
INFO -      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
INFO -      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
INFO -      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
INFO -      at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
INFO -      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
INFO -      ... 7 more
INFO - Caused by: java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
INFO -      at org.apache.sqoop.mapreduce.AvroImportMapper.<init>(AvroImportMapper.java:43)
INFO -      ... 12 more
INFO - Caused by: java.lang.ClassNotFoundException: org.apache.avro.mapred.AvroWrapper
INFO -      at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
INFO -      at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
INFO -      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
INFO -      at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
INFO -      ... 13 more

Error when running with --as-parquetfile:

INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO - at java.lang.reflect.Method.invoke(Method.java:498)
INFO - at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
INFO - Caused by: java.lang.NoClassDefFoundError: org/kitesdk/data/mapreduce/DatasetKeyOutputFormat
INFO - at org.apache.sqoop.mapreduce.DataDrivenImportJob.getOutputFormatClass(DataDrivenImportJob.java:213)
INFO - at org.apache.sqoop.mapreduce.ImportJobBase.configureOutputFormat(ImportJobBase.java:98)
INFO - at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:263)
INFO - at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:748)
INFO - at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:522)
INFO - at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
INFO - at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
INFO - at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
INFO - at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
INFO - at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
INFO - at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
INFO - at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
INFO - ... 5 more
INFO - Caused by: java.lang.ClassNotFoundException: org.kitesdk.data.mapreduce.DatasetKeyOutputFormat
INFO - at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
INFO - at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
INFO - ... 17 more
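Both failures are the same class of problem: a class needed at runtime is missing from the job classpath. org.apache.avro.mapred.AvroWrapper ships in the separate avro-mapred artifact, not in avro-1.8.2.jar, and org.kitesdk.data.mapreduce.DatasetKeyOutputFormat ships in kite-data-mapreduce, which is not in the --jars list at all. A quick local way to check which downloaded jar actually provides a class (a jar is just a zip archive; this uses Python's zipfile lister so it works even without unzip installed, and the jar file names are whatever you have downloaded):

```shell
# Scan every jar in the current directory for the classes the tasks failed to load.
for cls in 'org/apache/avro/mapred/AvroWrapper' \
           'org/kitesdk/data/mapreduce/DatasetKeyOutputFormat'; do
  for j in *.jar; do
    # A jar is a zip, so list its entries and look for the class file path.
    python3 -m zipfile -l "$j" 2>/dev/null | grep -q "$cls" \
      && echo "$cls -> $j" || true
  done
done
```

Whichever jars turn up in the output are the ones that must be added to --jars for the corresponding import format to work.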

Upvotes: 1

Views: 892

Answers (2)

xgMz

Reputation: 3354

This worked for me only with a very specific set of jar versions, mostly from Cloudera, as follows:

# Jars used:
#   https://repo1.maven.org/maven2/org/apache/parquet/parquet-format/2.9.0/parquet-format-2.9.0.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/sqoop/sqoop/1.4.7.7.2.10.0-148/sqoop-1.4.7.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-data-core/1.0.0-cdh6.3.4/kite-data-core-1.0.0-cdh6.3.4.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-data-mapreduce/1.0.0-cdh6.3.4/kite-data-mapreduce-1.0.0-cdh6.3.4.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-hadoop-compatibility/1.0.0-cdh6.3.4/kite-hadoop-compatibility-1.0.0-cdh6.3.4.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/avro/avro/1.8.2.7.2.10.0-148/avro-1.8.2.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/avro/avro-mapred/1.8.2.7.2.10.0-148/avro-mapred-1.8.2.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-avro/1.10.99.7.2.10.0-148/parquet-avro-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-common/1.10.99.7.2.10.0-148/parquet-common-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-column/1.10.99.7.2.10.0-148/parquet-column-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-hadoop/1.10.99.7.2.10.0-148/parquet-hadoop-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-jackson/1.10.99.7.2.10.0-148/parquet-jackson-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-encoding/1.10.99.7.2.10.0-148/parquet-encoding-1.10.99.7.2.10.0-148.jar
 
export CLUSTER_NAME=
export CLUSTER_REGION=us-central1 # update accordingly
export GCS_BUCKET=""  # name only
 
export DRIVER_CLASS=com.mysql.jdbc.Driver
export CONNECT_STRING="jdbc:..."
export USERNAME=
export PASSWORD=""  # testing only - use --password-file instead
export TABLE=
 
export JDBC_JAR=gs://${GCS_BUCKET}/sqoop/jars/mysql-connector-java-5.0.8-bin.jar
 
export SQOOP_JAR=gs://${GCS_BUCKET}/sqoop/jars/cloudera/sqoop-1.4.7.7.2.10.0-148.jar
export AVRO_JAR1=gs://${GCS_BUCKET}/sqoop/jars/cloudera/avro-1.8.2.7.2.10.0-148.jar
export AVRO_JAR2=gs://${GCS_BUCKET}/sqoop/jars/cloudera/avro-mapred-1.8.2.7.2.10.0-148.jar
export PARQUET_JAR1=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-data-core-1.0.0-cdh6.3.4.jar
export PARQUET_JAR2=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-data-mapreduce-1.0.0-cdh6.3.4.jar
export PARQUET_JAR3=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-hadoop-compatibility-1.0.0-cdh6.3.4.jar
export PARQUET_JAR4=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-common-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR5=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-avro-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR6=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-hadoop-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR7=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-column-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR8=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-encoding-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR9=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-jackson-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR10=gs://${GCS_BUCKET}/sqoop/jars/parquet-format-2.9.0.jar
 
gcloud dataproc jobs submit hadoop \
--cluster=${CLUSTER_NAME} \
--class=org.apache.sqoop.Sqoop \
--region=${CLUSTER_REGION} \
--jars=${JDBC_JAR},${SQOOP_JAR},${AVRO_JAR1},${AVRO_JAR2},${PARQUET_JAR1},${PARQUET_JAR2},${PARQUET_JAR3},${PARQUET_JAR4},${PARQUET_JAR5},${PARQUET_JAR6},${PARQUET_JAR7},${PARQUET_JAR8},${PARQUET_JAR9},${PARQUET_JAR10} \
-- import \
-Dmapreduce.job.user.classpath.first=true \
-Dparquetjob.configurator.implementation=hadoop \
--driver ${DRIVER_CLASS} \
--connect=${CONNECT_STRING} \
--username=${USERNAME} \
--password=${PASSWORD} \
--target-dir="gs://${GCS_BUCKET}/sqoop/out/parquet_output4/" \
--table=${TABLE} \
--delete-target-dir \
--as-parquetfile \
-m 1 \
--verbose
 
# --parquet-configurator-implementation kite \
#--compression-codec snappy \
#--query=""
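Not part of the answer above, but with this many jars the --jars value is easy to get wrong; one option is to collect the URIs in a bash array (reusing the exported variable names above) and let IFS produce the comma-separated value:

```shell
# Build the comma-separated --jars value from a bash array of GCS URIs.
jar_uris=(
  "${JDBC_JAR}" "${SQOOP_JAR}"
  "${AVRO_JAR1}" "${AVRO_JAR2}"
  "${PARQUET_JAR1}" "${PARQUET_JAR2}" "${PARQUET_JAR3}" "${PARQUET_JAR4}"
  "${PARQUET_JAR5}" "${PARQUET_JAR6}" "${PARQUET_JAR7}" "${PARQUET_JAR8}"
  "${PARQUET_JAR9}" "${PARQUET_JAR10}"
)
# "${jar_uris[*]}" joins the elements with the first character of IFS.
JARS=$(IFS=, ; echo "${jar_uris[*]}")
# then pass: --jars=${JARS}
```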

Upvotes: 0

GenericDisplayName

Reputation: 463

I ran into a similar issue. Using the following resources, I was able to change the kite-sdk dependency to a newer version; I had to download the latest kite-sdk jars.

https://community.cloudera.com/t5/Support-Questions/Issue-when-using-parquet-org-kitesdk-data/td-p/128233

https://discuss.cloudxlab.com/t/sqoop-import-to-hive-as-parquet-file-is-failing/1089/6

Upvotes: 0
