Reputation: 71
When we run sqoop import in the GCP Dataproc clusters to either avrodatafile or parquetfile, it fails with the errors below. Import to textfile works, however, so it feels like some additional JARs are required.
The required Sqoop jars are loaded from GCS.
COMMAND used:
gcloud dataproc jobs submit hadoop \
--cluster={cluster_name} \
--region=us-central1 \
--class=org.apache.sqoop.Sqoop --jars={sqoop_jars_gcs}/sqoop-1.4.7.jar,{sqoop_jars_gcs}/avro-1.8.2.jar,{sqoop_jars_gcs}/terajdbc4.jar,{sqoop_jars_gcs}/log4j-1.2.17.jar,{sqoop_jars_gcs}/sqoop-connector-teradata-1.2c5.jar,{sqoop_jars_gcs}/tdgssconfig.jar \
-- import \
-Dmapreduce.job.user.classpath.first=true \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect={db_connection}DATABASE={source_db} \
--username={userid} \
--password-file {passfile} \
--driver com.teradata.jdbc.TeraDriver \
-e "sql query AND \$CONDITIONS" \
--target-dir=<dir> \
--delete-target-dir \
--as-<avrodatafile/parquetfile> \
--split-by <column>
Error when running --as-avrodatafile (avro-1.8.2.jar is on the classpath, but still no luck):
INFO - Error: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
INFO - at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:135)
INFO - at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
INFO - at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
INFO - at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
INFO - at java.security.AccessController.doPrivileged(Native Method)
INFO - at javax.security.auth.Subject.doAs(Subject.java:422)
INFO - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
INFO - at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
INFO - Caused by: java.lang.reflect.InvocationTargetException
INFO - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
INFO - at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
INFO - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
INFO - at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
INFO - at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
INFO - ... 7 more
INFO - Caused by: java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
INFO - at org.apache.sqoop.mapreduce.AvroImportMapper.<init>(AvroImportMapper.java:43)
INFO - ... 12 more
INFO - Caused by: java.lang.ClassNotFoundException: org.apache.avro.mapred.AvroWrapper
INFO - at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
INFO - at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
INFO - ... 13 more
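One way to diagnose a NoClassDefFoundError like this is to check which jar actually contains the missing class — a jar is just a zip archive, so its entries can be listed. AvroWrapper, for instance, ships in the avro-mapred artifact, not in the core avro jar. A sketch, assuming a local copy of the jar (the file name below is an assumption):

```shell
# org.apache.avro.mapred.AvroWrapper lives in avro-mapred, so having
# avro-1.8.2.jar alone on the classpath cannot satisfy the mapper;
# listing the jar's entries confirms which artifact provides the class:
unzip -l avro-mapred-1.8.2.jar | grep 'org/apache/avro/mapred/AvroWrapper'
```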
Error when running --as-parquetfile:
INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO - at java.lang.reflect.Method.invoke(Method.java:498)
INFO - at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
INFO - Caused by: java.lang.NoClassDefFoundError: org/kitesdk/data/mapreduce/DatasetKeyOutputFormat
INFO - at org.apache.sqoop.mapreduce.DataDrivenImportJob.getOutputFormatClass(DataDrivenImportJob.java:213)
INFO - at org.apache.sqoop.mapreduce.ImportJobBase.configureOutputFormat(ImportJobBase.java:98)
INFO - at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:263)
INFO - at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:748)
INFO - at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:522)
INFO - at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
INFO - at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
INFO - at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
INFO - at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
INFO - at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
INFO - at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
INFO - at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
INFO - ... 5 more
INFO - Caused by: java.lang.ClassNotFoundException: org.kitesdk.data.mapreduce.DatasetKeyOutputFormat
INFO - at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
INFO - at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
INFO - ... 17 more
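For the parquet case, the missing class belongs to the Kite SDK, which Sqoop 1.4.7 uses for --as-parquetfile. A sketch of staging the Kite jars in GCS so they can be passed via --jars (the bucket path and jar versions below are assumptions, not a known-working set):

```shell
# DatasetKeyOutputFormat is provided by kite-data-mapreduce; it also
# needs kite-data-core and kite-hadoop-compatibility at runtime.
# Stage them in GCS, then append them to the --jars list of the
# gcloud dataproc jobs submit command:
gsutil cp kite-data-core-1.1.0.jar \
          kite-data-mapreduce-1.1.0.jar \
          kite-hadoop-compatibility-1.1.0.jar \
          gs://my-bucket/sqoop/jars/
```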
Upvotes: 1
Views: 892
Reputation: 3354
This worked for me only when using a very specific set of jar versions, mostly from Cloudera, as follows:
# Jars used:
# https://repo1.maven.org/maven2/org/apache/parquet/parquet-format/2.9.0/parquet-format-2.9.0.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/sqoop/sqoop/1.4.7.7.2.10.0-148/sqoop-1.4.7.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-data-core/1.0.0-cdh6.3.4/kite-data-core-1.0.0-cdh6.3.4.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-data-mapreduce/1.0.0-cdh6.3.4/kite-data-mapreduce-1.0.0-cdh6.3.4.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-hadoop-compatibility/1.0.0-cdh6.3.4/kite-hadoop-compatibility-1.0.0-cdh6.3.4.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/avro/avro/1.8.2.7.2.10.0-148/avro-1.8.2.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/avro/avro-mapred/1.8.2.7.2.10.0-148/avro-mapred-1.8.2.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-avro/1.10.99.7.2.10.0-148/parquet-avro-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-common/1.10.99.7.2.10.0-148/parquet-common-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-column/1.10.99.7.2.10.0-148/parquet-column-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-hadoop/1.10.99.7.2.10.0-148/parquet-hadoop-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-jackson/1.10.99.7.2.10.0-148/parquet-jackson-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-encoding/1.10.99.7.2.10.0-148/parquet-encoding-1.10.99.7.2.10.0-148.jar
export CLUSTER_NAME=
export CLUSTER_REGION=us-central1 # update accordingly
export GCS_BUCKET="" # name only
export DRIVER_CLASS=com.mysql.jdbc.Driver
export CONNECT_STRING="jdbc:..."
export USERNAME=
export PASSWORD="" # testing only - use password-file
export TABLE=
export JDBC_JAR=gs://${GCS_BUCKET}/sqoop/jars/mysql-connector-java-5.0.8-bin.jar
export SQOOP_JAR=gs://${GCS_BUCKET}/sqoop/jars/cloudera/sqoop-1.4.7.7.2.10.0-148.jar
export AVRO_JAR1=gs://${GCS_BUCKET}/sqoop/jars/cloudera/avro-1.8.2.7.2.10.0-148.jar
export AVRO_JAR2=gs://${GCS_BUCKET}/sqoop/jars/cloudera/avro-mapred-1.8.2.7.2.10.0-148.jar
export PARQUET_JAR1=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-data-core-1.0.0-cdh6.3.4.jar
export PARQUET_JAR2=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-data-mapreduce-1.0.0-cdh6.3.4.jar
export PARQUET_JAR3=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-hadoop-compatibility-1.0.0-cdh6.3.4.jar
export PARQUET_JAR4=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-common-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR5=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-avro-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR6=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-hadoop-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR7=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-column-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR8=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-encoding-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR9=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-jackson-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR10=gs://${GCS_BUCKET}/sqoop/jars/parquet-format-2.9.0.jar
gcloud dataproc jobs submit hadoop \
--cluster=${CLUSTER_NAME} \
--class=org.apache.sqoop.Sqoop \
--region=${CLUSTER_REGION} \
--jars=${JDBC_JAR},${SQOOP_JAR},${AVRO_JAR1},${AVRO_JAR2},${PARQUET_JAR1},${PARQUET_JAR2},${PARQUET_JAR3},${PARQUET_JAR4},${PARQUET_JAR5},${PARQUET_JAR6},${PARQUET_JAR7},${PARQUET_JAR8},${PARQUET_JAR9},${PARQUET_JAR10} \
-- import \
-Dmapreduce.job.user.classpath.first=true \
-Dparquetjob.configurator.implementation=hadoop \
--driver ${DRIVER_CLASS} \
--connect=${CONNECT_STRING} \
--username=${USERNAME} \
--password=${PASSWORD} \
--target-dir="gs://${GCS_BUCKET}/sqoop/out/parquet_output4/" \
--table=${TABLE} \
--delete-target-dir \
--as-parquetfile \
-m 1 \
--verbose
# --parquet-configurator-implementation kite \
# --compression-codec snappy \
# --query=""
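If the import uses a free-form query instead of --table (as in the question above), Sqoop requires the literal $CONDITIONS token in the WHERE clause, escaped so the shell does not expand it. A sketch of the flags that would replace --table in the command above (the table name and split column are placeholders):

```shell
# substitute for --table=${TABLE}; \$CONDITIONS is mandatory with --query,
# and --split-by is required when running more than one mapper
--query="SELECT * FROM ${TABLE} WHERE \$CONDITIONS" \
--split-by=id \
```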
Upvotes: 0
Reputation: 463
Ran into a similar issue. The resource below helped me change the kite-sdk dependency to a newer version; I had to download the latest kite-sdk jars.
https://discuss.cloudxlab.com/t/sqoop-import-to-hive-as-parquet-file-is-failing/1089/6
Upvotes: 0