Reputation: 159
I am trying to import Oracle DB data into HDFS using Sqoop import free-form query by joining two tables using '--as-avrodatafile' using Oozie scheduler. Following is the content of my workflow.xml:
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-freeform-wf">
<start to="sqoop-freeform-node"/>
<action name="sqoop-freeform-node">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="/apps/hive/warehouse/loc_avro_import"/>
</prepare>
<arg>import</arg>
<arg>--connect</arg>
<arg>jdbc:oracle:thin:@connection-string:1521:ORCL</arg>
<arg>--username</arg>
<arg>comcast</arg>
<arg>--password</arg>
<arg>comcast123</arg>
<arg>--query</arg>
<arg>select location.location_id, location.street1,location_meta.display_name from location join location_meta on location.location_id=location_meta.location_id WHERE $CONDITIONS</arg>
<arg>--target-dir</arg>
<arg>/apps/hive/warehouse/loc_avro_import</arg>
<arg>--split-by</arg>
<arg>location.location_id</arg>
<arg>--as-avrodatafile</arg>
<arg>-m</arg>
<arg>1</arg>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Sqoop free form failed</message>
</kill>
<end name="end"/>
</workflow-app>
The Oozie job runs successfully and also creates an Avro file on HDFS under directory /apps/hive/warehouse/loc_avro_import along with _SUCCESS flag. Then I create an external table on this path using following Hive script:
CREATE external TABLE avro_location(LOCATION_ID string, STREET1 string, DISPLAY_NAME string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
location
'/apps/hive/warehouse/loc_avro_import';
The table also is created successfully, but when I try to retry the records using Hive shell it returns same number of rows which it returned when I executed free-form query in Oracle. But data comes as NULL for all the rows. I also tried to see the INFO logs of hive using following command:
hive --hiveconf hive.root.logger=INFO,console
Below is the output that I get:
hive> select * from avro_location;
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO parse.ParseDriver: Parsing command: select * from avro_location
15/10/23 15:12:02 [main]: INFO parse.ParseDriver: Parse Completed
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=parse start=1445627522004 end=1445627522004 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Starting Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed phase 1 of Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for source tables
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for subqueries
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for destination tables
15/10/23 15:12:02 [main]: INFO ql.Context: New scratch dir is hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed getting MetaData in Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.BaseSemanticAnalyzer: Not invoking CBO because the statement has too few joins
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: columnComments is
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: Avro schema is {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1/-mr-10000/.hive-staging_hive_2015-10-23_15-12-02_004_2341151357389322335-1
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Set stats collection dir : hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1/-mr-10000/.hive-staging_hive_2015-10-23_15-12-02_004_2341151357389322335-1/-ext-10002
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for FS(2)
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for SEL(1)
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for TS(0)
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed plan generation
15/10/23 15:12:02 [main]: INFO ql.Driver: Semantic Analysis Completed
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=semanticAnalyze start=1445627522005 end=1445627522040 duration=35 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: columnComments is
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: Avro schema is {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initializing operator TS[0]
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initialization Done 0 TS done is reset.
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Operator 0 TS initialized
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initializing children of 0 TS
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing child 1 SEL
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing operator SEL[1]
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: SELECT struct<location_id:string,street1:string,display_name:string>
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initialization Done 1 SEL done is reset.
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Operator 1 SEL initialized
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing children of 1 SEL
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initializing child 3 OP
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initializing operator OP[3]
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initialization Done 3 OP done is reset.
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Operator 3 OP initialized
15/10/23 15:12:02 [main]: INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:avro_location.location_id, type:string, comment:null), FieldSchema(name:avro_location.street1, type:string, comment:null), FieldSchema(name:avro_location.display_name, type:string, comment:null)], properties:null)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=compile start=1445627522003 end=1445627522041 duration=38 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=acquireReadWriteLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO lockmgr.DbTxnManager: Setting lock request transaction to txnid:0 for queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012
15/10/23 15:12:02 [main]: INFO lockmgr.DbLockManager: Requesting: queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012 LockRequest(component:[LockComponent(type:SHARED_READ, level:TABLE, dbname:default, tablename:avro_location)], txnid:0, user:root, hostname:ip-sandbox.hortonworks.com)
15/10/23 15:12:02 [main]: INFO lockmgr.DbLockManager: Response to queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012 LockResponse(lockid:78, state:ACQUIRED)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=acquireReadWriteLocks start=1445627522041 end=1445627522050 duration=9 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO ql.Driver: Starting command(queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012): select * from avro_location
15/10/23 15:12:02 [main]: INFO hooks.ATSHook: Created ATS Hook
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook start=1445627522050 end=1445627522050 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=TimeToSubmit start=1445627522003 end=1445627522050 duration=47 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=runTasks start=1445627522051 end=1445627522051 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO hooks.ATSHook: Created ATS Hook
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=PostHook.org.apache.hadoop.hive.ql.hooks.ATSHook from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=PostHook.org.apache.hadoop.hive.ql.hooks.ATSHook start=1445627522051 end=1445627522051 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=Driver.execute start=1445627522050 end=1445627522051 duration=1 from=org.apache.hadoop.hive.ql.Driver>
OK
15/10/23 15:12:02 [main]: INFO ql.Driver: OK
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1445627522052 end=1445627522118 duration=66 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=Driver.run start=1445627522003 end=1445627522118 duration=115 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO mapred.FileInputFormat: Total input paths to process : 1
15/10/23 15:12:02 [main]: INFO avro.AvroGenericRecordReader: Found the avro schema in the job: {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO avro.AvroDeserializer: Adding new valid RRID :678ef2a1:150961947b5:-7fff
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: 0 finished. closing...
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: 1 finished. closing...
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: 3 finished. closing...
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: 3 Close done
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: 1 Close done
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: 0 Close done
Time taken: 0.115 seconds, Fetched: 30 row(s)
15/10/23 15:12:02 [main]: INFO CliDriver: Time taken: 0.115 seconds, Fetched: 30 row(s)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1445627522136 end=1445627522136 duration=0 from=org.apache.hadoop.hive.ql.Driver>
As mentioned above it fetches 30 rows but all the values are NULL. Can anyone please help how can I resolve this issue.
Upvotes: 2
Views: 1761
Reputation: 1
SQOOP Import generte the Avsc file in local.. copy it to schema folder. Query your table.
Upvotes: 0
Reputation: 1340
I think you need to specify avro schema location for the table like this: TBLPROPERTIES ('avro.schema.url'='/apps/hive/warehouse/loc_avro_import/schemaname.avsc');
Upvotes: 0