Reputation: 1
I am getting 'Hands-on-Experience' with hadoop, sqoop, pig, flume, ...
In my local mysql
schema I have a table called Employee
with following structure:
`emp_id` int(11) NOT NULL AUTO_INCREMENT
`first_name` varchar(30) NOT NULL
`last_name` varchar(30) NOT NULL
`create_date` datetime NOT NULL
Employee table had four rows.
I ran the following sqoop command:
sqoop --options-file import.txt \
--query "select 1 as emp_id, 'Barry' as first_name, 'Williams' as last_name, '2016-04-20 15:41:00' as create_date from test.Employee where \$CONDITIONS" \
--target-dir /user/<username>/Employee \
--split-by emp_id \
-m 1
In the sqoop
command select ...
has data for only one row.
So, only one row should be inserted.
Result of the sqoop
command:
When I ran following command:
hdfs dfs -cat /user/<username>/Employee/part-m-00000
I got this:
1,Barry,Williams,2016-04-20 15:41:00
1,Barry,Williams,2016-04-20 15:41:00
1,Barry,Williams,2016-04-20 15:41:00
1,Barry,Williams,2016-04-20 15:41:00
Questions:
1) Why were four rows inserted instead of one?
2) Is it because there were four rows in the table when the `sqoop` command ran?
3) Is this a bug?
Thanks in advance.
Upvotes: 0
Views: 1347
Reputation: 47
No, it's not a bug. You are querying in wrong way.
You need to add LIMIT to your sql query. Updated query will look like following:
sqoop --options-file import.txt \
--query "select 1 as emp_id, 'Barry' as first_name, 'Williams' as last_name, '2016-04-20 15:41:00' as create_date from test.Employee LIMIT 1 where \$CONDITIONS" \
--target-dir /user/<username>/Employee \
--split-by emp_id \
-m 1
Upvotes: 2
Reputation: 1
Sqoop is working fine. Try to run this query against the database and you will see the output will be equal to the number of rows in that table.
Upvotes: 0
Reputation: 75
I dont know why you are getting 4 records. I am getting 1 record only in my system. Please add Limit 1 at the end of select... query after WHERE $CONDITIONS and see. hope this may work
Upvotes: 0
Reputation: 522
I am not sure if this is a bug but this is interesting, I never tried to execute sqoop command that way.
--split-by
the specified column (primary key) is used by sqoop to split work units.
-m 1
is forcing sqoop to use only 1 mapper.
You have a free form query import, based on the query sqoop should create only 1 row. My assumption is that you have both --split-by
& -m 1
options/arguments passed to the scoop; maybe --split-by
takes precedence over -m
. Usually sqoop executes with 4 mappers only job when -m
are not specified, I guess each mapper created 1 row of hardcoded fields in the sql statement.
Try your sqoop command without the --split-by
argument.
Upvotes: 0