John Berchmans
John Berchmans

Reputation: 1

Why "Sqoop import --query ..." inserts multiple rows when only one row should be inserted?

I am getting 'Hands-on-Experience' with hadoop, sqoop, pig, flume, ...

In my local mysql schema I have a table called Employee with following structure:

`emp_id` int(11) NOT NULL AUTO_INCREMENT
`first_name` varchar(30) NOT NULL
`last_name` varchar(30) NOT NULL
`create_date` datetime NOT NULL

Employee table had four rows.

I ran the following sqoop command:

sqoop --options-file import.txt \
--query "select 1 as emp_id, 'Barry' as first_name, 'Williams' as last_name, '2016-04-20 15:41:00' as create_date from test.Employee where \$CONDITIONS" \
--target-dir /user/<username>/Employee  \
--split-by emp_id \
-m 1

In the sqoop command select ... has data for only one row. So, only one row should be inserted.

Result of the sqoop command:

When I ran following command:

hdfs dfs -cat /user/<username>/Employee/part-m-00000

I got this:

1,Barry,Williams,2016-04-20 15:41:00
1,Barry,Williams,2016-04-20 15:41:00
1,Barry,Williams,2016-04-20 15:41:00
1,Barry,Williams,2016-04-20 15:41:00

Questions:

1) Why were four rows inserted instead of one?
2) Is it because there were four rows in the table when the `sqoop` command ran? 
3) Is this a bug?

Thanks in advance.

Upvotes: 0

Views: 1347

Answers (4)

Samir Anand
Samir Anand

Reputation: 47

No, it's not a bug. You are querying in wrong way.
You need to add LIMIT to your sql query. Updated query will look like following:

sqoop --options-file import.txt \
--query "select 1 as emp_id, 'Barry' as first_name, 'Williams' as last_name, '2016-04-20 15:41:00' as create_date from test.Employee  LIMIT 1 where \$CONDITIONS" \
--target-dir /user/<username>/Employee  \
--split-by emp_id \
-m 1

Upvotes: 2

user6509886
user6509886

Reputation: 1

Sqoop is working fine. Try to run this query against the database and you will see the output will be equal to the number of rows in that table.

Upvotes: 0

Ashok
Ashok

Reputation: 75

I dont know why you are getting 4 records. I am getting 1 record only in my system. Please add Limit 1 at the end of select... query after WHERE $CONDITIONS and see. hope this may work

Upvotes: 0

gkc123
gkc123

Reputation: 522

I am not sure if this is a bug but this is interesting, I never tried to execute sqoop command that way.

--split-by the specified column (primary key) is used by sqoop to split work units.

-m 1 is forcing sqoop to use only 1 mapper.

You have a free form query import, based on the query sqoop should create only 1 row. My assumption is that you have both --split-by & -m 1 options/arguments passed to the scoop; maybe --split-by takes precedence over -m. Usually sqoop executes with 4 mappers only job when -m are not specified, I guess each mapper created 1 row of hardcoded fields in the sql statement.

Try your sqoop command without the --split-by argument.

Upvotes: 0

Related Questions