Manikandan Kannan

Reputation: 9014

Sqoop import - Source table schema change

Let's say there is a table called T1 with 100+ columns in a relational database. I use Sqoop to import this table into HDFS as CSV.

Now 10 more columns are added to table T1. If I import this data into HDFS again, the new data will have 10 more columns than before.

Questions:

  1. How does Sqoop order the columns being imported, so that the old and the new data (at least for the columns that existed before the change in T1) are at the right positions?

  2. Do new columns always get imported at the end?

  3. What if a column gets deleted? How should this be handled, i.e. how do the old data and the new data keep the same column positions?

Upvotes: 0

Views: 3329

Answers (1)

Durga Viswanath Gadiraju

Reputation: 3956

How does Sqoop order the columns being imported, so that the old and the new data (at least for the columns that existed before the change in T1) are at the right positions?

Hadoop-based tools do not enforce a schema when writing data to HDFS, and Sqoop is unaware of the columns of the data already in HDFS, so by default it will not try to update the old data with the new fields. For new data, the order depends entirely on how you write the sqoop import command. If you use --table without --columns, the columns are written in the order in which they are defined in the source table. If you use --query to supply a custom query, the order follows the column order of the SELECT clause in that query. If you do not want to list column names explicitly in the sqoop import command, consider creating a view on the source database and importing from that.
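As a rough sketch of the two options, the script below builds one import that uses --table with an explicit --columns list and one that uses --query. The JDBC URL, user name, column names, and target directories are all hypothetical, and the run helper only prints each command (a dry run) unless DRY_RUN=0 is set on a host where Sqoop is installed. The assumption here is that listing the columns explicitly keeps the output order stable even if the source table is altered later.

```shell
#!/bin/sh
# Hypothetical connection details and column list; adjust to your environment.
JDBC_URL="jdbc:mysql://dbhost/salesdb"
DB_USER="etl_user"
COLUMNS="id,name,created_at"   # explicit column set and order

# Dry run by default: print the command instead of executing it.
# Set DRY_RUN=0 to actually invoke Sqoop.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Option 1: --table with --columns, so the order does not depend on
# how the source table happens to be defined.
import_t1_by_table() {
  run sqoop import \
    --connect "$JDBC_URL" --username "$DB_USER" -P \
    --table T1 --columns "$COLUMNS" \
    --fields-terminated-by ',' \
    --target-dir /data/raw/t1_by_table
}

# Option 2: --query, where the output follows the SELECT list.
# Sqoop expects the literal $CONDITIONS token in the WHERE clause,
# and --split-by (or a single mapper) for a free-form query.
import_t1_by_query() {
  run sqoop import \
    --connect "$JDBC_URL" --username "$DB_USER" -P \
    --query "SELECT $COLUMNS FROM T1 WHERE \$CONDITIONS" \
    --split-by id \
    --fields-terminated-by ',' \
    --target-dir /data/raw/t1_by_query
}

# The two functions are alternatives; both are shown here only as a dry run.
import_t1_by_table
import_t1_by_query
```

With either form, columns added to T1 later do not appear in the output until you add them to the list yourself, so you decide where they go.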

Do new columns always get imported at the end?

Not necessarily, as explained above. Their position depends on the column order in the source table or in the query you pass to Sqoop.

What if a column gets deleted? How should this be handled, i.e. how do the old data and the new data keep the same column positions?

If columns are deleted, you will most likely have to reload the data or handle the difference at processing time according to rules you define. The better approach is to reload the data or to create a view on the source database.
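One possible processing-time rule, sketched below, is to rewrite rows from the new layout into the old layout and leave the deleted column empty. The layouts and the realign function are hypothetical, and the sketch assumes plain CSV with no quoted or embedded commas.

```shell
#!/bin/sh
# Hypothetical layouts: the email column was dropped from the source table.
OLD_LAYOUT="id,name,email,city"
NEW_LAYOUT="id,name,city"

# realign SRC_LAYOUT DST_LAYOUT
# Reads CSV rows in SRC_LAYOUT on stdin and writes them in DST_LAYOUT order.
# Columns that do not exist in SRC_LAYOUT are written as empty fields.
realign() {
  awk -F',' -v src="$1" -v dst="$2" '
    BEGIN {
      n = split(src, s, ",")
      for (i = 1; i <= n; i++) pos[s[i]] = i   # column name -> source position
      m = split(dst, d, ",")
    }
    {
      out = ""
      for (j = 1; j <= m; j++) {
        val = (d[j] in pos) ? $(pos[d[j]]) : ""
        out = out (j > 1 ? "," : "") val
      }
      print out
    }'
}

# A row imported after the column was dropped, rewritten to the old layout.
printf '7,Ann,Pune\n' | realign "$NEW_LAYOUT" "$OLD_LAYOUT"
```

The same function can go the other way (old layout to new layout) to drop the deleted column from previously imported files, which avoids a full reload when the old values are no longer needed.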

These are not limitations of Sqoop itself; they are standard problems that require a custom solution regardless of the technology you use. The problem is too generic, so a ready-made API for it may not be feasible.

Upvotes: 2
