Reputation: 329
I am importing from Oracle using Sqoop1 (version 1.4.6, CDH 5.7.4).
Sqoop creates a temporary Parquet Hive Table, then I use Hive (beeline) to insert into
the destination table.
This two-step ingestion is needed because the target table has a different column-type mapping than the one Sqoop generates, so the INSERT INTO lets me convert the types on the fly. I am also compressing to Snappy on the fly. This works correctly. At the moment I am not controlling the type mapping on Sqoop's import command; it is too complicated in my particular case.
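For reference, the Hive (beeline) side of that second step looks roughly like the sketch below. It is only a minimal sketch, not my actual statement: the staging table sqoop_stage_notes, the destination table notes_dest, the columns, and the CASTs are all placeholders, and Snappy is enabled through the parquet.compression setting.

    -- sketch of the beeline step; table and column names are placeholders
    SET parquet.compression=SNAPPY;              -- compress the Parquet output with Snappy

    INSERT INTO TABLE notes_dest
    SELECT
        CAST(id     AS BIGINT)        AS id,     -- type conversion happens here
        CAST(amount AS DECIMAL(18,2)) AS amount,
        note
    FROM sqoop_stage_notes;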
In the beginning I was importing to flat files, but some fields contain characters that break my rows, such as newlines, carriage returns, and tabs. For this and other reasons I decided to move to Parquet (Avro unfortunately is not an option, since it is not supported by a tool we use).
I assumed that a binary format like Parquet would have no trouble handling those characters.
Turns out I was wrong.
A select * from table shows some corrupted rows, and after a lot of debugging I found that some fields had been split in two: a record appears truncated (compared to the Oracle source), and the rest of it shows up alone on another line.
Since I'm using a free-form query in Sqoop, the workaround was to replace the characters on extraction with Oracle's REPLACE function: REPLACE(REPLACE(REPLACE(note, chr(10), ' '), chr(13), ' '), chr(09), ' ') AS NOTE.
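For context, the free-form query passed to Sqoop looks roughly like the sketch below. The source table name and the id column are placeholders, not the real names; $CONDITIONS is the placeholder Sqoop requires in free-form queries.

    -- sketch of the free-form query given to Sqoop's --query option;
    -- notes_source and id are placeholders
    SELECT
        id,
        REPLACE(REPLACE(REPLACE(note, CHR(10), ' '),
                        CHR(13), ' '),
                CHR(9), ' ') AS note
    FROM notes_source
    WHERE $CONDITIONS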
Of course this is clearly the wrong approach: there can be other fields with dirty characters, and there can be other dirty characters besides these.
So the questions are: why doesn't Parquet protect me from these characters, and what is the proper way to handle them during the import instead of replacing them field by field?
Thanks
Upvotes: 1
Views: 1112
Reputation: 419
Answers:
You can strip those characters in Hive with regexp_replace(your_text_column, "\t|\n|\r\n", " "). This would prevent your client's "confusion". Additionally, you don't have to use Parquet for data containing such specific characters. I think that Hive's default field delimiter (\001) is enough for that: that non-printing character is very unlikely to appear in a text field.
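For example, applied during the Hive insert step (the table and column names below are just placeholders, not ones from your setup):

    -- sketch: clean the text while loading from the Sqoop staging table
    -- into the destination table; all names are placeholders
    INSERT INTO TABLE notes_dest
    SELECT
        id,
        regexp_replace(note, "\t|\n|\r\n", " ") AS note
    FROM sqoop_stage_notes;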
Upvotes: 1