Omar

Reputation: 329

Dirty values in Parquet files when importing with Sqoop

I am importing from Oracle using Sqoop1 (version 1.4.6, CDH 5.7.4).
Sqoop creates a temporary Parquet Hive table, then I use Hive (beeline) to insert into the destination table.
This double-step ingestion is needed because the target table's column-type mapping differs from the one Sqoop generates, so the insert into lets me convert the types on the fly. I am also compressing to Snappy on the fly. This works correctly. At the moment I am not controlling the type mapping on Sqoop's import command; it is too complicated in my particular case.
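For reference, the two steps look roughly like this (connection details, table names, and casts are placeholders, not my real ones):

    # step 1: Sqoop import into a temporary Parquet Hive table
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username myuser \
      --password-file /user/me/.password \
      --query 'SELECT id, note FROM src_schema.src_table WHERE $CONDITIONS' \
      --split-by id \
      --target-dir /tmp/src_table_staging \
      --as-parquetfile \
      --hive-import \
      --hive-table staging.src_table

    -- step 2, in beeline: convert the column types and compress on the fly
    SET parquet.compression=SNAPPY;
    INSERT INTO TABLE target_db.dest_table
    SELECT CAST(id AS BIGINT),
           CAST(note AS STRING)
    FROM staging.src_table;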

In the beginning I was importing to flat files, but I have fields containing characters that break my rows, like newlines, carriage returns, tabs, and so on. For this and other reasons I decided to move to Parquet (Avro unfortunately is not an option, since it is not supported by a tool we use).
In my mind, a binary format like Parquet wouldn't have problems handling those characters.

Turns out I was wrong.
A select * from the table showed some corrupted lines, and after a lot of debugging I found that some fields were split into two parts: a record appears truncated (compared to the Oracle source), and the rest of it shows up alone on another line. Since I'm using a free-form query in Sqoop, my workaround was to replace the characters on extraction, using the replace function: REPLACE(REPLACE(REPLACE(note, chr(10), ' '), chr(13), ' '), chr(9), ' ') AS NOTE.
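The whole import command then looks something like this (again, all names are placeholders; note the escaped \$CONDITIONS because of the double quotes):

    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username myuser \
      --password-file /user/me/.password \
      --query "SELECT id,
                      REPLACE(REPLACE(REPLACE(note, chr(10), ' '), chr(13), ' '), chr(9), ' ') AS NOTE
               FROM src_schema.src_table
               WHERE \$CONDITIONS" \
      --split-by id \
      --target-dir /tmp/src_table_staging \
      --as-parquetfile \
      --hive-import \
      --hive-table staging.src_table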

Of course this is clearly the wrong approach, since other fields may contain dirty characters too, and there may be other dirty characters I haven't accounted for.

So the questions are:

  1. Am I wrong in thinking that Parquet (or Avro, or another binary format) would not care about characters inside fields?
  2. Am I doing something wrong?
  3. How can I prevent this kind of problem?

Thanks

Upvotes: 1

Views: 1112

Answers (1)

HakkiBuyukcengiz

Reputation: 419

Answers:

  1. You are wrong. Parquet is not affected by delimiter characters inside fields, such as newlines, tabs, etc.
  2. No. In fact, your data is stored in the Parquet file just the same as in Oracle. However, while you are printing the data to the screen, depending on your client (assuming you use the Hive client), you see the results as broken, because the server likely sends the data to the client in clear text.
  3. To prevent this, when you print the data you can use Hive's function: regexp_replace(your_text_column, "\t|\n|\r\n", " "). This prevents your client's "confusion".
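For example (table and column names are placeholders):

    -- the data stored in the Parquet file stays intact;
    -- the control characters are replaced only in the query output
    SELECT id,
           regexp_replace(note, "\t|\n|\r\n", " ") AS note
    FROM staging.src_table;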

Additionally, you don't have to use Parquet for data containing such special characters. I think Hive's default delimiter (\001) is enough for that: that non-printing character is very unlikely to appear in a text field.
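For example, a text table created without an explicit ROW FORMAT clause already separates fields with \001 (a sketch; table and column names are placeholders):

    CREATE TABLE mydb.notes_txt (
      id   BIGINT,
      note STRING
    )
    STORED AS TEXTFILE;
    -- with no ROW FORMAT clause, Hive uses \001 (Ctrl-A) as the field delimiter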

Upvotes: 1
