Reputation: 1036
If I'm writing CSV-style files out of a system to be consumed by Hadoop, what is the best column separator to use within the file? I have tried Ctrl+A, but it's a pain IMO because other programs don't necessarily show it, e.g. I might view the file using vi, Notepad, a web browser, or Excel. Comma is a pain because the data might also contain commas. I was thinking of standardising on tab. Is there a best practice for this with regard to Hadoop, or doesn't it matter? I have done a fair bit of searching and can't find much on this fairly basic question.
Upvotes: 0
Views: 654
Reputation: 18434
There are certainly tradeoffs to each. It really depends on what you care most about.
Commas- if you care about interoperability. Every tool works with CSV. Commas in the data are a pain only if the writing system doesn't escape properly, or the reading system doesn't respect the escaping (see the sketch after this list). Hive handles escaping correctly, as far as I know.
Tabs- if you care about interoperability and expect commas in the data but no tabs. You're slightly less likely to have tabs in the data, but it's also slightly less likely that any given tool supports TSV.
Ctrl+A- if you care only about Hadoop-ecosystem functionality. This has definitely become the de facto Hadoop standard, but Hadoop also easily supports commas and tabs. The upside is you usually don't have to care about escaping.
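To make the escaping point concrete, here is a minimal sketch using Python's standard csv module (the file names are just illustrative): the comma and tab variants rely on the writer's quoting, while the Ctrl+A variant relies on the delimiter never appearing in the data.

```python
import csv

rows = [["id", "note"], ["1", "hello, world"], ["2", "plain"]]

# Comma-separated: the writer quotes any field containing the delimiter,
# so "hello, world" round-trips safely as 1,"hello, world"
with open("data.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Tab-separated: same writer, different delimiter.
with open("data.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Ctrl+A (0x01), the common Hive default: no quoting at all, which is
# fine only because the data can never contain the delimiter itself.
with open("data.ctrl_a", "w", newline="") as f:
    csv.writer(f, delimiter="\x01", quoting=csv.QUOTE_NONE).writerows(rows)
```

The reading side has to make the matching choice: a reader that splits naively on commas will break the first file, while all three parse trivially as long as writer and reader agree on the dialect.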
In the end, I think it's usually a toss-up, assuming you're escaping correctly (and you should be!). There's no best practice. If you find yourself worrying a lot about this kind of thing, you might also want to step up to a more serious serialization format, like Avro, which is very well supported in the Hadoop world.
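If you do go the Avro route, the write side can be as simple as the sketch below. This uses the third-party fastavro library, and the schema, record values, and file name are all illustrative:

```python
# pip install fastavro  (third-party library, assumed available)
from fastavro import parse_schema, writer

schema = parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "note", "type": "string"},
    ],
})

records = [
    {"id": 1, "note": "hello, world"},   # embedded comma: no problem
    {"id": 2, "note": "tab\there too"},  # embedded tab: no problem
]

# Avro files carry their own schema and length-prefix every value, so
# delimiter collisions and escaping stop being your problem entirely.
with open("data.avro", "wb") as f:
    writer(f, schema, records)
```

That is the real payoff of a binary format here: there is no delimiter to collide with, so the whole escaping question goes away.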
Upvotes: 2