Reputation: 6751
I have several machines with TBs of log data in a custom format that can be read with a C++ library. I want to upload all of the data to a Hadoop cluster (HDFS) while converting it to Parquet files.
This is an ongoing process (I will get more data every day), not a one-time effort.
What is the best way to do this performance-wise (i.e. efficiently)?
Is the Parquet C++ library as good as the Java one (in terms of updates, bugs, etc.)?
The solution should handle tens of TBs per day or even more in the future.
Log data arrives continuously and should be available immediately on the HDFS cluster.
Upvotes: 1
Views: 147
Reputation: 8826
Performance-wise, your best approach will be to gather the data in batches and then write out a new Parquet file per batch. If your data is received as single lines and you want to persist them immediately on HDFS, you could also write them out to a row-based format that supports single-line appends (e.g. Avro) and regularly run a job that compacts them into a single Parquet file.
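To illustrate the batching idea, here is a minimal sketch in plain C++ (standard library only). `LogRecord`, `BatchWriter` and the sink callback are hypothetical names made up for this example, not part of parquet-cpp; the sink is where you would write one Parquet file per batch, and that part is deliberately left abstract.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type standing in for one decoded log line.
struct LogRecord {
    long long timestamp_ms;
    std::string message;
};

// Collects records and hands each full batch to a sink callback.
// Assumption: in a real pipeline the sink would write one Parquet
// file per batch and upload it to HDFS; here it is left abstract.
class BatchWriter {
public:
    using Sink = std::function<void(const std::vector<LogRecord>&)>;

    BatchWriter(std::size_t batch_size, Sink sink)
        : batch_size_(batch_size), sink_(std::move(sink)) {
        buffer_.reserve(batch_size_);
    }

    // Buffer one record; flush automatically once the batch is full.
    void append(LogRecord record) {
        buffer_.push_back(std::move(record));
        if (buffer_.size() >= batch_size_) {
            flush();
        }
    }

    // Emit whatever is buffered (call at shutdown or on a timer).
    void flush() {
        if (buffer_.empty()) return;
        sink_(buffer_);
        buffer_.clear();
    }

private:
    std::size_t batch_size_;
    Sink sink_;
    std::vector<LogRecord> buffer_;
};
```

You would construct a `BatchWriter` with a batch size tuned to your target file size, call `append` for every record your reader library yields, and call `flush` on a timer or at shutdown so that a partially filled batch is not held back indefinitely. Larger batches give larger, more efficient Parquet files at the cost of higher latency before the data becomes visible.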
Library-wise, parquet-cpp is under much more active development at the moment than parquet-mr (the Java library). This is mainly because active parquet-cpp development (re-)started about 1.5 years ago (winter/spring 2016). So updates to the C++ library happen very quickly at the moment, while the Java library is very mature, as it has had a huge user base for quite some years. There are some features, like predicate pushdown, that are not yet implemented in parquet-cpp, but these are all on the read path, so they don't matter for writing.
We are now at a point with parquet-cpp where it already runs very stably in various production environments, so in the end, your choice between the C++ and Java libraries should mainly depend on your system environment. If all your code is currently running on the JVM, then use parquet-mr; otherwise, if you're a C++/Python/Ruby user, use parquet-cpp.
Disclaimer: I'm one of the parquet-cpp developers.
Upvotes: 1