Reputation: 930
I've been researching the pros and cons of using Avro, Parquet, and other data formats for a project. If I am receiving input data from other groups of people who do not operate using Hadoop, will they be able to provide this input data in Avro/Parquet format? My reading so far on these formats has only been within the sphere of the Hadoop infrastructure, so I am wondering how difficult it would be for folks who just use Oracle/SQL to provide data in this format.
Upvotes: 1
Views: 109
Reputation: 3105
It is possible to use these formats without Hadoop, but the ease of doing so depends on the language binding.
For example, reading/writing Parquet files on standalone machines may be very cumbersome with the Java language binding (which is even called parquet-mr, where "mr" stands for MapReduce), as it builds heavily on Hadoop classes. These are typically provided on the classpath of a Hadoop cluster, but are less readily available on standalone machines.
(While parquet-mr is mainly a Java library, it also contains some tools that users may want to run on their local machine. To work around this issue, the parquet-tools module of parquet-mr contains a compilation profile called local that packages the Hadoop dependencies alongside the compiled tool. However, this only applies to parquet-tools, and you have to compile it yourself to get a local build.)
The Python language binding, on the other hand, is very easy to set up and works fine on standalone machines. You can either use the high-level pandas interface or work with the underlying implementations, pyarrow or fastparquet, directly, as in the sketch below.
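As a minimal sketch (the file name and column names here are just illustrative), this writes and reads a Parquet file locally via pandas, which delegates to pyarrow or fastparquet under the hood; no Hadoop installation is involved:

    import pandas as pd

    # Build a small example table
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

    # Write it to a local Parquet file (pandas picks pyarrow or fastparquet as the engine)
    df.to_parquet("example.parquet")

    # Read it back from the local file
    df2 = pd.read_parquet("example.parquet")
    print(df2)

A workflow like this is one way a group that only uses Oracle/SQL could hand over data as Parquet: export the query results to a DataFrame (or CSV) and convert them with a few lines of Python.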
Upvotes: 3