Jonathan Myers

Reputation: 930

Does Avro- and Parquet-formatted data have to be written within a Hadoop infrastructure?

I've been researching the pros and cons of using Avro, Parquet, and other data formats for a project. If I am receiving input data from other groups of people who do not operate within Hadoop, will they be able to provide this input data in Avro/Parquet format? My reading so far on these formats has only been within the sphere of the Hadoop infrastructure, so I am wondering how difficult it would be for folks who just use Oracle/SQL to provide data in these formats.

Upvotes: 1

Views: 109

Answers (1)

Zoltan

Reputation: 3105

It is possible to use these formats without Hadoop, but the ease of doing so depends on the language binding.

For example, reading/writing Parquet files on standalone machines can be quite cumbersome with the Java language binding (which is even called parquet-mr, where mr stands for MapReduce), as it builds heavily on Hadoop classes. These are typically provided on the classpath of a Hadoop cluster, but are less readily available on standalone machines.

(While parquet-mr is mainly a Java library, it also contains some tools that users may want to run on their local machine. To work around this issue, the parquet-tools module of parquet-mr contains a compilation profile called local that packages Hadoop dependencies alongside the compiled tool. However, this only applies to parquet-tools and you have to compile it yourself to make a local build.)

The Python language binding, on the other hand, is very easy to set up and works fine on standalone machines as well. You can either use the high-level pandas interface or the underlying implementations, pyarrow and fastparquet, directly.
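As a minimal sketch of what that looks like on a plain machine with no Hadoop installed (file name, column names, and the choice of the pyarrow engine here are just illustrative):

    import pandas as pd

    # Build a small DataFrame from ordinary in-memory data
    # (this could just as well come from an Oracle/SQL query result).
    df = pd.DataFrame({
        "id": [1, 2, 3],
        "name": ["alice", "bob", "carol"],
    })

    # Write it out as Parquet; pandas delegates to pyarrow (or fastparquet)
    # under the hood, with no Hadoop classes or cluster involved.
    df.to_parquet("example.parquet", engine="pyarrow")

    # Read it back the same way.
    df_roundtrip = pd.read_parquet("example.parquet", engine="pyarrow")
    print(df_roundtrip)

The only prerequisites are pip-installable packages (pandas plus pyarrow or fastparquet), which is why this route tends to be the easiest for people outside a Hadoop environment.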

Upvotes: 3
