Blaubaer

Reputation: 664

Best practice: how to handle data records with changing "schema" / "columns"

This is a best-practice question.

Our setup is a Hadoop cluster storing (log) data in HDFS. We receive the data in CSV format, one file per day. Running MR jobs in Hadoop on these files is fine, as long as the "schema" of the file, especially the number of columns, does not change.

However, we are facing the problem that the log records we want to analyze eventually change, in the sense that columns might be added or removed. I was wondering if some of you would be willing to share your best practices for this type of situation. The best option we can think of at the moment is to store the data not as CSV but in JSON format; however, this would increase (at least double) the required storage space. We also came across Apache Avro and Apache Parquet, and have just started looking into them.
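From what we have read so far, Avro seems to handle exactly this kind of evolution through separate writer and reader schemas: old files keep the schema they were written with, and a newer reader schema can add columns with default values (or drop columns) without breaking existing jobs. Below is a minimal sketch using the plain Avro Java API; the `LogEntry` record and its fields are made-up placeholders rather than our actual log layout.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class AvroEvolutionSketch {
    public static void main(String[] args) throws IOException {
        // "Old" log schema: two columns, as originally delivered.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"LogEntry\",\"fields\":["
          + "{\"name\":\"timestamp\",\"type\":\"long\"},"
          + "{\"name\":\"message\",\"type\":\"string\"}]}");

        // "New" log schema: one added column with a default value, so data
        // written under the old schema can still be read.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"LogEntry\",\"fields\":["
          + "{\"name\":\"timestamp\",\"type\":\"long\"},"
          + "{\"name\":\"message\",\"type\":\"string\"},"
          + "{\"name\":\"userId\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        // Write a single record with the old schema (binary-encoded,
        // no field names stored per record).
        GenericRecord oldRecord = new GenericData.Record(writerSchema);
        oldRecord.put("timestamp", 1430000000000L);
        oldRecord.put("message", "user logged in");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(writerSchema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(oldRecord, encoder);
        encoder.flush();

        // Read it back with the new schema: the missing column is filled with
        // its default instead of the job failing on a column-count mismatch.
        DatumReader<GenericRecord> reader =
            new GenericDatumReader<>(writerSchema, readerSchema);
        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord evolved = reader.read(null, decoder);
        System.out.println(evolved);
        // -> {"timestamp": 1430000000000, "message": "user logged in", "userId": "unknown"}
    }
}
```

If our understanding is correct, Avro container files also store the schema only once in the file header and binary-encode the records, so the storage overhead should stay well below per-record JSON; Parquet apparently offers similar support for added columns (read back as null for old files) plus columnar storage.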

Any ideas and comments on this issue are more than welcome.

Upvotes: 1

Views: 681

Answers (1)

KrazyGautam

Reputation: 2692

Use Thrift, and use Elephant Bird (Twitter's library) for the corresponding file input/output formats.

Upvotes: 1
