srmark

Reputation: 8162

Alternative to CSV?

I intend to build a RESTful service which will return a custom text format. Given my very large volumes of data, XML/JSON is too verbose. I'm looking for a row based text format.

CSV is an obvious candidate. I'm however wondering if there isn't something better out there. The only ones I've found through a bit of research are CTX and Fielded Text.

I'm looking for a format which offers the following:

Fielded text is looking pretty good and I could definitely build a specification myself, but I'm curious to know what others have done given that this must be a very old problem. It's surprising that there isn't a better standard out there.

What suggestions do you have?

Upvotes: 13

Views: 10984

Answers (5)

Jo van Schalkwyk

Reputation: 171

Looking through the existing answers, most struck me as a bit dated. Especially in terms of 'big data', noteworthy alternatives to CSV include:

  • ORC : 'Optimised Row Columnar' uses columnar storage, useful in Python/Pandas. Originated in Hive, optimised by Hortonworks. Schema is in the footer. The Wikipedia entry is currently quite terse https://en.wikipedia.org/wiki/Apache_ORC but Apache has a lot of detail.

  • Parquet : Similarly column-based, with similar compression. Often used with Cloudera Impala.

  • Avro : from Apache Hadoop. Row-based, but uses a JSON schema. Less capable support in Pandas. Often found in Apache Kafka clusters.

All are splittable, all are inscrutable to people, all describe their content with a schema, and all work with Hadoop. The column-based formats are considered best where accumulated data is read often; for workloads with many writes, Avro may be more suitable. See e.g. https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/

Compression of the column formats can use SNAPPY (faster) or GZIP (slower but more compression).

You may also want to look into Protocol Buffers, Pickle (Python-specific) and Feather (for fast communication between Python and R).
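Of those, Pickle ships with the Python standard library; a minimal sketch (the sample rows are invented for illustration):

```python
import pickle

# Some row-oriented sample data (invented)
rows = [("aaa", "xxx", 1), ("bbb", "yyy", 2)]

# Pickle is binary and Python-specific: compact and fast for
# Python-to-Python communication, but not a cross-language wire format
blob = pickle.dumps(rows)
restored = pickle.loads(blob)
```

The trade-off is the same one noted above for the big-data formats: the payload is opaque to people and to non-Python consumers.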

Upvotes: 3

fralau

Reputation: 3839

I have been thinking on that problem for a while. I came up with a simple format that could work very well for your use case: JTable.

 {
    "header": ["Column1", "Column2", "Column3"],
    "rows"  : [
                ["aaa", "xxx", 1],
                ["bbb", "yyy", 2],
                ["ccc", "zzz", 3]
              ]
  }

A complete specification of the JTable format, with details and resources, is available if you want it. But the format is pretty self-explanatory, and any programmer would know how to handle it; the only thing you really need to say is that it is JSON.
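Since a JTable document is just JSON, reading one takes a couple of lines of standard-library Python. A minimal sketch, using the example document above:

```python
import json

# The JTable document: column names in "header", data in "rows"
jtable = """
{
    "header": ["Column1", "Column2", "Column3"],
    "rows": [
        ["aaa", "xxx", 1],
        ["bbb", "yyy", 2],
        ["ccc", "zzz", 3]
    ]
}
"""

data = json.loads(jtable)

# Zip each row against the header to get one dict per record
records = [dict(zip(data["header"], row)) for row in data["rows"]]
print(records[0])  # {'Column1': 'aaa', 'Column2': 'xxx', 'Column3': 1}
```

Because the header appears once rather than being repeated per record, the format avoids most of the key-name redundancy of ordinary row-of-objects JSON.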

Upvotes: 2

SirDarius

Reputation: 42959

You could try YAML, its overhead is relatively small compared to formats such as XML or JSON.

Examples here: http://www.yaml.org/

Surprisingly, the website's text itself is YAML.

Upvotes: 4

Brian Driscoll

Reputation: 19635

I'm sure you've already considered this, but I'm a fan of tab-delimited files (\t between fields, newline at the end of each row).
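Python's standard-library csv module handles this layout directly; a short sketch (the field names and values are made up):

```python
import csv
import io

rows = [["id", "name", "score"], ["1", "Alice", "9.5"], ["2", "Bob", "7.2"]]

# Write tab-delimited: \t between fields, newline at the end of each row
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
tsv = buf.getvalue()

# Read it back with the same delimiter
parsed = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
```

Tab-delimited data also avoids most of CSV's quoting headaches, since tabs rarely appear inside field values.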

Upvotes: 6

John Gietzen

Reputation: 49564

I would say that since CSV is the standard, and since everyone under the sun can parse it, use it.

If I were in your situation, I would take the bandwidth hit and use GZIP+XML, just because it's so darn easy to use.

And, on that note, you could always require that your users support GZIP and just send it as XML/JSON, since that should do a pretty good job of removing the redundancy across the wire.
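The effect is easy to see with the standard library; a rough sketch using invented sample records:

```python
import gzip
import json

# Repetitive row-oriented JSON: the key names repeat in every record
records = [{"id": i, "name": "user", "active": True} for i in range(1000)]
payload = json.dumps(records).encode("utf-8")

# GZIP finds and removes the repeated key names across rows
compressed = gzip.compress(payload)

print(len(payload), len(compressed))
```

On repetitive structured data like this, the compressed payload comes out a small fraction of the original size, which is exactly the verbosity the question is worried about.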

Upvotes: 4
