Reputation: 8162
I intend to build a RESTful service which will return a custom text format. Given my very large volumes of data, XML/JSON is too verbose. I'm looking for a row based text format.
CSV is an obvious candidate. However, I'm wondering whether there is something better out there. The only ones I've found through a bit of research are CTX and Fielded Text.
I'm looking for a format which offers the following:
Fielded Text looks pretty good, and I could definitely write a specification myself, but I'm curious to know what others have done, given that this must be a very old problem. It's surprising that there isn't a better standard out there.
What suggestions do you have?
Upvotes: 13
Views: 10984
Reputation: 171
Looking through the existing answers, most struck me as a bit dated. Especially in terms of 'big data', noteworthy alternatives to CSV include:
ORC : 'Optimised Row Columnar' uses columnar storage, useful in Python/Pandas. Originated in Hive, optimised by Hortonworks. Schema is in the footer. The Wikipedia entry is currently quite terse (https://en.wikipedia.org/wiki/Apache_ORC), but Apache has a lot of detail.
Parquet : Similarly column-based, with similar compression. Often used with Cloudera Impala.
Avro : from Apache Hadoop. Row-based, but uses a JSON schema. Less capable support in Pandas. Often found in Apache Kafka clusters.
All are splittable, all are inscrutable to people, all describe their content with a schema, and all work with Hadoop. The column-based formats are considered best where cumulated data are read often; with multiple writes, Avro may be more suited. See e.g. https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
Compression of the column formats can use SNAPPY (faster) or GZIP (slower but more compression).
You may also want to look into Protocol Buffers, Pickle (Python-specific) and Feather (for fast communication between Python and R).
Upvotes: 3
Reputation: 3839
I have been thinking about this problem for a while. I came up with a simple format that could work very well for your use case: JTable.
{
"header": ["Column1", "Column2", "Column3"],
"rows" : [
["aaa", "xxx", 1],
["bbb", "yyy", 2],
["ccc", "zzz", 3]
]
}
If you wish, you can find a complete specification of the JTable format, with details and resources. But it is pretty self-explanatory, and any programmer would know how to handle it. All you really need to say is that it is JSON.
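As a rough illustration of how a client might consume the JTable shape above, here is a minimal Python sketch using only the standard library. The function name `jtable_to_records` and the length check are assumptions made for this example; they are not taken from the JTable specification.

```python
import json

# Sample payload in the JTable shape shown above.
payload = """
{
  "header": ["Column1", "Column2", "Column3"],
  "rows": [
    ["aaa", "xxx", 1],
    ["bbb", "yyy", 2],
    ["ccc", "zzz", 3]
  ]
}
"""


def jtable_to_records(text):
    """Parse a JTable document and return one dict per row.

    Hypothetical helper: the name and the validation rule are
    illustrative only, not part of any published specification.
    """
    table = json.loads(text)
    header = table["header"]
    records = []
    for row in table["rows"]:
        # Reject rows that do not line up with the header.
        if len(row) != len(header):
            raise ValueError("row length does not match header length")
        records.append(dict(zip(header, row)))
    return records


records = jtable_to_records(payload)
print(records[1])
```

Because the column names appear once in the header rather than in every row, the payload stays close to CSV in size while still being parseable by any JSON library.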
Upvotes: 2
Reputation: 42959
You could try YAML; its overhead is relatively small compared to formats such as XML or JSON.
Examples here: http://www.yaml.org/
Surprisingly, the website's text itself is YAML.
Upvotes: 4
Reputation: 19635
I'm sure you've already considered this, but I'm a fan of tab-delimited files (\t between fields, newline at the end of each row).
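For what it's worth, tab-delimited output needs no custom parser in most languages. The sketch below uses Python's standard `csv` module with a tab delimiter; the sample rows are made up for illustration. Note that the `csv` module handles a field containing a tab by quoting it, which is one convention among several, and that every value comes back as a string.

```python
import csv
import io

# Made-up sample data; the third row contains an embedded tab.
rows = [
    ["id", "name", "note"],
    [1, "alpha", "plain text"],
    [2, "beta", "contains\ta tab"],
]

# Write the rows as tab-delimited text, one row per line.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
encoded = buffer.getvalue()

# Read the text back; fields with an embedded tab were quoted on
# output, so they survive the round trip intact.
decoded = list(csv.reader(io.StringIO(encoded), delimiter="\t"))
print(decoded[2])
```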
Upvotes: 6
Reputation: 49564
I would say that since CSV is the standard, and since everyone under the sun can parse it, use it.
If I were in your situation, I would take the bandwidth hit and use GZIP+XML, just because it's so darn easy to use.
And, on that note, you could always require that your users support GZIP and just send it as XML/JSON, since that should do a pretty good job of removing the redundancy across the wire.
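To get a feel for how much of the verbosity GZIP removes, here is a small Python sketch that compresses a repetitive JSON payload with the standard `gzip` module. The records are synthetic, so the ratio it prints is only indicative; real data will compress differently. In an actual service the compression would normally be negotiated through the HTTP `Accept-Encoding` and `Content-Encoding` headers rather than done by hand.

```python
import gzip
import json

# Synthetic records with repeated keys, the kind of redundancy
# that makes JSON verbose.
records = [
    {"id": i, "name": f"user{i}", "active": i % 2 == 0}
    for i in range(1000)
]

raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

# Decompress and parse to confirm the round trip is lossless.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```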
Upvotes: 4