Reputation: 8222
I have to log a lot of data, which will be analyzed later on. I am not analyzing it at the moment; later we will analyze it using Hadoop. How? I don't know yet. But the logs are already too large.
So I am looking for a format that takes less space and will be easy to analyze later on.
I thought of saving it as comma-separated values, but the log data may contain commas and newlines. Then I thought of encoding each line as JSON, or Base64-encoding each field, but I don't know whether we would still be able to analyze it easily later on.
What log format should I use that will be easy to analyze later on?
Upvotes: 2
Views: 115
Reputation: 8222
As suggested by one of the engineers from www.qubole.com, I used the CSV format, because querying terabytes of log files with Hadoop is more expensive (time-consuming) when the lines are JSON-encoded.
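To illustrate the size difference, here is a minimal PHP sketch (the record fields are made up for illustration) comparing the same record as one JSON line and one CSV line:
<?php
// A hypothetical log record; the field names are invented for illustration.
$record = ['ts' => '2013-05-01 12:00:00', 'user' => 42, 'action' => 'click', 'url' => '/home'];

// JSON repeats every key on every line.
$json = json_encode($record) . "\n";

// CSV stores only the values; the schema lives outside the file.
$fh = fopen('php://memory', 'r+');
fputcsv($fh, array_values($record));
rewind($fh);
$csv = stream_get_contents($fh);
fclose($fh);

printf("JSON: %d bytes, CSV: %d bytes\n", strlen($json), strlen($csv));
Over terabytes of lines, repeating the keys on every line adds up, and plain delimited text also loads directly into Hive's default delimited tables, whereas JSON lines need an extra SerDe or parsing step.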
Upvotes: 1
Reputation: 4363
As long as you generate your log statements with a well-structured format string, you should be able to usefully parse them later, likely with a regular expression.
JSON will bloat your log horribly and will not improve your ability to parse it. The only scenario where it might make sense is when you need to dump objects into your log.
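For instance, a minimal PHP sketch (the log layout and field names here are assumptions, not a standard):
<?php
// Write with a fixed, well-structured format string...
$line = sprintf("%s [%s] user=%d msg=%s\n", date('Y-m-d H:i:s'), 'INFO', 42, 'login ok');

// ...so that reading it back is a single regular expression.
$pattern = '/^(?<ts>\S+ \S+) \[(?<level>\w+)\] user=(?<user>\d+) msg=(?<msg>.*)$/';
if (preg_match($pattern, trim($line), $m)) {
    echo $m['level'], ' ', $m['user'], "\n"; // INFO 42
}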
Upvotes: 2
Reputation: 57723
CSV allows you to escape data like:
1,2,"value with, comma","value with
newline","value with "" quote"
1,2,"foo","bar","baz"
So commas and newlines should be no problem. Use fputcsv when writing to the file.
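For example (a minimal sketch; the field values are made up):
<?php
$fh = fopen('app.log.csv', 'a');
// fputcsv quotes fields that contain commas, quotes, or newlines automatically.
fputcsv($fh, ['1', '2', "value with, comma", "value with\nnewline", 'value with " quote']);
fclose($fh);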
CSV probably gets you the smallest file size, since the delimiter overhead is minimal.
If space is an issue, you can always gzip-compress the files.
Base64, by contrast, typically inflates data by about 33%.
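A minimal PHP sketch of both points (the file name is just an example):
<?php
// Write log lines straight into a gzip-compressed file.
$gz = gzopen('app.log.csv.gz', 'w9'); // compression level 9
gzwrite($gz, "1,2,\"foo\",\"bar\",\"baz\"\n");
gzclose($gz);

// For comparison, Base64 grows a payload by roughly a third:
echo strlen(base64_encode(str_repeat('x', 300))); // 400 (300 bytes -> 400 chars)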
Upvotes: 1