Data format and database choices for Spark/Hadoop

Question

I am working with structured data (one value per field, the same fields in every row) that I have to store in a NoSQL environment, with Spark as the analysis tool on top of Hadoop. I am not sure which storage format to use: I was considering JSON or CSV. Which would you recommend, and why? I don't have enough experience in this field to decide properly.
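To make the comparison concrete, here is a minimal sketch of the same rows written in both formats, using only the Python standard library. The field names (`sensor_id`, `temperature`) and values are made up for illustration and are not my real data. It only shows the structural difference: CSV states the field names once in a header, while line-delimited JSON repeats them in every record but keeps value types.

```python
import csv
import io
import json

# Hypothetical rows: the same fields in every row, one value per field.
rows = [
    {"sensor_id": "s1", "temperature": 18.0},
    {"sensor_id": "s2", "temperature": 21.5},
    {"sensor_id": "s3", "temperature": 25.0},
]
fields = ["sensor_id", "temperature"]

# CSV: field names appear once, in the header line.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=fields, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buf.getvalue()

# JSON Lines: one object per line, field names repeated in each record.
json_text = "".join(json.dumps(row) + "\n" for row in rows)

# Reading back: CSV yields strings only, JSON preserves the float type.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = [json.loads(line) for line in json_text.splitlines()]

print("CSV characters: ", len(csv_text))
print("JSON characters:", len(json_text))
print("CSV value type: ", type(csv_rows[0]["temperature"]).__name__)
print("JSON value type:", type(json_rows[0]["temperature"]).__name__)
```

So for data shaped like mine, CSV looks more compact but loses the types, and JSON is more verbose but self-describing. I don't know which of these matters more in practice.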

Second question: I have to analyse this data, which is stored in HDFS. As far as I know, I have two ways to query it before the analysis:

Read and filter the data directly. This can be done with Spark, for example:
```
# sqlCtxt is an existing SQLContext; path_data is the HDFS path to the JSON files
data = sqlCtxt.read.json(path_data)
```
Use HBase or Hive to run a proper query first, and then process the result.

I don't know what the standard way of doing all this is and, above all, which approach is fastest. Thank you in advance!
