Reputation: 1936
I am collecting a large amount of data which is most likely going to be a format as follows:
User 1: (a,o,x,y,z,t,h,u)
Where all the variables dynamically change with respect to time, except u - this is used to store the user name. What I am trying to understand since my background is not very intense in "big data", is when I end up with my array, it will be very large, something like 108000 x 3500, since I will be preforming analysis on each timestep, and graphing it, what would be an appropriate database to manage this in is what I am trying to determine. Since this is for scientific research I was looking at CDF and HDF5, and based on what I read here NASA I think I will want to use CDF. But is this the correct way to manage such data for speed and efficiency?
The final data set will have all the users as columns, and the rows will be timestamped, so my analysis program would read row by row to interpret the data. And make entries into the dataset. Maybe I should be looking at things like CouchDB and RDBMS, I just don't know a good place to start. Advice would be appreciated.
Upvotes: 5
Views: 1570
Reputation: 497
I have been using CDF for some similarly sized data and I think it should work nicely. You will need to keep a few things in mind though. Considering I don't really know the details of your project, this may or may not be helpful...
3GB of data is right around the file size limit for the older version of CDF, so make sure you are using an up-to-date library.
While 3GB isn't that much data, depending on how you read and write it, things may be slow going. Make sure you use the hyper read/write functions whenever possible.
CDF supports meta-data (called global/variable attributes) that can hold information such as username and data descriptions.
It is easy to break data up into multiple files. I would recommend using one file per user. This will mean that you can write the user name just once for the whole file as an attribute, rather than in each record.
You will need to create an extra variable called epoch. This is well defined timestamp for each record. I am not sure if the time stamp you have now would be appropriate, or if you will need to process it some, but it is something you need to think about. Also, the epoch variable needs to have a specific type assigned to it (epoch, epoch16, or TT2000). TT2000 is the most recent version which gives nanosecond precision and handles leap seconds, but most CDF readers that I have run into don't handle it well yet. If you don't need that kind of precision, I recommend epoch16 as that has been the standard for a while.
Hope this helps, if you go with CDF, feel free to bug me with any issues you hit.
Upvotes: 3
Reputation: 78316
This is an extended comment rather than a comprehensive answer ...
With respect, a dataset of size 108000*3500
doesn't really qualify as big data these days, not unless you've omitted a unit such as GB
. If it's just 108000*3500
bytes, that's only 3GB plus change. Any of the technologies you mention will cope with that with ease. I think you ought to make your choice on the basis of which approach will speed your development rather than speeding your execution.
But if you want further suggestions to consider, I suggest:
all of which have some traction in the academic big data community and are beginning to be used outside that community too.
Upvotes: 6