ayush singhal

Reputation: 1939

working with big scientific data on Hadoop

I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop". The data I have consists of HDF files totaling over a terabyte. As far as I know, Hadoop expects text files as input for further processing (map-reduce tasks). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.

The other option is to find a way to use the raw HDF files directly in map-reduce programs. So far I have not been successful in finding any Java code that reads HDF files and extracts data from them. If somebody has a better idea of how to work with HDF files, I would really appreciate the help.

Thanks, Ayush

Upvotes: 3

Views: 4741

Answers (4)

wayi

Reputation: 523

SciMATE (http://www.cse.ohio-state.edu/~wayi/papers/SciMATE.pdf) is a good option. It is built on a variant of MapReduce that has been shown to run many scientific applications much more efficiently than Hadoop.

Upvotes: 1

Animesh Raj Jha

Reputation: 2724

If you cannot find any Java code but can do the job in another language, you can use Hadoop Streaming.
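To make the streaming route concrete, here is a minimal sketch of a streaming mapper in Python. The input format (`series_id<TAB>timestamp<TAB>value`, one record per line) is hypothetical, just to illustrate the contract: Hadoop Streaming feeds input lines on stdin and collects tab-separated key/value pairs from stdout.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch.
# Hypothetical input format: one "series_id<TAB>timestamp<TAB>value" per line.
# Hadoop Streaming pipes input splits to stdin and reads
# "key<TAB>value" pairs back from stdout.
import sys

def map_line(line):
    """Parse one record; return a (key, value) pair, or None for bad input."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None  # skip malformed records
    series_id, timestamp, value = parts
    # Key by series id so all points of one series reach the same reducer.
    return series_id, "%s,%s" % (timestamp, value)

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        kv = map_line(line)
        if kv is not None:
            stdout.write("%s\t%s\n" % kv)

if __name__ == "__main__":
    main()
```

You would then launch it with the streaming jar, roughly `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input ... -output ...` (exact paths depend on your installation).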

Upvotes: 1

Sagar

Reputation: 36

For your first option, you could use a conversion tool such as h5dump to dump the HDF files to text format. Alternatively, you can write a program that uses a Java library to read the HDF files and write the data out as text.
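As a sketch of that conversion step, here is one way it might look in Python with the h5py library (a Java program using the HDF Java libraries would follow the same pattern). The file and dataset names are made up for illustration; reading in chunks matters because the source files are over a terabyte.

```python
# Sketch: dump one HDF5 dataset to text, one value per line,
# assuming the h5py library is available.
# File and dataset names below are illustrative, not from the question.
import h5py

def hdf5_dataset_to_text(hdf_path, dataset_name, out_path):
    """Stream a 1-D dataset out as text without loading it all into memory."""
    with h5py.File(hdf_path, "r") as f, open(out_path, "w") as out:
        dset = f[dataset_name]
        chunk = 100000  # read in slices; the full dataset may not fit in RAM
        for start in range(0, dset.shape[0], chunk):
            for value in dset[start:start + chunk]:
                out.write("%s\n" % value)
```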

For your second option, SciHadoop is a good example of how to read scientific datasets from Hadoop. It uses the NetCDF-Java library to read NetCDF files. Hadoop does not support the POSIX API for file I/O, so SciHadoop uses an extra software layer to translate the POSIX calls made by the NetCDF-Java library into HDFS API calls. If SciHadoop does not already support HDF files, you might take the somewhat harder path of developing a similar solution yourself.

Upvotes: 2

Ümit
Ümit

Reputation: 17489

Here are some resources:

  • SciHadoop (uses NetCDF, but might already have been extended to HDF5).
  • You can use either JHDF5 or the lower-level official Java HDF5 interface to read data from any HDF5 file in the map-reduce task.

Upvotes: 3
