Bangyou

Reputation: 9816

A better approach to store, write, and read a big dataset of meteorological data

I am looking for a better way to store, write, and read meteorological data (about 30 GB in raw text format).

Currently I am using the NetCDF file format to store weather records. In this NetCDF file, I have 3 dimensions: time, climate variables, and locations. But the dimension order is the key constraint for my tasks (see below).

The first task is to update weather records every day for about 3000 weather stations. Dimension order (time, var, name) provides the best performance for writing, as the new data are simply appended at the end of the NetCDF file.

The second task is to read all daily weather records for one station to perform the analysis. Dimension order (name, var, time) provides the best performance for reading, as all records of one site are stored together.

The two tasks demand conflicting NetCDF file designs: the dimension order that gives the best performance for one task gives the worst performance for the other.
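To illustrate, the two access patterns look roughly like this with the (time, var, name) order (the file name, variable name, and sizes below are only placeholders for my real setup):

import numpy as np
from netCDF4 import Dataset

# create a small example file with the (time, var, name) order
ds = Dataset('weather.nc', 'w')
ds.createDimension('time', None)          # unlimited, so new days can be appended
ds.createDimension('var', 10)
ds.createDimension('name', 3000)
ds.createVariable('data', 'f8', ('time', 'var', 'name'))
ds.close()

# Task 1: append one day of records for all stations (cheap in this order,
# the new slice just goes at the end of the file)
ds = Dataset('weather.nc', 'a')
nt = len(ds.dimensions['time'])
ds.variables['data'][nt, :, :] = np.random.rand(10, 3000)
ds.close()

# Task 2: read the full history of one station (slow in this order, since that
# station's values are scattered across the whole file)
ds = Dataset('weather.nc', 'r')
series = ds.variables['data'][:, :, 0]    # every timestep for station 0
ds.close()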

My question is whether there is an alternative method/software/data format to store, write, and read my datasets that gives good performance for both tasks. As I have to repeat the two steps every day and the data analysis is time consuming, I need the best way to minimize I/O.

Thanks for any suggestions. Please let me know if my question is not clear.

Upvotes: 0

Views: 1090

Answers (1)

kakk11

Reputation: 918

OK, what you need is chunking. I created a small Python script to test it; without chunking, it basically confirms your observation that access is slow along one dimension. I tested with 3000 stations, 10 variables per station, and 10000 timesteps. I put stations and variables into the same dimension for testing, but it should give similar results in the 3D case if you really need that. My test output without chunking:

File chunking type: None
Variable shape: (30000, 10000)
Total time, file creation: 13.665503025054932
Average time for adding one measurement time: 0.00136328568459 0.00148195505142 0.0018851685524
Read all timeseries one by one with single file open
Average read time per station/variable: 0.524109539986

And with chunking:

File chunking type: [100, 100]
Variable shape: (30000, 10000)
Total time, file creation: 18.610711812973022
Average time for adding one measurement time: 0.00185681316853 0.00168470859528 0.00213300466537
Read all timeseries one by one with single file open
Average read time per station/variable: 0.000948731899261

What you can see is that chunking increases the write time (file creation took about a third longer here), but dramatically improves the read time. I did not try to optimise the chunk sizes; I just tested that chunking works in the right direction. Feel free to ask if the code is not clear or if you are not familiar with Python.

# -*- coding: utf-8 -*-
from time import time
import numpy as np
from netCDF4 import Dataset

test_dataset_name='test_dataset.nc4'
num_stations=3000 
num_vars=10 
chunks=None
#chunks=[100,100]    # uncomment to test with chunking

def create_dataset():
    ff=Dataset(test_dataset_name,'w')
    ff.createDimension('Time',None)
    ff.createDimension('Station_variable',num_stations*num_vars)
    if chunks:
        var1=ff.createVariable('TimeSeries','f8',('Station_variable','Time'),chunksizes=chunks)
    else:
        var1=ff.createVariable('TimeSeries','f8',('Station_variable','Time'))
    return ff

def add_data(ff,timedim):
    # write one new timestep: here only the first 1000 station/variable rows get values
    var1=ff.variables['TimeSeries']
    var1[0:1000,timedim]=timedim*np.ones((1000),'f8')

def dataset_close(inds):
    inds.close()

## CREATE DATA FILE    
time_start=time()
time1=[]
time2=[]
time3=[]
time4=[]
testds=create_dataset()
dataset_close(testds)
for i in range(10000):
    time1.append(time())
    ff=Dataset(test_dataset_name,'a')
    time2.append(time())
    add_data(ff,i)
    time3.append(time())
    ff.sync()
    ff.close()
    time4.append(time())
time_end=time()

time1=np.array(time1)
time2=np.array(time2)
time3=np.array(time3)
time4=np.array(time4)

## READ ALL STATION-VARIABLE COMBINATIONS AS ONE TIMESERIES
ff=Dataset(test_dataset_name,'r')
## PRINT DATA FILE CREATION SUMMARY
print("File chunking type:",chunks)
print("Variable shape:",ff.variables['TimeSeries'][:].shape)
print("Total time, file creation:", time_end-time_start)
print("Average time for adding one measurement time: ",np.mean(time4-    time1), np.mean(time4[:100]-time1[:100]),np.mean(time4[-100:]- time1[-100:]))
print("Read all timeseries one by one with single file open")
time_rstart=[]
time_rend=[]
for i in range(0,ff.variables['TimeSeries'][:].shape[0],int(ff.variables['TimeSeries'][:].shape[0]/100)):
    time_rstart.append(time())
    dataline=ff.variables['TimeSeries'][i,:]
    time_rend.append(time())
time_rstart=np.array(time_rstart)
time_rend=np.array(time_rend)
print("Average read time per station/variable: ",np.mean(time_rend-  time_rstart))

Upvotes: 2
