ZuLu

Reputation: 962

Simulating correlated multivariate data

I'm trying to generate synthetic realizations from historical hurricane data. A hurricane is parameterized in my problem by a set of descriptors (i.e. storm size, storm intensity, storm speed, and storm heading - all referenced to the values at the time the hurricane crosses some shoreline). The realizations will be used to make probabilistic forecasts of hurricane-generated flooding. The assumption is that the historical hurricane data comes from some underlying multivariate distribution. The idea is to draw additional samples from this underlying distribution (preserving moments, correlation, physical bounds such as positive storm size, etc).

I've implemented a nearest-neighbor Gaussian dispersion method adapted from a technique developed by Taylor and Thompson (published in Computational Statistics & Data Analysis, 1986). I'd like to see if there are better ways to do this.

Data sample (Gulf of Mexico hurricanes 1940-2005):

[image: Gulf of Mexico hurricanes 1940-2005]

import numpy
import scipy.spatial


def TT_alg(data_list, sample_size, num_neighbors=5, metric=2):
    """Taylor-Thompson nearest-neighbor Gaussian dispersion resampling."""
    data_array = numpy.array(data_list, dtype=float)

    # rescale each column (descriptor) to the interval [0, 1]
    bb = data_array.min(axis=0)
    aa = data_array.max(axis=0) - bb
    data_array_normed = (data_array - bb) / aa

    # set up the nearest-neighbor tree on the rescaled data
    tree = scipy.spatial.KDTree(data_array_normed)

    # perform the nearest-neighbor random walk
    samples = []
    for _ in range(sample_size):
        # pick a random historical point and find its k nearest neighbors
        seed = data_array_normed[numpy.random.randint(len(data_array_normed))]
        _, neighbor_idx = tree.query(seed, k=num_neighbors, p=metric)

        # draw one normal deviate per dimension, centered on the neighborhood
        # mean and scaled by the neighborhood standard deviation
        neighborhood = data_array_normed[neighbor_idx]
        x_mu = neighborhood.mean(axis=0)
        x_si = neighborhood.std(axis=0)
        samples.append(numpy.random.normal(x_mu, x_si))

    # map the simulated points back to the original scale
    samples_unnormed = numpy.array(samples) * aa + bb
    return samples_unnormed.tolist()
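
A typical call, with data_list holding one row of descriptors per historical storm (the sample size here is just illustrative):

    synthetic_storms = TT_alg(data_list, sample_size=1000, num_neighbors=5, metric=2)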

Example output for num_neighbors=5 and metric=2 (Euclidean distance):

[image: example simulation output]

Upvotes: 2

Views: 1581

Answers (1)

pjs

Reputation: 19855

Your data are almost certainly not Gaussian: speed, intensity, and size must all be positive, and size and intensity are clearly skewed. A log-normal distribution is plausible, so I'd recommend log-transforming your data before attempting distributional fits.

One way to try to capture the correlation structure (which is definitely present in your posted data!) would be to estimate the mean M and variance/covariance matrix V of the log-transformed data. Then decompose the variance/covariance matrix using Cholesky decomposition to get V = transpose(C) C. If Z is a vector of independent normals, then X = M + transpose(C) Z will be a vector of correlated normals with the desired mean and variance/covariance structure. Exponentiating the elements of X will yield your simulated results. The results should avoid artifacts such as the negative storm sizes visible in your last graph. See this paper for more details.
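
A minimal sketch of that recipe with NumPy, assuming data is a 2-D array with one row per historical storm and one strictly positive column per descriptor (the function name and arguments are just for illustration):

    import numpy

    def simulate_lognormal(data, sample_size):
        """Draw correlated lognormal samples via Cholesky factorization."""
        log_data = numpy.log(data)              # work in log space
        M = log_data.mean(axis=0)               # mean vector of the logs
        V = numpy.cov(log_data, rowvar=False)   # variance/covariance of the logs

        # NumPy returns the lower-triangular factor L with V = L L^T
        L = numpy.linalg.cholesky(V)

        # independent standard normals, one row per simulated storm
        Z = numpy.random.standard_normal((sample_size, len(M)))

        # correlated normals with mean M and covariance V, then exponentiate
        X = M + Z @ L.T
        return numpy.exp(X)

Note that numpy.linalg.cholesky returns the lower-triangular factor, so the multiplication is written with L.T rather than transpose(C) as above; the resulting covariance structure is the same.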

Upvotes: 1
