TheLeache
TheLeache

Reputation: 1

Extract values from XArray's DataArray to column using indices

So, I'm doing something that is maybe a bit unorthodox, I have a number of 9-billion pixel raster maps based on the NLCD, and I want to get the values from these rasters for the pixels which have ever been built-up, which are about 500 million:

built_up_index = pandas.DataFrame(np.column_stack(np.where(unbuilt == 0)), columns = ["row", "column"]).sort_values(["row", "column"])

That piece of code above gives me a dataframe where one column is the row index and the other is the column index of all the pixels which show construction in any of the NLCD raster maps (unbuilt is the ones and zeros raster which contains that).

I want to use this to then read values from these NLCD maps and others, so that each pixel is a row and each column is a variable, say, its value in the NLCD 2001, then its value in 2004, 2006 and so on (As well as other indices I have calculated). So the dataframe would look as such:

|row | column | value_2001 | value_2004 | var3 | ...

(VALUES HERE)

I have tried the following thing:

test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[:,0]), 'x': np.array(built_up_frame.iloc[:,1])}, drop = True).to_dataset(name="var").to_dataframe()

which works if I take a subsample as such:

test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[0:10000,0]), 'x': np.array(built_up_frame.iloc[0:10000,1])}, drop = True).to_dataset(name="var").to_dataframe()

but it doesn't do what I want, because the length is squared, as it seems it's trying to create a 2-d array which it then flattens, when what I want is a vector containing the values of the pixels I subsampled.

I could obviously do this in a loop, pixel by pixel, but I imagine this would be extremely slow for 500 million values and there has to be a more efficient way.

Any advice here?

EDIT: In the end I gave up on using the index, because I get the impression Xarrays will only make an array of the same dimensions (about 161000 columns and 104000 rows) as my original dataset with a bunch of missing values, rather than creating a column vector with the values I want. I'm using np.extract:

def src_to_frame(src, unbuilt, varname):
    return pd.DataFrame(np.extract(unbuilt == 0, src), columns=[varname])

where src is the raster containing the variable of interest, unbuilt is the raster of the same size where 0s are the pixels that have ever been built, and varname is the name of the variable. It does what I want and fits in the RAM I have. Maybe not the most optimal, but it works!

Upvotes: 0

Views: 987

Answers (1)

Michael Delgado
Michael Delgado

Reputation: 15442

This looks like a good application for advanced indexing with DataArrays

sprawl_2001.isel(
    y=built_up_frame.iloc[0:10000,0].to_xarray(), 
    x=built_up_frame.iloc[0:10000,1].to_xarray(),
).to_dataset(name="var").to_dataframe()

Upvotes: 0

Related Questions