Nested lists in DataFrame column: how to carry out calculations on individual values?

Question

I am trying to carrying out calculations on individual values that are stored in a nested list stored in a pandas DataFrame. My issue is on how to access these individual values.

I am working from a data set available here: https://datadryad.org/stash/dataset/doi:10.5061/dryad.h505v

I have imported the .json file in a pandas DataFrame and the elastic constants are stored in the column 'elastic_tensor'.

import pandas as pd
df = pd.read_json(workdir+"ec.json")
df['elastic_tensor'].head()

Out:

0    [[311.33514638650246, 144.45092552856926, 126....
1    [[306.93357350984974, 88.02634955100905, 105.6...
2    [[569.5291276937579, 157.8517489654999, 157.85...
3    [[69.28798774976904, 34.7875015216915, 37.3877...
4    [[349.3767766177825, 186.67131003104407, 176.4...
Name: elastic_tensor, dtype: object

In order to access the individual values, what I have done is expand the nested lists once (as I could not find a way to use .extend() to flatten the nested list):

df1 = pd.DataFrame(df["elastic_tensor"].to_list() , columns=['c'+str(j) for j in range(1,7)])

Note: I have named the columns c1..c6 as the elastic constants in the end shall be called cij with i and j from 1 to 6.

Then I have expanded each of these columns in turns (as I could not find the way to do a loop):

dfc1 = pd.DataFrame(df1["c1"].to_list() , columns=['c1'+str(j) for j in range(1,7)])
dfc2 = pd.DataFrame(df1["c2"].to_list() , columns=['c2'+str(j) for j in range(1,7)])
dfc3 = pd.DataFrame(df1["c3"].to_list() , columns=['c3'+str(j) for j in range(1,7)])
dfc4 = pd.DataFrame(df1["c4"].to_list() , columns=['c4'+str(j) for j in range(1,7)])
dfc5 = pd.DataFrame(df1["c5"].to_list() , columns=['c5'+str(j) for j in range(1,7)])
dfc6 = pd.DataFrame(df1["c6"].to_list() , columns=['c6'+str(j) for j in range(1,7)])

before merging them

data_frames = [dfc1, dfc2, dfc3, dfc4, dfc5, dfc6]
df_merged = pd.DataFrame().join(data_frames, how="outer")

which gives me a DataFrame with columns containing the individual cij values: https://i.sstatic.net/odraQ.png

I can now carry out arithmetic operations on these individual values and add a column in the initial "df" dataframe with the results, but there must be a better way of doing it (especially if the matrices are large). Any idea?

Rob Raymond · Accepted Answer

approach using apply(pd.Series) to expand a list into columns
using stack() and unstack() generate multi-index columns that are zero-indexed values into 2D list
flatten multi-index to match your stated requirement (one-indexed instead of zero indexed)

import json
from pathlib import Path
# file downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.h505v
with open(Path.cwd().joinpath("ec.json")) as f: js = json.load(f)
df = pd.json_normalize(js)
# expand first dimension, put it into row index, expand second dimension, make multi-index columns
dfet = df["elastic_tensor"].apply(pd.Series).stack().apply(pd.Series).unstack()
# flatten multi-index columns, index from 1, instead of standard 0
dfet.columns = [f"c{i+1}{j+1}" for i,j in dfet.columns.to_flat_index()]

head(5)

	c11	c12	c13	c14	c15	c16	c21	c22	c23	c24	c25	c26	c31	c32	c33	c34	c35	c36	c41	c42	c43	c44	c51	c52	c53	c55	c56	c61	c62	c63	c65	c66
0	311.335	144.451	126.176	0	-0.110347	0	144.451	311.32	126.169	0	-0.112161	0	126.176	126.169	332.185	0	-0.107541	0	0	0	0	98.9182	-0.110347	-0.112161	-0.107541	98.921	0	0	0	0	0	103.339
1	306.934	88.0263	105.696	2.53622	-0.568262	-0.188934	88.0263	298.869	101.79	-1.43474	-0.608261	-0.226253	105.696	101.79	398.441	0.350166	-0.577829	-0.232358	2.53622	-1.43474	0.350166	75.3104	-0.568262	-0.608261	-0.577829	75.5826	1.92806	-0.188934	-0.226253	-0.232358	1.92806	105.685
2	569.529	157.852	157.851	0	0	0	157.852	569.53	157.852	0	0	0	157.851	157.852	569.53	0	0	0	0	0	0	94.8801	0	0	0	94.88	0	0	0	0	0	94.8801
3	69.288	34.7875	37.3877	0	0	0	34.7875	78.1379	40.6047	0	0	0	37.3877	40.6047	70.1326	0	0	0	0	0	0	19.8954	0	0	0	4.75803	0	0	0	0	0	30.4095
4	349.377	186.671	176.476	0	0	0	186.671	415.51	213.834	0	0	0	176.476	213.834	407.479	0	0	0	0	0	0	120.112	0	0	0	125.443	0	0	0	0	0	74.9078

numpy approach

a = np.dstack(df["elastic_tensor"])
pd.DataFrame(a.reshape((a.shape[0]*a.shape[1], a.shape[2])).T, 
             columns=[f"c{i+1}{j+1}" for i in range(a.shape[0]) for j in range(a.shape[1])])

Nested lists in DataFrame column: how to carry out calculations on individual values?

Answers (1)

head(5)

numpy approach

Related Questions