LoneWanderer

Reputation: 3341

Flatten DataFrame nested list/array with extra index keys (for time series)

I have a DataFrame structured as follows. (It is a result of JSON normalization)

mydf

id    colA    colB    ...    colArray
foo   a1      b1             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
bar   a2      b2             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
fooz  a3      b3             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
barz  a4      b4             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]

I would like to convert these data in order to use them as time series. My desired output would be something like:

id    colA    colB    ...    date               data1    data2 ... data n
foo   a1      b1             '1st timestamp'   'flex'   0.1
foo   a1      b1             '...'   
...
foo   a1      b1             'last_timestamp'   
bar   a2      b2             '1st timestamp'   'zorg'
bar   a2      b2             '...'   
...   
bar   a2      b2             'last_timestamp'   
fooz  a3      b3             '...'   
fooz  a3      b3             '...'   
...
fooz  a3      b3             '...'   
etc.

That would allow me to plot/analyze time series based on tuples such as [foo, a1, b1].
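For illustration, once the data is in that shape (say in a frame called tidy, a made-up name), selecting and plotting one series by its key tuple could look like:

one = tidy.set_index(['id', 'colA', 'colB']).loc[('foo', 'a1', 'b1')]
one.plot(x='date', y='data2')  # one time series per identifier tuple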

To me, this looks very similar to Flatten nested pandas dataframe, but the accepted answer there is frustrating: the JSON/dict data is not really processed to produce a DataFrame with the correct data.


Does anyone have any advice on how to achieve this?


First approach

Use the following, which is close to what I want:

import pandas as pd
from pandas import json_normalize  # pd.json_normalize in pandas >= 1.0

tmpdf = pd.DataFrame(mydf['colArray'].tolist())  # one column per array position
json_normalize(tmpdf[0])  # only flattens column 0, i.e. each row's first element

But there are 2 issues:

  1. I lost the ['id', 'colA', 'colB'] tuple that I would like to use as a unique identifier.
  2. I need to perform the operation for each row of my tmpdf (see the sketch just after this list).
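A rough sketch of that per-row version which keeps the identifiers, assuming pd.json_normalize (top-level since pandas 1.0); pieces and flat are made-up names:

import pandas as pd

pieces = []
for _, row in mydf.iterrows():
    part = pd.json_normalize(row['colArray'])  # flatten this row's array of dicts
    for col in ['id', 'colA', 'colB']:
        part[col] = row[col]                   # carry the identifier along
    pieces.append(part)
flat = pd.concat(pieces, ignore_index=True)

Row-wise iteration is slow on large frames, though, which is why I am looking for a more vectorized approach.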

Second method

Based on Accessing nested JSON data as dataframes in Pandas

pd.concat(pd.DataFrame.from_dict(array) for array in mydf['colArray'])

It gives me a dataframe with all my arrays flattened and the correct column names, but I lost the corresponding keys (['id', 'colA', 'colB']). I feel this is the correct approach, but I can't figure out how to keep the indexing columns (so that I can filter each resulting time series by them); a sketch of one way follows below.

Too bad there is no "json_melt" function.
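For the record, one sketch that seems to keep the keys with this approach: pass the identifying tuples to concat via its keys and names parameters (assuming ['id', 'colA', 'colB'] uniquely identifies each row; keyed and flat are made-up names):

keyed = pd.concat(
    (pd.DataFrame(arr) for arr in mydf['colArray']),
    keys=list(mydf[['id', 'colA', 'colB']].itertuples(index=False, name=None)),
    names=['id', 'colA', 'colB'],
)
flat = keyed.reset_index(level=-1, drop=True).reset_index()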

Third method

Based on this question: Flatten nested pandas dataframe. I can preserve my indexing columns, but the array elements are still JSON dicts, indexed as columns [0, 1, 2, ...]. I will have trouble dealing with the variable lengths (lots of NA values for the higher column indexes); the sketch below melts that wide layout back into rows.
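A sketch of melting that wide layout back into rows (more or less the "json_melt" I wished for above): stack drops the NA padding from the variable-length arrays. pd.MultiIndex.from_frame needs pandas >= 0.24; wide, stacked and flat are made-up names.

wide = pd.DataFrame(
    mydf['colArray'].tolist(),
    index=pd.MultiIndex.from_frame(mydf[['id', 'colA', 'colB']]),
)
stacked = wide.stack()                              # Series of dicts; NA padding is dropped
stacked = stacked.reset_index(level=-1, drop=True)  # drop the [0, 1, 2, ...] position level
flat = pd.DataFrame(stacked.tolist(), index=stacked.index).reset_index()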


Bibliography:

Create a Pandas DataFrame from deeply nested JSON: the solution is based on processing the original JSON, whereas I would like to do this on an existing DataFrame.

Accessing nested JSON data as dataframes in Pandas: this is pretty close to what I want.

Flatten nested pandas dataframe: the result looks like my first attempt, but the underlying JSON data is not really "matrixed" into the dataframe.

A rather complex and not satisfying approach.

EDIT: This question is the same, but at the time of asking I could not find it via search. Linking it here for future reference.

Upvotes: 3

Views: 4583

Answers (1)

jezrael

Reputation: 862691

Use a dictionary comprehension with pop to extract the original column, and concat to create a MultiIndex:

df = pd.concat({k: pd.DataFrame(array) for k, array in mydf.pop('colArray').items()})

An alternative is to use the keys parameter:

df = pd.concat([pd.DataFrame(array) for array in mydf.pop('colArray')], keys=mydf.index)

Then remove the second level, so it is possible to join with the original DataFrame:

df = df.reset_index(level=1, drop=True).join(mydf).reset_index(drop=True)

Sample:

mydf = pd.DataFrame({
    'id': ['foo', 'bar', 'fooz', 'barz'],
    'colA': ['a1', 'a2', 'a3', 'a4'],
    'colB': ['b1', 'b2', 'b3', 'b4'],
    'colArray': [
        [{'date': 's', 'data1': 't', 'data2': 0.1},
         {'date': 'd', 'data1': 'r', 'data2': 0.8}],
        [{'date': 'd', 'data1': 'y', 'data2': 0.1}],
        [{'date': 'g', 'data1': 'u', 'data2': 0.1}],
        [{'date': 'h', 'data1': 'i', 'data2': 0.1}],
    ],
})
print (mydf)
     id colA colB                                           colArray
0   foo   a1   b1  [{'date': 's', 'data1': 't', 'data2': 0.1}, {'...
1   bar   a2   b2        [{'date': 'd', 'data1': 'y', 'data2': 0.1}]
2  fooz   a3   b3        [{'date': 'g', 'data1': 'u', 'data2': 0.1}]
3  barz   a4   b4        [{'date': 'h', 'data1': 'i', 'data2': 0.1}]

df = pd.concat({k: pd.DataFrame(array) for k, array in mydf.pop('colArray').items()})
print (df)
    data1  data2 date
0 0     t    0.1    s
  1     r    0.8    d
1 0     y    0.1    d
2 0     u    0.1    g
3 0     i    0.1    h

df = df.reset_index(level=1, drop=True).join(mydf).reset_index(drop=True)
print (df)
  data1  data2 date    id colA colB
0     t    0.1    s   foo   a1   b1
1     r    0.8    d   foo   a1   b1
2     y    0.1    d   bar   a2   b2
3     u    0.1    g  fooz   a3   b3
4     i    0.1    h  barz   a4   b4
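On pandas 0.25+ an alternative sketch is explode for the list column, then expanding the flat dicts into columns (starting again from the original mydf, since pop above removed colArray; use json_normalize instead of the DataFrame constructor if the dicts are nested):

exploded = mydf.explode('colArray').reset_index(drop=True)
df = pd.concat([exploded.drop(columns='colArray'),
                pd.DataFrame(exploded['colArray'].tolist())], axis=1)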

Upvotes: 3
