Reputation: 3341
I have a DataFrame structured as follows. (It is a result of JSON normalization)
mydf
id colA colB ... colArray
foo a1 b1 [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
bar a2 b2 [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
fooz a3 b3 [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
barz a4 b4 [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
- date values are timestamps
- colArray entries have different lengths, but the exact same array element structure
- ['id', 'colA', 'colB'] is an example of columns that I would like to use as unique indexes

I would like to convert these data in order to use them as time series. My desired output would be something like:
id colA colB ... date data1 data2 ... data n
foo a1 b1 '1st timestamp' 'flex' 0.1
foo a1 b1 '...'
...
foo a1 b1 'last_timestamp'
bar a2 b2 '1st timestamp' 'zorg'
bar a2 b2 '...'
...
bar a2 b2 'last_timestamp'
fooz a3 b3 '...'
fooz a3 b3 '...'
...
fooz a3 b3 '...'
etc.
That would allow me to plot/analyze time series based on tuples such as [foo, a1, b1].
To me, this looks very similar to Flatten nested pandas dataframe, but the accepted answer is frustrating: the JSON/dict data is not really processed to produce a DataFrame with the correct data.
Does anyone have any advice on how to achieve this ?
My first attempt uses the following, which is close to what I want:
tmpdf = pd.DataFrame(mydf['colArray'].tolist())
pd.json_normalize(tmpdf[0])
But there are 2 issues; in particular, I lose the ['id', 'colA', 'colB'] tuple that I would like to use as a unique identifier.
Based on Accessing nested JSON data as dataframes in Pandas:
pd.concat(pd.DataFrame.from_dict(array) for array in mydf['colArray'])
It gives me a DataFrame with all my arrays flattened and the correct column names, but I lost the corresponding keys (['id', 'colA', 'colB']).
I feel this is the correct approach, but I can't figure out how to keep indexing columns (so that I can filter each resulting time series by indexing columns).
Too bad there is no "json_melt" function
Based on the question Flatten nested pandas dataframe, I can preserve my indexing columns, but the array elements are still JSON and indexed as [0, 1, 2, ...]. I will have trouble dealing with the variable lengths (lots of NAs for the higher column indexes).
Bibliography:
Create a Pandas DataFrame from deeply nested JSON. But the solution is based on processing the original JSON, whereas I would like to do this on an existing DataFrame.
Accessing nested JSON data as dataframes in Pandas This is pretty close to what I want.
Flatten nested pandas dataframe The result looks like my first attempt, but the underlying JSON data is not really "matrixed" into the dataframe.
A rather complex and not satisfying approach.
EDIT: This question is the same, but at the time of asking I could not find it via search. Noted here for future reference.
Upvotes: 3
Views: 4583
Reputation: 862691
Use a dictionary comprehension with pop to extract the original column, and concat to create a MultiIndex:
df = pd.concat({k: pd.DataFrame(array) for k, array in mydf.pop('colArray').items()})
An alternative is to use the keys parameter:
df = pd.concat([pd.DataFrame(array) for array in mydf.pop('colArray')], keys=mydf.index)
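As a quick check, here is a minimal sketch of the keys variant on a trimmed two-row sample (column names follow the question's structure; data values are illustrative):

```python
import pandas as pd

mydf = pd.DataFrame({'id': ['foo', 'bar'],
                     'colA': ['a1', 'a2'],
                     'colArray': [[{'date': 's', 'data2': 0.1},
                                   {'date': 'd', 'data2': 0.8}],
                                  [{'date': 'd', 'data2': 0.1}]]})

# pop removes colArray from mydf and returns it as a Series;
# keys=mydf.index labels each per-row DataFrame with its source row index
df = pd.concat([pd.DataFrame(array) for array in mydf.pop('colArray')],
               keys=mydf.index)
```

The first index level now carries the original row index, which is exactly what the later join relies on.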
Then remove the second level, so it is possible to join with the original DataFrame:
df = df.reset_index(level=1, drop=True).join(mydf).reset_index(drop=True)
Sample:
mydf = pd.DataFrame({'id': ['foo', 'bar', 'fooz', 'barz'],
                     'colA': ['a1', 'a2', 'a3', 'a4'],
                     'colB': ['b1', 'b2', 'b3', 'b4'],
                     'colArray': [[{'date': 's', 'data1': 't', 'data2': 0.1},
                                   {'date': 'd', 'data1': 'r', 'data2': 0.8}],
                                  [{'date': 'd', 'data1': 'y', 'data2': 0.1}],
                                  [{'date': 'g', 'data1': 'u', 'data2': 0.1}],
                                  [{'date': 'h', 'data1': 'i', 'data2': 0.1}]]})
print (mydf)
id colA colB colArray
0 foo a1 b1 [{'date': 's', 'data1': 't', 'data2': 0.1}, {'...
1 bar a2 b2 [{'date': 'd', 'data1': 'y', 'data2': 0.1}]
2 fooz a3 b3 [{'date': 'g', 'data1': 'u', 'data2': 0.1}]
3 barz a4 b4 [{'date': 'h', 'data1': 'i', 'data2': 0.1}]
df = pd.concat({k: pd.DataFrame(array) for k, array in mydf.pop('colArray').items()})
print (df)
data1 data2 date
0 0 t 0.1 s
1 r 0.8 d
1 0 y 0.1 d
2 0 u 0.1 g
3 0 i 0.1 h
df = df.reset_index(level=1, drop=True).join(mydf).reset_index(drop=True)
print (df)
data1 data2 date id colA colB
0 t 0.1 s foo a1 b1
1 r 0.8 d foo a1 b1
2 y 0.1 d bar a2 b2
3 u 0.1 g fooz a3 b3
4 i 0.1 h barz a4 b4
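On newer pandas (explode needs 0.25+, pd.json_normalize 1.0+), a sketch of an alternative that reaches the same rows with DataFrame.explode; column order may differ from the output above:

```python
import pandas as pd

mydf = pd.DataFrame({'id': ['foo', 'bar', 'fooz', 'barz'],
                     'colA': ['a1', 'a2', 'a3', 'a4'],
                     'colB': ['b1', 'b2', 'b3', 'b4'],
                     'colArray': [[{'date': 's', 'data1': 't', 'data2': 0.1},
                                   {'date': 'd', 'data1': 'r', 'data2': 0.8}],
                                  [{'date': 'd', 'data1': 'y', 'data2': 0.1}],
                                  [{'date': 'g', 'data1': 'u', 'data2': 0.1}],
                                  [{'date': 'h', 'data1': 'i', 'data2': 0.1}]]})

# explode repeats each row once per list element in colArray
exploded = mydf.explode('colArray').reset_index(drop=True)
# normalize the dicts into real columns and keep the key columns alongside
out = pd.concat([exploded.drop(columns='colArray'),
                 pd.json_normalize(exploded['colArray'].tolist())], axis=1)
```

This keeps the ['id', 'colA', 'colB'] columns on every row, so the resulting time series can be filtered directly, e.g. out[out['id'] == 'foo'].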
Upvotes: 3