Reputation: 556
I have yearly application data for different apps in JSON format. There are 10 different JSON files for each application, and I am trying to merge them into a single CSV. Let me first show you the data structure:
[{"date": "2017-10-23", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]
When I parse them into a pandas DataFrame I get something like this:
date downloads end data
2017-10-23 15358985 2017-10-23 {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}
2017-10-22 12778233 2017-10-22 {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}
Please note that not every version is downloaded every day. How could I create a column for each version of the application? If a version is not downloaded on a particular day, the cell could be left blank or filled with NaN.
Upvotes: 2
Views: 1098
Reputation: 862901
I think you need the DataFrame constructor with reindex to add the missing dates as rows:
import pandas as pd
j = [{"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]
df = pd.DataFrame(j).set_index('date')
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)
data downloads \
2017-10-22 {'2.6.4.1-signed': 8, '2.99.0.1857beta': 4, '2... 12778233.0
2017-10-23 NaN NaN
2017-10-24 NaN NaN
2017-10-25 {'2.7.2.4151-beta': 1, '1.0.1': 268, '2.9.0.42... 15358985.0
end
2017-10-22 2017-10-22
2017-10-23 NaN
2017-10-24 NaN
2017-10-25 2017-10-23
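In the output above the data column still holds one dict per row. A minimal sketch of one way to expand it into a column per version, assuming every entry in data is a plain dict and doing the expansion before reindex (the reindexed rows contain NaN instead of a dict). The sample j here is shortened from the question's data, and the names versions and wide are only illustrative:

```python
import pandas as pd

# Shortened sample in the same shape as the question's data
j = [
    {"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23",
     "data": {"1.0.1": 268, "1.0.2": 715}},
    {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22",
     "data": {"2.4.1": 842, "2.6.10": 11538}},
]

df = pd.DataFrame(j).set_index('date')
df.index = pd.to_datetime(df.index)

# One column per version; a version missing on a given day becomes NaN
versions = pd.DataFrame(df['data'].tolist(), index=df.index)

# Swap the dict column for the expanded columns, then add the missing dates
wide = df.drop(columns='data').join(versions)
wide = wide.reindex(pd.date_range(start=wide.index.min(), end=wide.index.max()))
print(wide)
```

This keeps the version names as column labels without any prefix, which may be more convenient if the frame is written straight to CSV.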
Solution with json_normalize, but if the JSON records have different keys you get a lot of NaN values:
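# json_normalize is available as pd.json_normalize in pandas 1.0 and later
df = pd.json_normalize(j).set_index('date')
df.index = pd.to_datetime(df.index)
# add the missing dates as rows filled with NaN
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)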
data.1.0.1 data.1.0.2 data.2.2.3.1-signed data.2.3.1.1-signed \
2017-10-22 NaN NaN NaN 3.0
2017-10-23 NaN NaN NaN NaN
2017-10-24 NaN NaN NaN NaN
2017-10-25 268.0 715.0 9292.0 NaN
data.2.4.1 data.2.6.10 data.2.6.4.1-signed \
2017-10-22 842.0 11538.0 8.0
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 NaN NaN NaN
data.2.7.2.4151-beta data.2.7.3.4196-beta data.2.7.3.4198-beta \
2017-10-22 NaN 5.0 4.0
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 1.0 7.0 NaN
data.2.7.3.4215-beta data.2.9.0.4250-beta data.2.99.0.1857beta \
2017-10-22 NaN NaN 4.0
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 2.0 1.0 NaN
data.2.99.0.1872beta downloads end
2017-10-22 12.0 12778233.0 2017-10-22
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 NaN 15358985.0 2017-10-23
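Since the goal in the question is a CSV, here is a rough sketch of how the json_normalize result might be written out, assuming pandas 1.0 or later (for pd.json_normalize) and that stripping the "data." prefix from the column names is wanted. The sample j is shortened from the question's data, and the in-memory buffer stands in for a real file path:

```python
import io
import pandas as pd

# Shortened sample in the same shape as the question's data
j = [
    {"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23",
     "data": {"1.0.1": 268, "1.0.2": 715}},
    {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22",
     "data": {"2.4.1": 842, "2.6.10": 11538}},
]

df = pd.json_normalize(j).set_index('date')

# json_normalize names nested keys "data.<version>"; drop that prefix
prefix = 'data.'
df.columns = [c[len(prefix):] if c.startswith(prefix) else c for c in df.columns]

# Written to a buffer for illustration; pass a file path to to_csv for a real file.
# Versions not downloaded on a day are NaN and end up as empty CSV fields.
buf = io.StringIO()
df.to_csv(buf)
csv_text = buf.getvalue()
print(csv_text)
```

For several JSON files, the same kind of frame could be built per file and the frames combined with pd.concat before writing.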
Upvotes: 2