Dogukan Yılmaz

Reputation: 556

Parsing unstructured JSON into CSV

I have yearly application data for different apps in JSON format. There are 10 different JSON files for each application, and I am trying to merge them into a single CSV. Let me first show you the data structure:

[{"date": "2017-10-23", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5,  "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538,  "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]

When I parse them into a pandas DataFrame, I get something like this:

date         downloads  end         data
2017-10-23   15358985   2017-10-23  {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}
2017-10-22   12778233   2017-10-22  {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}

Please note that not every version is downloaded every day. How can I create a separate column for each version of the application? If a version was not downloaded on a particular day, the cell can be left blank or filled with NaN.

Upvotes: 2

Views: 1098

Answers (1)

jezrael

Reputation: 862901

I think you need the DataFrame constructor with reindex to add rows for the missing dates:

import pandas as pd

# Sample data; the first date is changed to 2017-10-25 so that there is a gap to fill
j = [{"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5,  "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538,  "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]

# Build the frame from the list of records and use the dates as a DatetimeIndex
df = pd.DataFrame(j).set_index('date')
df.index = pd.to_datetime(df.index)

# Reindex over the full daily range so that missing dates show up as NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)
                                                         data   downloads  \
2017-10-22  {'2.6.4.1-signed': 8, '2.99.0.1857beta': 4, '2...  12778233.0   
2017-10-23                                                NaN         NaN   
2017-10-24                                                NaN         NaN   
2017-10-25  {'2.7.2.4151-beta': 1, '1.0.1': 268, '2.9.0.42...  15358985.0   

                   end  
2017-10-22  2017-10-22  
2017-10-23         NaN  
2017-10-24         NaN  
2017-10-25  2017-10-23  
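
If the goal is one column per version, the dicts in the data column can be expanded before reindexing. This is only a minimal sketch, assuming a small made-up sample (the `records` list and the names `versions` and `wide` are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's data
records = [
    {"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23",
     "data": {"1.0.1": 268, "1.0.2": 715}},
    {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22",
     "data": {"2.4.1": 842, "1.0.1": 5}},
]

df = pd.DataFrame(records).set_index('date')
df.index = pd.to_datetime(df.index)

# Expand each dict in 'data' into its own columns;
# a version that is missing on a given day becomes NaN
versions = pd.DataFrame(df['data'].tolist(), index=df.index)

# Replace the dict column with the expanded version columns
wide = df.drop(columns='data').join(versions)

# Add rows for calendar days that have no record at all
wide = wide.sort_index()
wide = wide.reindex(pd.date_range(start=wide.index.min(), end=wide.index.max()))
print(wide)
```

The expansion is done before reindex on purpose: after reindexing, the data column contains NaN for the added dates, and those are not dicts that can be expanded.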

Alternatively, json_normalize flattens the nested data dict into columns directly, but if the JSON records have different keys you get a lot of NaN values:

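# Flatten the nested 'data' dict into 'data.<version>' columns
# (pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize)
df = pd.json_normalize(j).set_index('date')
df.index = pd.to_datetime(df.index)

# Add the missing dates as NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)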
            data.1.0.1  data.1.0.2  data.2.2.3.1-signed  data.2.3.1.1-signed  \
2017-10-22         NaN         NaN                  NaN                  3.0   
2017-10-23         NaN         NaN                  NaN                  NaN   
2017-10-24         NaN         NaN                  NaN                  NaN   
2017-10-25       268.0       715.0               9292.0                  NaN   

            data.2.4.1  data.2.6.10  data.2.6.4.1-signed  \
2017-10-22       842.0      11538.0                  8.0   
2017-10-23         NaN          NaN                  NaN   
2017-10-24         NaN          NaN                  NaN   
2017-10-25         NaN          NaN                  NaN   

            data.2.7.2.4151-beta  data.2.7.3.4196-beta  data.2.7.3.4198-beta  \
2017-10-22                   NaN                   5.0                   4.0   
2017-10-23                   NaN                   NaN                   NaN   
2017-10-24                   NaN                   NaN                   NaN   
2017-10-25                   1.0                   7.0                   NaN   

            data.2.7.3.4215-beta  data.2.9.0.4250-beta  data.2.99.0.1857beta  \
2017-10-22                   NaN                   NaN                   4.0   
2017-10-23                   NaN                   NaN                   NaN   
2017-10-24                   NaN                   NaN                   NaN   
2017-10-25                   2.0                   1.0                   NaN   

            data.2.99.0.1872beta   downloads         end  
2017-10-22                  12.0  12778233.0  2017-10-22  
2017-10-23                   NaN         NaN         NaN  
2017-10-24                   NaN         NaN         NaN  
2017-10-25                   NaN  15358985.0  2017-10-23  
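
To get from several JSON files to a single CSV, the same flattening could be applied per file and the results concatenated. The following is a rough sketch rather than a tested solution for the real data: the function name `json_files_to_frame`, the `source_file` column, and the file names in the usage note are all assumptions made for illustration, and it assumes pandas >= 1.0 for `pd.json_normalize`.

```python
import json
import pandas as pd

def json_files_to_frame(paths):
    """Load several JSON files of daily records into one wide DataFrame."""
    frames = []
    for path in paths:
        with open(path) as fh:
            records = json.load(fh)
        # Flatten the nested 'data' dicts into 'data.<version>' columns
        frame = pd.json_normalize(records)
        # Hypothetical extra column recording which file each row came from
        frame['source_file'] = str(path)
        frames.append(frame)

    # concat aligns on column names; versions absent from a file become NaN
    combined = pd.concat(frames, ignore_index=True, sort=False)

    # Drop the 'data.' prefix so the columns are plain version strings
    combined.columns = [c[len('data.'):] if c.startswith('data.') else c
                        for c in combined.columns]

    combined['date'] = pd.to_datetime(combined['date'])
    return combined.sort_values(['source_file', 'date']).reset_index(drop=True)
```

With placeholder file names, usage might look like `json_files_to_frame(['app1.json', 'app2.json']).to_csv('merged.csv', index=False)`. Cells for versions that were not downloaded on a given day are written as empty fields in the CSV.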

Upvotes: 2
