AJ AJ
AJ AJ

Reputation: 99

Reading a JSON File into a dataframe WITHOUT using the json module

I have a large json file that I need to read into a pandas dataframe without using the json module. Here is a link to the file melbourne_bike_share.json. I didn't know what to cut out to make a minimal example.

Every way I try to get from reading the file to a dataframe, I get the same error:

ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

I have tried reading it straight in:

import pandas as pd
mbs = pd.read_json('Melbourne_bike_share.json')

and making sure my string to be read by pd.read_json() is correct:

with open("Melbourne_bike_share.json", encoding="utf8") as mbs_json:
    mbs_string = mbs_json.readlines()
    mbs_string = [line.rstrip() for line in mbs_string]
    mbs_string = ''.join(mbs_string)

mbs = pd.read_json('Melbourne_bike_share.json')

But still get the same ValueError. I cannot find what is causing this error, what it really means, or pretty much any asked and answered questions involving reading json files that don't just suggest using the json module, which I cannot do.

I am new to python and json files. From what I gather, the next step after being able to read the json file is to flatten it:

from pandas.io.json import json_normalize    
df = json_normalize(mbs)

after which, I will have my dataframe and can manipulate that.

Post Edited to show the expected first line of the resultant dataframe. First line is the column headers (default indexes until I figure out how to pull the column headers from the metadata in the json file). I spaced the column headers 0-5 to align with their values, after that, the format here puts them on a new row. sorry about that. hope you can see that they should all align. Second line is the first row.

    0   1                                       2   3           4       5    6  7   8   9   10  11  12  13  14
0   155 7C09387D-9E6C-4B42-9041-9A98B88F54BB    155 1428899388  880594  1453945520  880594  {\n "invalidCells" : {\n "27624917" : "22/...   2   Harbour Town - Docklands Dve - Docklands    60000   9   14  1453985105  [{"address":"","city":"","state":"","zip":""},..

Any help would be appreciated.

Upvotes: 0

Views: 2250

Answers (1)

It_is_Chris
It_is_Chris

Reputation: 14063

I think you are looking for something like this:

import pandas as pd

df = pd.read_json('Melbourne_bike_share.json', typ='series')
pd.DataFrame(data=df['data'])

0   1   2   3   4   5   6   7   8   9   10  11  12  13  14
0   155 7C09387D-9E6C-4B42-9041-9A98B88F54BB    155 1428899388  880594  1453945520  880594  {\n "invalidCells" : {\n "27624917" : "22/...   2   Harbour Town - Docklands Dve - Docklands    60000   9   14  1453985105  [{"address":"","city":"","state":"","zip":""},...
1   156 52739A59-E034-436B-A613-E7A5F62448C0    156 1428899388  880594  1453945520  880594  {\n "invalidCells" : {\n "27624917" : "22/...   4   Federation Square - Flinders St / Swanston St ...   60001   11  11  1453985105  [{"address":"","city":"","state":"","zip":""},...
2   157 7EFB5219-2764-47CE-A497-B6D872823BBE    157 1428899388  880594  1453945520  880594  {\n "invalidCells" : {\n "27624917" : "22/...   5   Plum Garland Reserve - Beaconsfield Pde - Albe...   60002   16  1   1453985105  [{"address":"","city":"","state":"","zip":""},...
3   158 4B09D743-FFEE-4185-B968-A7866E45FE0B    158 1428899388  880594  1453945520  880594  {\n "invalidCells" : {\n "27624917" : "22/...   6   State Library - Swanston St / Little Lonsdale ...   60003   9   2   1453985105  [{"address":"","city":"","state":"","zip":""},...
4   159 C5493850-5714-40E2-9A67-DA443A4F64F8    159 1428899388  880594  1453945520  880594  {\n "invalidCells" : {\n "27624917" : "22/...   7   Bourke Street Mall - 205 Bourke St - City   60004   10  1   1453985105  [{"address":"","city":"","state":"","zip":""},...

if you want to add the column names then do something like this:

import pandas as pd

mbs = pd.read_json('Melbourne_bike_share.json', typ='series')
mbs_columns = mbs['meta']['view']['columns']
names = [item['name'] for item in mbs_columns]
df = pd.DataFrame(data=mbs['data'],columns=names)

Upvotes: 1

Related Questions