saraw

Reputation: 17

Convert multiple JSON objects to a pandas DataFrame

I am new to Python and having trouble with something that seems conceptually very simple. I've read a number of SO posts but still can't solve my problem(s).

I have a function that converts Amazon reviews to JSON format. Each review becomes a single JSON object. I would like to compile all reviews into a single DataFrame, with the JSON keys as columns and one review per row.

There are a large number of reviews, each formatted like so:

{
"product/productId": "B00006HAXW",
"product/title": "Winnie the Pooh",
"product/price": "unknown",
"review/userId": "A1RSDE90N6RSZF",
"review/profileName": "piglet",
"review/helpfulness": "9/9",
"review/score": "5.0",
"review/time": "1042502400",
"review/summary": "Love this book", 
"review/text" : "Exciting stories about highly intelligent creatures, very inspiring!"
}

How can I compile all reviews into a pandas dataframe? I'm having two separate problems:

  1. How do I compile all reviews in one object? Currently, the output is generated like so:

    for e in parse("reviews.txt.gz"):
        print json.dumps(e)
    

I tried creating an empty list and using append:

    for e in parse("reviews.txt.gz"):
        revs = []
        revs = revs.append(json.dumps(e))

but that does not work - print revs prints out

None
None
None

  2. When I use pd.read_json on a single review formatted as above, it returns "If using all scalar values, you must pass an index". Does this mean I do not have valid JSON data?

Upvotes: 1

Views: 4334

Answers (1)

Alex

Reputation: 19104

  1. There is no need to call json.dumps() on the data: it returns a string, and pandas can work with Python objects directly.

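Your for loop should create the list once, before the loop, and call append without reassigning its result (list.append modifies the list in place and returns None):

    revs = []
    for e in parse("reviews.txt.gz"):
        revs.append(e)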

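If parse already returns a list, you can skip the loop and just set revs = parse("reviews.txt.gz"). If it is a generator (i.e. it uses the yield keyword), wrap the call instead: revs = list(parse("reviews.txt.gz")).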
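As a minimal sketch of the point about append: the parse function below is a hypothetical stand-in (the real one is not shown in the question) that yields one dict per review, and the second record is an illustrative placeholder.

```python
def parse(path):
    """Hypothetical stand-in for the asker's parse(): yields one dict per review."""
    # The real function presumably reads the gzipped file at `path`;
    # this sketch ignores it and yields two illustrative records.
    yield {"product/productId": "B00006HAXW", "review/score": "5.0"}
    yield {"product/productId": "B00006HAXW", "review/score": "4.0"}  # placeholder


# list.append mutates the list in place and returns None.
demo = []
result = demo.append("x")
print(result)  # None
print(demo)    # ['x']

# Collect every review: create the list once, before the loop.
revs = []
for e in parse("reviews.txt.gz"):
    revs.append(e)

# Equivalent shortcut that works for any iterable, generators included.
revs_alt = list(parse("reviews.txt.gz"))
```

Either way revs ends up as a plain list of dicts, which is the shape pandas handles most directly.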

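  2. pd.read_json attempts to parse the JSON as a DataFrame, so it expects data indexed by both row and column. A single flat object whose values are all scalars supplies column names but no row index, which is why pandas raises this error. The JSON itself is valid.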
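A small sketch of that error, assuming it is the same one the DataFrame constructor raises for a flat dict of scalars. The record is a subset of the sample review from the question, and the variable names are only illustrative.

```python
import pandas as pd

# One review as a flat dict of scalar values (subset of the question's sample).
record = {
    "product/productId": "B00006HAXW",
    "review/score": "5.0",
    "review/summary": "Love this book",
}

# Passing the bare dict gives pandas column names but no row labels.
message = None
try:
    pd.DataFrame(record)
except ValueError as err:
    message = str(err)
    print(message)

# Wrapping the record in a list turns it into a one-row table.
df_one = pd.DataFrame([record])
print(df_one.shape)  # (1, 3)
```

The same wrapping happens naturally once all reviews are collected in a list, so the error should not come up for the full data set.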

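So if revs is now a list of strings (i.e. your parse function returns JSON representations of the data), note that pd.read_json does not accept a list. Decode each string first (this needs import json) and build the DataFrame from the results:

    df = pd.DataFrame([json.loads(r) for r in revs])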

Otherwise, if revs is now a list of dictionaries (i.e. your parse function has already interpreted the JSON and returns dictionaries of the data), you can call

    df = pd.DataFrame(revs)
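Both cases can be sketched end to end as follows. This is only an illustration: the first record reuses values from the question, the second is a placeholder, and for the list-of-strings case each string is decoded with json.loads before the frame is built, which is one way to handle it.

```python
import json

import pandas as pd

# Two illustrative reviews; the second one is a placeholder.
revs = [
    {"product/productId": "B00006HAXW", "review/score": "5.0",
     "review/summary": "Love this book"},
    {"product/productId": "B00006HAXW", "review/score": "3.0",
     "review/summary": "placeholder summary"},
]

# Case 1: a list of dicts goes straight into the constructor.
# Keys become columns, each dict becomes a row.
df = pd.DataFrame(revs)

# Case 2: a list of JSON strings is decoded first, then handled as in case 1.
rev_strings = [json.dumps(r) for r in revs]
df_from_strings = pd.DataFrame([json.loads(s) for s in rev_strings])

# All values arrive as strings; convert numeric columns explicitly if needed.
df["review/score"] = df["review/score"].astype(float)

print(df.shape)  # (2, 3)
print(list(df.columns))
```

Converting review/score to float is optional, but it is usually needed before doing any arithmetic on the scores.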

Upvotes: 1
