Reputation: 17
I am new to Python and am having trouble with something that seems conceptually very simple. I've read a number of SO posts but still can't solve my problem(s).
I have a function that converts Amazon reviews to JSON format. Each review becomes a single JSON object. I would like to compile all reviews into a single DataFrame, with the JSON keys as columns and each review in a row.
There are a large number of reviews, each formatted like so:
{
"product/productId": "B00006HAXW",
"product/title": "Winnie the Pooh",
"product/price": "unknown",
"review/userId": "A1RSDE90N6RSZF",
"review/profileName": "piglet",
"review/helpfulness": "9/9",
"review/score": "5.0",
"review/time": "1042502400",
"review/summary": "Love this book",
"review/text" : "Exciting stories about highly intelligent creatures, very inspiring!"
}
How can I compile all reviews into a pandas dataframe? I'm having two separate problems:
How do I compile all reviews in one object? Currently, the output is generated like so:
for e in parse("reviews.txt.gz"):
    print json.dumps(e)
I tried creating an empty list and using append:
for e in parse("reviews.txt.gz"):
    revs = []
    revs = revs.append(json.dumps(e))
but that does not work: print revs prints out
None
None
None
When I call pd.read_json on a single review formatted as above, it returns "If using all scalar values, you must pass an index". Does this mean I do not have valid JSON data?
Upvotes: 1
Views: 4334
Reputation: 19104
You don't need to call json.dumps() on the data, as this returns a string and you can pass Python objects to pandas directly.
Your for loop should look like
revs = []
for e in parse("reviews.txt.gz"):
    # list.append works in place and returns None, so do not reassign revs
    revs.append(e)
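But unless parse is a generator (i.e. uses the yield keyword), you can just set revs = parse("reviews.txt.gz"). If it is a generator, revs = list(parse("reviews.txt.gz")) collects the results without an explicit loop.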
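To illustrate why reassigning the result of append produced None, and how the reviews can be collected, here is a minimal sketch. parse_stub is a hypothetical stand-in for the asker's parse function (assumed here to be a generator of dictionaries), and the two reviews are made-up sample data.

```python
def parse_stub():
    # Hypothetical stand-in for parse("reviews.txt.gz"): yields one dict per review.
    yield {"review/score": "5.0", "review/summary": "Love this book"}
    yield {"review/score": "4.0", "review/summary": "A fun read"}

# list.append mutates the list in place and returns None,
# so assigning its result back to the variable loses the list.
scratch = []
returned = scratch.append("first")
print(returned)  # None
print(scratch)   # ['first']

# Correct accumulation: create the list once, append without reassigning.
revs = []
for e in parse_stub():
    revs.append(e)

# Shortcut with the same result when the source is a generator.
revs_shortcut = list(parse_stub())
print(revs == revs_shortcut)  # True
```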
pd.read_json attempts to parse the JSON as a DataFrame. A single review is a flat object whose values are all scalars, so pandas cannot infer a row index from it and raises this error. It does not mean your JSON is invalid.
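The error can be reproduced without any file by handing a flat dictionary straight to the DataFrame constructor, which is a simplification of what happens after the JSON text has been decoded. The sketch below uses a shortened, made-up version of the review from the question, and shows that wrapping the dictionary in a list yields a one-row frame.

```python
import pandas as pd

# Shortened sample review: every value is a scalar (a plain string).
review = {
    "product/productId": "B00006HAXW",
    "review/score": "5.0",
    "review/summary": "Love this book",
}

# A dict of scalars gives pandas no row index to work with.
try:
    pd.DataFrame(review)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Wrapping the dict in a list makes it one record, i.e. one row.
one_row = pd.DataFrame([review])
print(one_row.shape)  # (1, 3)
```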
is now a list of strings (ie. your parse function returns json representations of the data), you can call
df = pd.read_json(revs)
Otherwise, if revs is now a list of dictionaries (i.e. your parse function has already interpreted the JSON and returns dictionaries of the data), you can call
df = pd.DataFrame(revs)
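Here is a minimal sketch of building the frame in both cases, using two made-up sample reviews in place of the real parse output. For the list-of-strings case it decodes each string with json.loads before constructing the frame. The final numeric conversion is an optional extra, since every value in the sample format arrives as a string.

```python
import json
import pandas as pd

# Case 1: parse yields dictionaries (sample data, not real reviews).
dict_revs = [
    {"product/productId": "B00006HAXW", "review/score": "5.0", "review/summary": "Love this book"},
    {"product/productId": "B00006HAXW", "review/score": "4.0", "review/summary": "A fun read"},
]
df_from_dicts = pd.DataFrame(dict_revs)

# Case 2: parse yields JSON strings; decode each one back into a dict first.
string_revs = [json.dumps(r) for r in dict_revs]
df_from_strings = pd.DataFrame([json.loads(s) for s in string_revs])

print(df_from_dicts.shape)                    # (2, 3)
print(df_from_dicts.equals(df_from_strings))  # True

# Optional: scores are stored as strings, so convert them for numeric work.
scores = df_from_dicts["review/score"].astype(float)
print(scores.mean())  # 4.5
```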
Upvotes: 1