AhmedKamal2021

Reputation: 47

Reading a large JSON file with pandas

I am trying to load the book reviews from this page (https://nijianmo.github.io/amazon/index.html). I downloaded and extracted the file, but when I try to read it with pandas I get a memory error:

pd.read_json('path/Books_5.json',lines=True)

Smaller files from the same page load fine. I am doing sentiment analysis and need 250k reviews with scores of 4 or 5 and 250k reviews with scores of 1 or 2.

I tried the following to check each score and collect the review text into lists, so that I can build a DataFrame from them later:

with pd.read_json('path/Books_5.json', lines=True, chunksize=1) as reader:
    for chunk in reader:
        if chunk[chunk['overall'] > 3]:
            pos_revs.append(chunk['reviewText'])
        elif chunk[chunk['overall'] < 3]:
            neg_revs.append(chunk['reviewText'])
        if (len(pos_revs) == 250000) & (len(neg_revs) == 250000):
            break

but I got this error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I looked at similar questions, but their JSON files are not structured like mine. This is what mine looks like (screenshot not preserved).

Upvotes: 0

Views: 227

Answers (1)

BeRT2me

Reputation: 13241

Loading lines one at a time into DataFrames just to check their rating is incredibly inefficient. It's better to treat each line as a dictionary and build the Series at the end.

import json
import gzip
import pandas as pd

def parse(path):
    # Stream the gzipped JSON Lines file one record (dict) at a time.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for l in g:
            yield json.loads(l)

file = parse('Books_5.json.gz')
pos_revs = []
neg_revs = []
for line in file:
    # Stop once both lists are full; the loop also ends cleanly at end of file.
    if len(pos_revs) >= 250000 and len(neg_revs) >= 250000:
        break
    rating = line['overall']
    review = line.get('reviewText')
    if not review:
        continue  # skip records with missing or empty review text
    if rating > 3 and len(pos_revs) < 250000:
        pos_revs.append(review)
    elif rating < 3 and len(neg_revs) < 250000:
        neg_revs.append(review)

pos_revs = pd.Series(pos_revs)
neg_revs = pd.Series(neg_revs)

print(pos_revs)
print(neg_revs)

Output:

0         The King, the Mice and the Cheese by Nancy Gur...
1                                        The kids loved it!
2         My students (3 & 4 year olds) loved this book!...
3                                                   LOVE IT
4                                                    Great!
                                ...
249995               Great read. Dis t want to put it down.
249996                                     Love this series
249997    So I am one of those people who absolutely lov...
249998    I learned a great deal from this book. The Fre...
249999    Having already read Tuchman's book on the outb...
Length: 250000, dtype: object

0         Looking for a Louis Untermeyer book  from the ...
1         Completly boring!!! Yes it's a childerns book ...
2         I don't like Hillerman novels.  It was chosen ...
3         I have read many of the Hillerman books and en...
4         I really love Hillerman's books.  He is one of...
                                ...
249995    When I first started reading SUSPECT, I though...
249996    I really despised this book.  Sure it portrays...
249997    This is a bleak novel. The mindless violence t...
249998    Like the title says, this is not as good as Ch...
249999    Great concept. Predictable, poorly written sto...
Length: 250000, dtype: object
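
Since the goal is sentiment analysis, the two collections can then be merged into one labeled DataFrame. This is only a minimal sketch: the function name build_labeled_frame, the column names text and label, and the 1/0 encoding are my own assumptions, not something from the dataset.

```python
import pandas as pd

def build_labeled_frame(pos_revs, neg_revs, seed=0):
    """Combine positive and negative reviews into one shuffled, labeled frame.

    Hypothetical helper: 'text' and 'label' are illustrative column names,
    with label 1 for positive reviews and 0 for negative ones.
    """
    pos = pd.DataFrame({"text": list(pos_revs), "label": 1})
    neg = pd.DataFrame({"text": list(neg_revs), "label": 0})
    combined = pd.concat([pos, neg], ignore_index=True)
    # Shuffle so the classes are interleaved rather than stacked in two blocks.
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)

# Tiny stand-in data; in practice pass the pos_revs / neg_revs Series from above.
df = build_labeled_frame(pd.Series(["Great!", "LOVE IT"]), pd.Series(["Boring"]))
print(df)
```

The shuffle is optional, but it avoids feeding a model all positive rows followed by all negative rows.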

A purely pandas version could look something like this, and might be faster:

reader = pd.read_json('Books_5.json.gz',lines=True, chunksize=100000)
pos_revs = pd.DataFrame()
neg_revs = pd.DataFrame()
for chunk in reader:
    if pos := (len(pos_revs) < 250000):
        temp_pos = chunk[chunk['overall'].gt(3)][['summary']]
        pos_revs = pd.concat([pos_revs, temp_pos], ignore_index=True)
        # OPTIONAL:
        pos_revs.drop_duplicates(inplace=True, ignore_index=True)
    if neg := (len(neg_revs) < 250000):
        temp_neg = chunk[chunk['overall'].lt(3)][['summary']]
        neg_revs = pd.concat([neg_revs, temp_neg], ignore_index=True)
        # OPTIONAL:
        neg_revs.drop_duplicates(inplace=True, ignore_index=True)
    if not neg and not pos:
        break

print(pos_revs)
print(neg_revs)

Output:

                                                  summary
0               A story children will love and learn from
1                                              Five Stars
2                                           Not Nice Mice
3                        One of my favorite kids' stories
4                   One of our families favorite books!!!
...                                                   ...
294397  PRATCHETT ON TOP FORM WITH THIS BRILLIANT NEW ...
294398        Pratchett's aphorisms get better and better
294399  Thief of Time - John Deakins for ABSOLUTE MAGN...
294400                           An absolute masterpiece!
294401                     Beautiful, Engaging, A Classic

[294402 rows x 1 columns]

                                                  summary
0                                               Two Stars
1                                  Don't waste your money
2                                    Tony missed the mark
3                              Don't Start with This One!
4                                         Nothing special
...                                                   ...
253377                         Not the caliber of "Naked"
253378                 "Weak Stories" b/w "One Ace Essay"
253379      Fast service - product smells of mildew/mold.
253380  Nothing new, classical narration, good against...
253381  This is a magnificent book-hardcover version b...

[253382 rows x 1 columns]

I'm not sure which method is faster, but both take less than a minute to run on the 6GB file.
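
As for the ValueError in the question: if chunk[chunk['overall'] > 3]: asks Python for the truth value of a whole filtered DataFrame, which pandas refuses to guess. Selecting rows with a boolean mask, as the pandas version above does, avoids the if entirely. Note also that the pandas version above appends whole chunks, so it overshoots 250,000 rows (see the row counts in the output), and it keeps summary rather than reviewText. Below is a rough sketch of a variant that keeps reviewText and stops at exactly the limit. The function name collect_reviews and its parameters are hypothetical, and I have not timed it on the full file.

```python
import pandas as pd

def collect_reviews(path, limit, chunksize=100_000, column="reviewText"):
    """Return (pos, neg) Series with at most `limit` reviews each.

    Hypothetical helper: ratings above 3 count as positive, below 3 as negative.
    """
    pos_parts, neg_parts = [], []
    n_pos = n_neg = 0
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            # Keep only the needed columns; reindex adds the text column as NaN
            # if no record in this chunk has it, and dropna then removes those rows.
            chunk = chunk.reindex(columns=["overall", column]).dropna(subset=[column])
            if n_pos < limit:
                # Boolean mask selects rows; head() trims to the remaining quota.
                part = chunk.loc[chunk["overall"] > 3, column].head(limit - n_pos)
                if len(part):
                    pos_parts.append(part)
                    n_pos += len(part)
            if n_neg < limit:
                part = chunk.loc[chunk["overall"] < 3, column].head(limit - n_neg)
                if len(part):
                    neg_parts.append(part)
                    n_neg += len(part)
            if n_pos >= limit and n_neg >= limit:
                break
    empty = pd.Series(dtype=object)
    pos = pd.concat(pos_parts, ignore_index=True) if pos_parts else empty
    neg = pd.concat(neg_parts, ignore_index=True) if neg_parts else empty
    return pos, neg
```

Usage would be along the lines of pos_revs, neg_revs = collect_reviews('Books_5.json.gz', 250000). Collecting the pieces in lists and concatenating once at the end also avoids re-copying the growing DataFrame on every chunk.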

Upvotes: 1
