Reputation: 33
I'm trying to parse a huge 12 GB JSON file with almost 5 million lines (each one is an object) in Python and store it in a database. I'm using ijson and multiprocessing in order to run it faster. Here is the code:
def parse(paper):
    global mydata
    if 'type' not in paper["venue"]:
        venue = Venues(venue_raw = paper["venue"]["raw"])
        venue.save()
    else:
        venue = Venues(venue_raw = paper["venue"]["raw"], venue_type = paper["venue"]["type"])
        venue.save()
    paper1 = Papers(paper_id = paper["id"], paper_title = paper["title"], venue = venue)
    paper1.save()
    paper_authors = paper["authors"]
    paper_authors_json = json.dumps(paper_authors)
    obj = ijson.items(paper_authors_json, 'item')
    for author in obj:
        mydata = mydata.append({'author_id': author["id"], 'venue_raw': venue.venue_raw, 'year': paper["year"], 'number_of_times': 1}, ignore_index=True)

if __name__ == '__main__':
    p = Pool(4)
    filename = 'C:/Users/dintz/Documents/finaldata/dblp.v12.json'
    with open(filename, encoding='UTF-8') as infile:
        papers = ijson.items(infile, 'item')
        for paper in papers:
            p.apply_async(parse, (paper,))
    p.close()
    p.join()

    mydata = mydata.groupby(by=['author_id', 'venue_raw', 'year'], axis=0, as_index=False).sum()
    mydata = mydata.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False, group_keys=False).apply(lambda x: sum((1 + x.year - x.year.min()) * numpy.log10(x.number_of_times + 1)))
    df = mydata.index.to_frame(index=False)
    df = pd.DataFrame({'author_id': df["author_id"], 'venue_raw': df["venue_raw"], 'rating': mydata.values[:, 2]})

    for index, row in df.iterrows():
        author_id = row['author_id']
        venue = Venues.objects.get(venue_raw = row['venue_raw'])
        rating = Ratings(author_id = author_id, venue = venue, rating = row['rating'])
        rating.save()
However, I get the following error without knowing the reason:
Can somebody help me?
Upvotes: 0
Views: 419
Reputation: 169416
I've had to make quite some extrapolations and assumptions, but it looks like you're using Django and want to populate an SQL database with this paper/author/venue data.

Populating your SQL database can be done pretty neatly with something like the following.

- I've added the tqdm package so you get a progress indication.
- This assumes a PaperAuthor model that links papers and authors (a sketch of the models follows below).
- Venues are looked up with get_or_create, so duplicate Venues don't end up in the database.
- get_or_create and create are replaced with stubs to make this runnable without the database models (or indeed, without Django), just having the dataset you're using available.

On my machine, this consumes practically no memory, as the records are (or would be) dumped into the SQL database, not into an ever-growing, fragmenting dataframe in memory.
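For reference, here's a minimal sketch of what those models could look like in Django. The field names come from the code below; the field types and lengths are my assumptions, not your actual models:

from django.db import models

class Venue(models.Model):
    # venue_raw/venue_type as used in parse(); lengths are guesses
    venue_raw = models.CharField(max_length=255)
    venue_type = models.CharField(max_length=64, null=True, blank=True)

class Paper(models.Model):
    paper_id = models.IntegerField(unique=True)
    paper_title = models.TextField()
    venue = models.ForeignKey(Venue, on_delete=models.CASCADE)

class PaperAuthor(models.Model):
    # Links a paper to one of its authors, keeping the year for later analysis
    paper = models.ForeignKey(Paper, on_delete=models.CASCADE)
    author_id = models.IntegerField()
    year = models.IntegerField()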
The Pandas processing is left as an exercise for the reader ;-), but I'd imagine it'd involve pd.read_sql() to read this preprocessed data back from the database (a rough sketch follows after the code).
import multiprocessing

import ijson
import tqdm


def get_or_create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.get_or_create(**kwargs)
    return (None, True)


def create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.create(**kwargs)
    return None


Venue = "Venue"
Paper = "Paper"
PaperAuthor = "PaperAuthor"


def parse(paper):
    venue_name = paper["venue"]["raw"]
    venue_type = paper["venue"].get("type")
    venue, _ = get_or_create(Venue, venue_raw=venue_name, venue_type=venue_type)
    paper_obj = create(Paper, paper_id=paper["id"], paper_title=paper["title"], venue=venue)
    for author in paper["authors"]:
        create(PaperAuthor, paper=paper_obj, author_id=author["id"], year=paper["year"])


def main():
    filename = "F:/dblp.v12.json"
    with multiprocessing.Pool() as p, open(filename, encoding="UTF-8") as infile:
        for result in tqdm.tqdm(p.imap_unordered(parse, ijson.items(infile, "item"), chunksize=64)):
            pass


if __name__ == "__main__":
    main()
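As for that Pandas step, something along these lines might work. This is a rough sketch only: the table names assume Django's default "<app>_<model>" naming with a placeholder app label "myapp", and it reuses the rating formula from your original code; adjust to your actual schema.

import numpy as np
import pandas as pd
from django.db import connection

query = """
    SELECT pa.author_id, v.venue_raw, pa.year
    FROM myapp_paperauthor AS pa
    JOIN myapp_paper AS p ON p.id = pa.paper_id
    JOIN myapp_venue AS v ON v.id = p.venue_id
"""
# Read the preprocessed records straight from the database.
mydata = pd.read_sql(query, connection)

# Count how many times each author appears at a venue per year...
counts = (
    mydata.groupby(["author_id", "venue_raw", "year"], as_index=False)
    .size()
    .rename(columns={"size": "number_of_times"})
)

# ...then apply the rating formula from the question per (author, venue).
ratings = (
    counts.groupby(["author_id", "venue_raw"])
    .apply(lambda g: ((1 + g["year"] - g["year"].min())
                      * np.log10(g["number_of_times"] + 1)).sum())
    .rename("rating")
    .reset_index()
)

Letting the database do the grouping (or at least the join) keeps memory flat, which is the whole point of streaming the JSON in the first place.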
Upvotes: 1