Reputation: 33
I'm trying to parse a huge 12 GB JSON file with almost 5 million lines (each one is an object) in Python and store it in a database. I'm using ijson and multiprocessing in order to run it faster. Here is the code:
def parse(paper):
    global mydata
    if 'type' not in paper["venue"]:
        venue = Venues(venue_raw = paper["venue"]["raw"])
        venue.save()
    else:
        venue = Venues(venue_raw = paper["venue"]["raw"], venue_type = paper["venue"]["type"])
        venue.save()
    paper1 = Papers(paper_id = paper["id"], paper_title = paper["title"], venue = venue)
    paper1.save()
    paper_authors = paper["authors"]
    paper_authors_json = json.dumps(paper_authors)
    obj = ijson.items(paper_authors_json, 'item')
    for author in obj:
        mydata = mydata.append({'author_id': author["id"], 'venue_raw': venue.venue_raw, 'year': paper["year"], 'number_of_times': 1}, ignore_index=True)

if __name__ == '__main__':
    p = Pool(4)
    filename = 'C:/Users/dintz/Documents/finaldata/dblp.v12.json'
    with open(filename, encoding='UTF-8') as infile:
        papers = ijson.items(infile, 'item')
        for paper in papers:
            p.apply_async(parse, (paper,))
    p.close()
    p.join()

    mydata = mydata.groupby(by=['author_id', 'venue_raw', 'year'], axis=0, as_index=False).sum()
    mydata = mydata.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False, group_keys=False).apply(lambda x: sum((1 + x.year - x.year.min()) * numpy.log10(x.number_of_times + 1)))
    df = mydata.index.to_frame(index=False)
    df = pd.DataFrame({'author_id': df["author_id"], 'venue_raw': df["venue_raw"], 'rating': mydata.values[:, 2]})

    for index, row in df.iterrows():
        author_id = row['author_id']
        venue = Venues.objects.get(venue_raw = row['venue_raw'])
        rating = Ratings(author_id = author_id, venue = venue, rating = row['rating'])
        rating.save()
However, I get the following error without knowing the reason:
Can somebody help me?
Upvotes: 0
Views: 419
Reputation: 169416
I've had to make quite some extrapolations and assumptions, but it looks like you're using Django and want to populate an SQL database with this paper/author/venue data.

Populating your SQL database can be done pretty neatly with something like the following.

- I've added the tqdm package so you get a progress indication.
- This assumes a PaperAuthor model that links papers and authors (a sketch of the models follows below).
- Venues are looked up with get_or_create, so duplicate Venues don't end up in the database.
- get_or_create and create are replaced with stubs to make this runnable without the database models (or indeed, without Django), just having the dataset you're using available.

On my machine, this consumes practically no memory, as the records are (or would be) dumped into the SQL database, not into an ever-growing, fragmenting dataframe in memory.
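For reference, here's a minimal sketch of what those models could look like in Django. The field names come from the code below; the field types and lengths are my assumptions, not your actual models:

from django.db import models

class Venue(models.Model):
    # venue_raw/venue_type as used in parse(); lengths are guesses
    venue_raw = models.CharField(max_length=255)
    venue_type = models.CharField(max_length=64, null=True, blank=True)

class Paper(models.Model):
    paper_id = models.IntegerField(unique=True)
    paper_title = models.TextField()
    venue = models.ForeignKey(Venue, on_delete=models.CASCADE)

class PaperAuthor(models.Model):
    # Links a paper to one of its authors, keeping the year for later analysis
    paper = models.ForeignKey(Paper, on_delete=models.CASCADE)
    author_id = models.IntegerField()
    year = models.IntegerField()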
The Pandas processing is left as an exercise for the reader ;-), but I'd imagine it'd involve pd.read_sql() to read this preprocessed data back from the database (a rough sketch follows after the code).
import multiprocessing

import ijson
import tqdm


def get_or_create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.get_or_create(**kwargs)
    return (None, True)


def create(model, **kwargs):
    # Actual Django statement:
    # return model.objects.create(**kwargs)
    return None


Venue = "Venue"
Paper = "Paper"
PaperAuthor = "PaperAuthor"


def parse(paper):
    venue_name = paper["venue"]["raw"]
    venue_type = paper["venue"].get("type")
    venue, _ = get_or_create(Venue, venue_raw=venue_name, venue_type=venue_type)
    paper_obj = create(Paper, paper_id=paper["id"], paper_title=paper["title"], venue=venue)
    for author in paper["authors"]:
        create(PaperAuthor, paper=paper_obj, author_id=author["id"], year=paper["year"])


def main():
    filename = "F:/dblp.v12.json"
    with multiprocessing.Pool() as p, open(filename, encoding="UTF-8") as infile:
        for result in tqdm.tqdm(p.imap_unordered(parse, ijson.items(infile, "item"), chunksize=64)):
            pass


if __name__ == "__main__":
    main()
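As for that Pandas step, something along these lines might work. This is a rough sketch only: the table names assume Django's default "<app>_<model>" naming with a placeholder app label "myapp", and it reuses the rating formula from your original code; adjust to your actual schema.

import numpy as np
import pandas as pd
from django.db import connection

query = """
    SELECT pa.author_id, v.venue_raw, pa.year
    FROM myapp_paperauthor AS pa
    JOIN myapp_paper AS p ON p.id = pa.paper_id
    JOIN myapp_venue AS v ON v.id = p.venue_id
"""
# Read the preprocessed records straight from the database.
mydata = pd.read_sql(query, connection)

# Count how many times each author appears at a venue per year...
counts = (
    mydata.groupby(["author_id", "venue_raw", "year"], as_index=False)
    .size()
    .rename(columns={"size": "number_of_times"})
)

# ...then apply the rating formula from the question per (author, venue).
ratings = (
    counts.groupby(["author_id", "venue_raw"])
    .apply(lambda g: ((1 + g["year"] - g["year"].min())
                      * np.log10(g["number_of_times"] + 1)).sum())
    .rename("rating")
    .reset_index()
)

Letting the database do the grouping (or at least the join) keeps memory flat, which is the whole point of streaming the JSON in the first place.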
Upvotes: 1