How can I extract information quickly from 130,000+ JSON files located in S3?

I have an S3 bucket with over 130,000 JSON files, and I need to calculate numbers based on the data in them (for example, the number of speakers of each gender). I am currently using the S3 paginator and json.loads to read each file and extract the information, but it takes a very long time to process such a large number of files (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you. Here is some of my code:

import json

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name', StartAfter='')

for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            s3 = boto3.resource('s3')
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = (json_content['dict-name'])

Upvotes: 1

Answers (1)

Jonathan Leon

Reputation: 5648

To use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear whether your 2-3 files per second covers only the read or also includes part of the number crunching; either way, multiprocessing should speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, then do your analysis.

For this to be useful for me, I run it on spot instances that have lots of vCPUs and memory. I've found that the network-optimized instances (like c5n; look for the n) and inf1 (for machine learning) are much faster at reading/writing than, for example, T or M instance types.

My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. Multiprocessing is orders of magnitude faster than a single process.
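
The fan-out call used in the main script below, starmap over zip(repeat(bucket), keys_list) with a chunksize, can be tried without touching S3. This is only a minimal sketch: fake_reader and read_all are hypothetical names, and ThreadPool (a thread-backed class with the same interface as multiprocessing.Pool) is swapped in so the snippet runs anywhere without an `if __name__ == '__main__':` guard.

```python
from itertools import repeat
from multiprocessing.pool import ThreadPool  # same interface as multiprocessing.Pool, thread-backed


def fake_reader(bucket, key):
    """Hypothetical stand-in for an S3 read: returns a label instead of file contents."""
    return f"{bucket}/{key}"


def read_all(bucket, keys, num_workers=4, chunksize=15):
    # zip(repeat(bucket), keys) yields (bucket, key) tuples;
    # starmap unpacks each tuple into fake_reader(bucket, key)
    # and returns the results in the same order as the input keys.
    with ThreadPool(num_workers) as pool:
        return pool.starmap(fake_reader, zip(repeat(bucket), keys), chunksize)


keys = [f"file_{i}.json" for i in range(40)]
results = read_all("your-bucket", keys)
print(len(results), results[0])
```

The chunksize argument controls how many (bucket, key) pairs are handed to a worker at a time; larger chunks mean less hand-off overhead per file, which tends to matter with many small files.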

File 1: your main script

# create script.py file
import os
from itertools import repeat
from multiprocessing import Pool

import pandas as pd

from utils_file_handling import file_utilities

ufh = file_utilities() #instantiate the class functions - see below (second file)

bucket = 'your-bucket'
prefix = 'your-prefix/here/' # if you don't have a prefix, pass '' (empty string) or the function will fail

#define multiprocessing function - get to know this; it uses multiple processes to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    # starmap blocks until every result is in, and the with-block cleans up the pool on exit
    with Pool(num_proc) as pool:
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15) # 15 is the chunksize
    return df_list

# the guard is required where worker processes are started with "spawn" (Windows, macOS)
if __name__ == '__main__':
    #create your master keys list upfront; you can loop through all or slice the list to test
    keys_list = ufh.get_keys_from_prefix(bucket, prefix)
    # keys_list = keys_list[0:2000] # as an example

    num_proc = os.cpu_count() #number of CPUs on your machine; the function above defaults to 4 unless given
    df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc) #collect dataframes for each file
    df_new = pd.concat(df_list, sort=False)
    df_new = df_new.reset_index(drop=True)
    # do your analysis on the dataframe

File 2: class functions

#utils_file_handling.py
# create this in a separate file; name as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd   

#define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')

class file_utilities:
    """file handling function"""

    def get_keys_from_prefix(self, bucket, prefix):
        '''gets list of keys for given bucket and prefix'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # use Delimiter to limit search to that level of hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            # 'Contents' is missing from a page with no objects, so default to an empty list
            keys = [content['Key'] for content in page.get('Contents', [])]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list

    def read_json_file_from_s3(self, bucket, key):
        """read json file and return its contents as a string"""
        # reuse the module-level client instead of creating a new one for every file
        obj = s3sc.get_object(Bucket=bucket, Key=key)
        data = obj['Body'].read().decode('utf-8')
        return data

    # you may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''returns dataframe'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])
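
For the analysis step, a count per gender (the example from the question) might look like the sketch below. The column names speaker and gender are assumptions, as is the idea that ['dict-name'] holds a list of speaker records; the two small dataframes stand in for what reader_json would return per file.

```python
import pandas as pd

# Hypothetical per-file dataframes, shaped like what reader_json might return
# if json_content['dict-name'] were a list of speaker records.
df_list = [
    pd.DataFrame([{"speaker": "a", "gender": "female"},
                  {"speaker": "b", "gender": "male"}]),
    pd.DataFrame([{"speaker": "c", "gender": "female"}]),
]

# Same concatenation as in script.py
df_new = pd.concat(df_list, sort=False).reset_index(drop=True)

# Count rows per gender; dropna=False keeps records with a missing value visible
gender_counts = df_new["gender"].value_counts(dropna=False)
print(gender_counts.to_dict())
```

If the real records are nested differently, only the construction of the per-file dataframe in reader_json should need to change; the concat and value_counts steps stay the same.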

Upvotes: 1
