horatio1701d

Reputation: 9159

Combining Multiple JSON Objects as One DataFrame in Python Pandas

I'm not sure what I'm missing here. I have 2 zip files that contain JSON files, and I'm trying to extract the data from them and combine it into one DataFrame, but my loop keeps giving me separate records. Here is what I have prior to constructing the DataFrame. I tried pd.concat, but I think my issue has more to do with the way I'm reading the files in the first place.

import glob
import json
import zipfile

data = []
for FileZips in glob.glob('*.zip'):
    with zipfile.ZipFile(FileZips, 'r') as myzip:
        for logfile in myzip.namelist():
            with myzip.open(logfile) as f:
                # the JSON payload is the second-to-last line of each log file
                contents = f.readlines()[-2]
                jfile = json.loads(contents)
                print(len(jfile))

This prints:

40935
40935

Upvotes: 0

Views: 1985

Answers (2)

Andy Hayden

Reputation: 375435

You can use pd.read_json (assuming it's valid JSON).
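As a minimal sketch of what read_json does with one line of JSON text: the sample payload below is made up for illustration (the real log contents aren't shown in the question), and it assumes each line holds a list of records.

```python
import io

import pandas as pd

# Hypothetical payload standing in for one JSON line from a log file.
sample_line = '[{"id": 1, "status": "ok"}, {"id": 2, "status": "fail"}]'

# Wrapping the text in StringIO hands read_json a file-like object.
sample_df = pd.read_json(io.StringIO(sample_line))
print(sample_df.shape)
```

Each record becomes a row and each key becomes a column.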

I would also break this up into more functions for readability:

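import glob
import io
import zipfile

import pandas as pd

def zip_to_df(zip_file):
    # one DataFrame per zip: concatenate the frames of all its log files
    with zipfile.ZipFile(zip_file, 'r') as myzip:
        return pd.concat((log_as_df(logfile, myzip)
                             for logfile in myzip.namelist()),
                         ignore_index=True)

def log_as_df(logfile, myzip):
    with myzip.open(logfile, 'r') as f:
        # the JSON payload is the second-to-last line of the log file
        contents = f.readlines()[-2]
        # wrap the bytes in a buffer so read_json parses them as file content
        return pd.read_json(io.BytesIO(contents))

df = pd.concat(map(zip_to_df, glob.glob('*.zip')), ignore_index=True)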

Note: This does more concats than strictly necessary, but I think it's worth it for readability. You could do just one concat...
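For reference, a single-concat variant might look roughly like the sketch below. The helper names iter_log_frames and zips_to_df are made up for this illustration, and it keeps the question's assumption that the JSON payload is the second-to-last line of each log file.

```python
import glob
import io
import zipfile

import pandas as pd


def iter_log_frames(zip_paths):
    """Yield one DataFrame per log file found in the given zip archives."""
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path, 'r') as myzip:
            for logfile in myzip.namelist():
                with myzip.open(logfile) as f:
                    # Assumption carried over from the question: the JSON
                    # payload is the second-to-last line of the log file.
                    contents = f.readlines()[-2]
                yield pd.read_json(io.BytesIO(contents))


def zips_to_df(pattern='*.zip'):
    """Collect every frame first, then call pd.concat exactly once."""
    frames = list(iter_log_frames(sorted(glob.glob(pattern))))
    return pd.concat(frames, ignore_index=True)
```

Calling zips_to_df() in the directory holding the zip files returns the combined frame. If the pattern matches nothing, pd.concat raises a ValueError because there is nothing to concatenate.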

Upvotes: 2

horatio1701d

Reputation: 9159

I was able to get what I needed with a small adjustment to my indentation!

import glob
import json
import zipfile

import pandas as pd

dfs = []
for FileZips in glob.glob('*.zip'):
    with zipfile.ZipFile(FileZips, 'r') as myzip:
        for logfile in myzip.namelist():
            with myzip.open(logfile, 'r') as f:
                contents = f.readlines()[-2]
                jfile = json.loads(contents)
                dfs.append(pd.DataFrame(jfile))
                # rebuilt on every pass; the final value holds all records
                df = pd.concat(dfs, ignore_index=True)
print(len(df))

Upvotes: 2
