Patrick

Reputation: 171

Long text file (186 million lines) takes up too much space when parsed into a table format

I am running a simulation in Python 3.7 that outputs a log file. This log file contains information for four columns that I want to extract ('Rank', 'Particle', 'Distance', 'Time'); however, the file is so long (~186 million rows) that it cannot be converted into a table without the memory exploding.

There is a lot of superfluous information in the log file (i.e. lots of rows I don't want). The data represent test bodies having close encounters with the planet Jupiter, and I would only like to keep the closest point of each particle's encounter path (i.e. the row where the distance is minimized).

I wanted to know how I could parse through the file sequentially, loading and then discarding a subset of rows each time, and decide which rows to keep, so that I avoid a memory error.
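What I have in mind is something like the following rough sketch (untested; 'ce.log' is just a placeholder for my log file, and the regex assumes the format shown in the sample below), which streams the file line by line and keeps only the running minimum distance per particle:

import re

pattern = re.compile(r'Rank: (\d+); Particle: (\d+); Distance: ([\d.]+); Time: (-?[\d.]+)')

closest = {}  # particle index -> (rank, particle, distance, time)

with open('ce.log') as f:  # placeholder file name
    for line in f:
        m = pattern.search(line)
        if m is None:
            continue  # skip lines that aren't close-encounter records
        rank, particle, distance, time = m.groups()
        distance = float(distance)
        # keep only the smallest distance seen so far for this particle
        if particle not in closest or distance < closest[particle][2]:
            closest[particle] = (int(rank), int(particle), distance, float(time))

I'm not sure whether this is the right way to go about it, or whether it handles a particle that has several separate encounters.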

Here is a sample of the log file:

> INFO:root:Rank: 9; Particle: 11; Distance: 0.9091072240849053; Time: -16.313304965974524
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0044817868831895; Time: -16.313304965974524
> INFO:root:Rank: 9; Particle: 11; Distance: 0.908626047054527; Time: -16.313713653638327
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0039465102430458; Time: -16.313713653638327
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9080831675466843; Time: -16.31417484234347
> INFO:root:Rank: 9; Particle: 12; Distance: 1.003342787368617; Time: -16.31417484234347
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9075612522257289; Time: -16.314618315598103
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0027625719975715; Time: -16.314618315598103
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9071397102705921; Time: -16.3149765686745
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0022940809354668; Time: -16.3149765686745
> INFO:root:Rank: 9; Particle: 17; Distance: 1.0138064947281393; Time: -16.3149765686745
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9068825428781885; Time: -16.31519515543922
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0020083325953948; Time: -16.31519515543922
> INFO:root:Rank: 9; Particle: 17; Distance: 1.013519683237125; Time: -16.31519515543922
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9094533423012789; Time: -16.31301103889693
> INFO:root:Rank: 9; Particle: 12; Distance: 1.004866919381637; Time: -16.31301103889693
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9091072240849053; Time: -16.313304965974524
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0044817868831895; Time: -16.313304965974524
> INFO:root:Rank: 9; Particle: 11; Distance: 0.908626047054527; Time: -16.313713653638327
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0039465102430458; Time: -16.313713653638327
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9080831675466843; Time: -16.31417484234347
> INFO:root:Rank: 9; Particle: 12; Distance: 1.003342787368617; Time: -16.31417484234347
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9075612522257289; Time: -16.314618315598103
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0027625719975715; Time: -16.314618315598103
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9071397102705921; Time: -16.3149765686745
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0022940809354668; Time: -16.3149765686745
> INFO:root:Rank: 9; Particle: 17; Distance: 1.0138064947281393; Time: -16.3149765686745
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9068825428781885; Time: -16.31519515543922
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0020083325953948; Time: -16.31519515543922
> INFO:root:Rank: 9; Particle: 17; Distance: 1.013519683237125; Time: -16.31519515543922
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9068198463831555; Time: -16.31524844951857
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0019386751793453; Time: -16.31524844951857
> INFO:root:Rank: 9; Particle: 17; Distance: 1.0134497671630922; Time: -16.31524844951857
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9066701792148222; Time: -16.315375676922567
> INFO:root:Rank: 9; Particle: 12; Distance: 1.00177240223002; Time: -16.315375676922567
> INFO:root:Rank: 9; Particle: 17; Distance: 1.013282877600642; Time: -16.315375676922567
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9063404096803097; Time: -16.315656030600657
> INFO:root:Rank: 9; Particle: 12; Distance: 1.0014060996373213; Time: -16.315656030600657
> INFO:root:Rank: 9; Particle: 15; Distance: 1.0137165581155958; Time: -16.315656030600657
> INFO:root:Rank: 9; Particle: 17; Distance: 1.012915220608835; Time: -16.315656030600657
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9058819575130683; Time: -16.316045845280794
> INFO:root:Rank: 9; Particle: 12; Distance: 1.000896985053485; Time: -16.316045845280794
> INFO:root:Rank: 9; Particle: 15; Distance: 1.0132054747127601; Time: -16.316045845280794
> INFO:root:Rank: 9; Particle: 17; Distance: 1.0124042327584963; Time: -16.316045845280794
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9053647124033892; Time: -16.316485736531497
> INFO:root:Rank: 9; Particle: 12; Distance: 1.000322757426278; Time: -16.316485736531497
> INFO:root:Rank: 9; Particle: 15; Distance: 1.0126290399058455; Time: -16.316485736531497
> INFO:root:Rank: 9; Particle: 17; Distance: 1.0118279051195338; Time: -16.316485736531497
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9048674370339668; Time: -16.31690873042198
> INFO:root:Rank: 9; Particle: 12; Distance: 0.9997708766377388; Time: -16.31690873042198
> INFO:root:Rank: 9; Particle: 15; Distance: 1.012075051289847; Time: -16.31690873042198
> INFO:root:Rank: 9; Particle: 17; Distance: 1.011274018895163; Time: -16.31690873042198
> INFO:root:Rank: 9; Particle: 11; Distance: 0.9044657930933018; Time: -16.317250439557714
> INFO:root:Rank: 9; Particle: 12; Distance: 0.9993252554048654; Time: -16.317250439557714

And here is what I originally wrote to turn it into a table (before I realized how long the file was):

import re
import pandas as pd

def ce_log_to_table(log_file):
    with open(log_file) as f:
        lines = f.readlines()  # reads the entire file into memory at once

    ranks = []
    indices = []
    distances = []
    times = []

    for line in lines:
        rank = re.search('(?!Rank: )[0-9]*(?=; P)', line)
        index = re.search('(?!Particle: )[0-9]*(?=; D)', line)
        distance = re.search('(?!Distance: )[0-9.0-9]*(?=; T)', line)
        time = re.search('(?!Time: )-[0-9.0-9]*', line)

        ranks.append(rank[0])
        indices.append(index[0])
        distances.append(distance[0])
        times.append(time[0])

    ce_dict = {'rank': ranks, 'index': indices, 'distance': distances, 'time': times}
    df = pd.DataFrame(ce_dict)

    return df

Side note: the file viewer GUI says the file is 26 MB, but the du command in the terminal says it is actually 16 GB! Not sure why the GUI gets it wrong.

Upvotes: 1

Views: 80

Answers (2)

David Erickson

Reputation: 16683

I would use dask, pandas' big-data big brother (note: I renamed some of your variables, as you shouldn't use names like index or time since they can shadow built-ins or standard modules):

import re

import dask.dataframe as dd
import pandas as pd

logfile = r'Desktop\dd.txt'  # raw string so the backslash is not treated as an escape
df = dd.read_csv(logfile, header=None)
df

def ce_log_to_table(df):    
    ranks = []
    indices = []
    distances = []
    times = []

    for line in df[0]:
        rnk = re.search('(?!Rank: )[0-9]*(?=; P)', line)
        idx = re.search('(?!Particle: )[0-9]*(?=; D)', line)
        dstnc = re.search('(?!Distance: )[0-9.0-9]*(?=; T)', line)
        t = re.search('(?!Time: )-[0-9.0-9]*', line)

        ranks.append(rnk[0])
        indices.append(idx[0])
        distances.append(dstnc[0])
        times.append(t[0])

    ce_dict = {'rank': ranks, 'index': indices, 'distance': distances, 'time': times}
    df = pd.DataFrame(ce_dict)
    return df


ce_log_to_table(df).to_csv('dask_test.txt')
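If the per-line Python loop turns out to be slow, a more dask-idiomatic variant would be to let dask apply one vectorized regex extraction lazily and then reduce to the closest approach per particle. This is just a sketch (untested); the column names, regex, and output path are placeholders:

import dask.dataframe as dd

df = dd.read_csv(r'Desktop\dd.txt', header=None, names=['raw'])

pattern = (r'Rank: (?P<rank>\d+); Particle: (?P<index>\d+); '
           r'Distance: (?P<distance>[\d.]+); Time: (?P<time>-?[\d.]+)')
parsed = df['raw'].str.extract(pattern)
parsed = parsed.astype({'rank': 'int64', 'index': 'int64',
                        'distance': 'float64', 'time': 'float64'})

# closest approach per particle: merge each particle's minimum distance back in
# (exact ties in distance would produce duplicate rows)
mins = parsed.groupby('index')['distance'].min().reset_index()
closest = parsed.merge(mins, on=['index', 'distance'])
closest.to_csv('closest-*.csv')  # placeholder pattern; dask writes one file per partition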

Upvotes: 1

GLaw1300

Reputation: 205

You could just wrap your for loop (for line in lines) in another for loop (for i in range(x), where x is the number of chunks you'd like to separate lines into) and then iterate over lines[i::x].

So something like:

for i in range(1000):            # separate lines into 1000 chunks
    for line in lines[i::1000]:  # every 1000th line, starting at offset i
        # do stuff here
        yield df                 # if this is what you want to do (see below)

Then, if you want a DataFrame back, you'd yield the constructed DataFrame for each chunk and process the chunks one at a time outside the function, as in the sketch below.
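For example, something along these lines (a rough sketch; the regex, column names, and chunk size are placeholders, and it reads the file lazily rather than calling readlines(), so the whole log never sits in memory):

import re
import pandas as pd

PATTERN = re.compile(r'Rank: (\d+); Particle: (\d+); Distance: ([\d.]+); Time: (-?[\d.]+)')

def ce_log_chunks(log_file, chunk_size=1_000_000):
    """Yield one small DataFrame per chunk of parsed log lines."""
    rows = []
    with open(log_file) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                rows.append(m.groups())
            if len(rows) >= chunk_size:
                yield pd.DataFrame(rows, columns=['rank', 'index', 'distance', 'time'])
                rows = []
    if rows:  # last, partial chunk
        yield pd.DataFrame(rows, columns=['rank', 'index', 'distance', 'time'])

# the caller then processes one chunk at a time, e.g. keeping a running
# per-particle minimum distance across chunks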

Upvotes: 0
