Dexter

Reputation: 1551

Analyzing huge dataset from 80GB file with only 32GB of RAM

I have a very large text file (around 80 GB). The file contains only numbers (integers and floats) and has 20 columns. I have to analyze each column; by analyze I mean doing some basic calculations on each column, such as finding the mean, plotting histograms, and checking whether a condition is satisfied. I am reading the file as follows:

import numpy as np

with open(filename) as original_file:
    all_rows = [[float(digit) for digit in line.split()] for line in original_file]
all_rows = np.asarray(all_rows)

After this I do all the analysis on specific columns. I run the program on a reasonably good server/workstation (with 32 GB of RAM), but I am not able to finish the job: I waited almost a day and the program was still running, so I eventually had to kill it manually. I know the script itself is correct, because I have run the same script on smaller files (around 1 GB) and it worked fine.

My initial guess is that it is a memory problem. Is there any way I can run such a job, perhaps with a different method or some other approach?

I also tried splitting the file into smaller pieces and analyzing them individually in a loop, as follows:

pre_name = "split_file"   
for k in range(11):  #There are 10 files with almost 8G each
        filename = pre_name+str(k).zfill(3) #My files are in form "split_file000, split_file001 ..."
        with open(filename) as original_file:
            all_rows = [[float(digit) for digit in line.split()] for line in original_file]
        all_rows = np.asarray(all_rows)
        #Some analysis here
        plt.hist(all_rows[:,8],100)  #Plotting histogram for 9th Column
all_rows = None

I have tested the code above on a bunch of smaller files and it works fine, but I ran into the same problem again with the big files. Any suggestions? Is there a cleaner way to do this?

Upvotes: 1

Views: 804

Answers (2)

moooeeeep

Reputation: 32542

Two alternatives come to my mind:

  • You should consider performing your computation using online algorithms

    In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.

    It is possible to compute the mean, the variance, and a histogram with pre-specified bins this way, with constant memory complexity; see the sketch after this list.

  • You should throw your data into a proper database and make use of that database system's statistical and data handling capabilities, e.g., aggregate functions and indexes; a minimal sketch of this approach also follows below.

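For illustration, here is a minimal sketch of the online approach from the first bullet, assuming whitespace-separated rows as in the question. The function name, the column index, and the bin edges in the usage comment are made-up placeholders, not anything from the question.

def analyze_column_online(filename, col, bin_edges):
    """One pass over the file: running mean/variance (Welford's algorithm)
    and a histogram with pre-specified bins, all in constant memory."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    counts = [0] * (len(bin_edges) - 1)

    with open(filename) as f:
        for line in f:
            x = float(line.split()[col])
            # Welford's online update for mean and variance
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
            # Fixed-bin histogram: values outside the edges are ignored
            for i in range(len(counts)):
                if bin_edges[i] <= x < bin_edges[i + 1]:
                    counts[i] += 1
                    break

    variance = m2 / (n - 1) if n > 1 else float("nan")
    return mean, variance, counts

# Hypothetical usage, e.g. for the 9th column with 100 bins from 0 to 1000:
# mean, var, counts = analyze_column_online("huge_file.txt", 8, [i * 10.0 for i in range(101)])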
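And a rough sketch of the database route using Python's built-in sqlite3, just to show the idea of pushing the aggregation into the database engine. The database file, table and column names, and the example threshold are all illustrative assumptions.

import sqlite3

# All names below (database file, table, columns, threshold) are illustrative.
cols = ", ".join("c{} REAL".format(i) for i in range(20))
placeholders = ", ".join(["?"] * 20)

conn = sqlite3.connect("measurements.db")
conn.execute("CREATE TABLE IF NOT EXISTS data ({})".format(cols))

# One pass to load the text file; rows are streamed, not held in memory at once.
with open("huge_file.txt") as f:
    rows = ([float(v) for v in line.split()] for line in f)
    conn.executemany("INSERT INTO data VALUES ({})".format(placeholders), rows)
conn.commit()

# The aggregation runs inside the database engine, not in Python's memory.
mean_9, n_above = conn.execute(
    "SELECT AVG(c8), SUM(c8 > 100.0) FROM data"
).fetchone()
conn.close()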

Upvotes: 2

honza_p

Reputation: 2093

For such lengthy operations (when the data don't fit in memory), it might be useful to use a library like dask ( http://dask.pydata.org/en/latest/ ), particularly dask.dataframe.read_csv to read the data, and then perform your operations as you would with the pandas library (another useful package worth mentioning). A rough sketch of what that could look like is shown below.
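A minimal sketch along those lines, assuming a whitespace-separated, headerless file; the filename, column index, and bin count are placeholders, and the exact API (e.g. to_dask_array) assumes a reasonably recent dask version.

import dask.dataframe as dd
import dask.array as da
import matplotlib.pyplot as plt

# "huge_file.txt" and the whitespace-separated, headerless layout are assumptions.
df = dd.read_csv("huge_file.txt", sep=r"\s+", header=None)

col = df[8]                      # 9th column
mean_9 = col.mean().compute()    # evaluated out-of-core, chunk by chunk

# dask needs the bin range up front, so get min/max first
lo, hi = col.min().compute(), col.max().compute()
counts, edges = da.histogram(col.to_dask_array(lengths=True), bins=100, range=(lo, hi))
counts = counts.compute()

plt.bar(edges[:-1], counts, width=edges[1] - edges[0])
plt.show()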

Upvotes: 3
