Reputation: 1551
I have a very large text file (around 80 GB). The file contains only numbers (integers and floats) and has 20 columns. Now I have to analyze each column. By analyze I mean I have to do some basic calculations on each column: find the mean, plot histograms, check whether a condition is satisfied, etc. I am reading the file as follows:
import numpy as np

with open(filename) as original_file:
    all_rows = [[float(digit) for digit in line.split()] for line in original_file]
all_rows = np.asarray(all_rows)
After this I do all the analysis on specific columns. I use a 'good' server/workstation (32 GB RAM) to run my program. The problem is that the job never finishes: I waited almost a day and it was still running, so I had to kill it manually. I know my script is correct, because I have run the same script on smaller files (around 1 GB) and it worked fine.
My initial guess is that this is a memory problem. Is there any way to run such a job? Some different method, or some other way?
I tried splitting the file into smaller files and then analyzing them one by one in a loop, as follows:
pre_name = "split_file"
for k in range(11): #There are 10 files with almost 8G each
filename = pre_name+str(k).zfill(3) #My files are in form "split_file000, split_file001 ..."
with open(filename) as original_file:
all_rows = [[float(digit) for digit in line.split()] for line in original_file]
all_rows = np.asarray(all_rows)
#Some analysis here
plt.hist(all_rows[:,8],100) #Plotting histogram for 9th Column
all_rows = None
I have tested the above code on a bunch of smaller files and it works fine. However, I ran into the same problem again with the big files. Any suggestions? Is there a cleaner way to do this?
Upvotes: 1
Views: 804
Reputation: 32542
Two alternatives come to mind:
You should consider performing your computation with online algorithms:
In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.
Mean, variance, and a histogram with pre-specified bins can all be computed this way with constant memory; a minimal sketch follows after this list.
You could load your data into a proper database and make use of that database system's statistical and data-handling capabilities, e.g., aggregate functions and indexes; a sketch of that route follows as well.
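A minimal sketch of the online-algorithm route, assuming the layout from the question (whitespace-separated values, the 9th column being the one of interest) and an assumed value range for the histogram bins (the file path, column index, and bin edges are placeholders):

import numpy as np

filename = "huge_file.txt"                     # placeholder path to the 80 GB file
col = 8                                        # 9th column, as in the question
bin_edges = np.linspace(-100.0, 100.0, 101)    # assumed data range; adjust to your data
counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)

n = 0
mean = 0.0
m2 = 0.0                                       # running sum of squared deviations (Welford)

with open(filename) as f:
    for line in f:
        x = float(line.split()[col])
        # Welford's online update for mean and variance
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # Histogram with pre-specified bins
        idx = np.searchsorted(bin_edges, x, side="right") - 1
        if 0 <= idx < len(counts):
            counts[idx] += 1

variance = m2 / (n - 1) if n > 1 else 0.0
print(mean, variance)

Only one line is held in memory at a time, so this works for any file size; the same pass can update the statistics for all 20 columns at once.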
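And a minimal sketch of the database route using SQLite from the standard library (the table and column names are made up for illustration; any other RDBMS with a bulk-load tool would work as well):

import sqlite3

conn = sqlite3.connect("columns.db")                        # on-disk database, not in-memory
conn.execute("CREATE TABLE IF NOT EXISTS data (c9 REAL)")   # one column shown for brevity

# Load the file in modest batches so memory stays bounded.
with open("huge_file.txt") as f, conn:
    batch = []
    for line in f:
        batch.append((float(line.split()[8]),))             # 9th column
        if len(batch) == 100000:
            conn.executemany("INSERT INTO data VALUES (?)", batch)
            batch = []
    if batch:
        conn.executemany("INSERT INTO data VALUES (?)", batch)

# Aggregates run inside the database engine.
mean, = conn.execute("SELECT AVG(c9) FROM data").fetchone()
print(mean)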
Upvotes: 2
Reputation: 2093
For such lengthy operations (when the data don't fit in memory), it can be useful to use a library like dask (http://dask.pydata.org/en/latest/), in particular dask.dataframe.read_csv to read the data, and then perform your operations as you would with the pandas library (another useful package worth mentioning).
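A minimal sketch of that approach, assuming the file is whitespace-separated with no header row (so the columns get the default integer names 0–19); the file path and histogram range are placeholders:

import dask.dataframe as dd
import dask.array as da

# Lazily describe the 80 GB file; nothing is read into memory yet.
df = dd.read_csv("huge_file.txt", sep=r"\s+", header=None)

# Reductions are computed out of core, chunk by chunk.
mean_col9 = df[8].mean().compute()          # mean of the 9th column
print(mean_col9)

# A fixed-range histogram can be built the same way via dask.array.
counts, edges = da.histogram(df[8].to_dask_array(lengths=True),
                             bins=100, range=(-100, 100))   # assumed value range
counts = counts.compute()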
Upvotes: 3