Reputation: 199
I'm having issues with Pandas taking up way too much RAM. I've got a 5.5 GB file with 2 columns, and I simply want to save all the unique values of the first column, like so:
Main File
    Follower     Friend
0         12  260009730
1         12   17568791
2         12   22512883
3         12   15808761
4         12   10135072
5         12        988
6         12   22424855
7         13    9163182
8         14   22990962
9         15    7681662
10        15   17289517
to
Result File
   User
0    12
1    13
2    14
3    15
Because of RAM limitations I'm importing the main file in 30 pieces, trying to purge the DataFrames from memory and only appending to the result file each time. After two iterations (out of thirty) the result file is 13.5 MB, but the script consistently crashes after the 6th iteration, and I can see in my process manager that Python is taking up 4.5 GB of RAM. I'm trying to call the garbage collector, but apparently it's not working. Can you help me out? My code is as follows:
i = 0
userRelation = pd.DataFrame(columns=['User'])
Location = 'file.txt'
while i < 30:
    userRelationHelp = pd.DataFrame(columns=['User'])
    print(str(i))
    network = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                          low_memory=False,
                          skiprows=(i * math.ceil(284884514/30)),
                          nrows=(((i+1) * math.ceil(284884514/30))),
                          names=['Follower', 'Friend'])
    userRelationHelp['User'] = network['Follower'].unique()
    userRelation = userRelation.append(userRelationHelp)
    lst = [userRelationHelp, network]
    del lst
    gc.collect()
    i += 1
From what I've read, the last 3 lines before i += 1 should serve to purge the larger DataFrames from memory. After each iteration I can see the RAM used at the start of the cycle steadily increasing by ~200 MB, and during the cycle it increases by more each run.
Base Python RAM usage before running the above code: 76 MB
Approximate Python RAM usage at the start of each cycle (MB):
0: 300
1: 800
2: 1000
3: 1300
Approximate Python RAM usage at the end of each cycle (MB):
0: 633
1: 2000
2: 2900
3: 3700
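(The numbers above are read off the OS process manager; a minimal sketch for logging the same figure from inside the loop, assuming the third-party psutil package is available, would be:)
import os
import psutil  # third-party package (assumed available), not part of the code above

proc = psutil.Process(os.getpid())
# resident set size of the current Python process, in MB
print('RSS: %.0f MB' % (proc.memory_info().rss / 1024 ** 2))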
Can anyone point out what I'm doing or assuming incorrectly?
Upvotes: 1
Views: 761
Reputation: 210812
@ypnos gave you a perfect description of how it should be done in the Pandaic way.
Here is my attempt to code it:
Generate sample data and write it to CSV:
import numpy as np
import pandas as pd

fn = r'c:/temp/data.csv'
pd.DataFrame(np.random.randint(1001, 9999, (10**5, 2))) \
  .to_csv(fn, sep='\t', index=False, header=None)
Process only the column from the CSV that we need:
chunksize = 10**2   # you may want to use 10**7 as a chunk size
reader = pd.read_csv(fn, sep='\t', usecols=[0], names=['Follower'],
                     chunksize=chunksize, squeeze=True)
df = pd.DataFrame(np.unique(np.concatenate([ser.unique() for ser in reader])),
                  columns=['User'])
Result:
In [62]: df
Out[62]:
      User
0     1001
1     1002
2     1003
3     1004
4     1005
5     1006
6     1007
7     1008
8     1009
9     1010
...    ...
8988  9989
8989  9990
8990  9991
8991  9992
8992  9993
8993  9994
8994  9995
8995  9996
8996  9997
8997  9998

[8998 rows x 1 columns]
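A note for readers on newer pandas: the squeeze argument of read_csv was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same idea can be written by squeezing each chunk instead. A minimal sketch, assuming pandas 2.x and the same sample file as above:
import numpy as np
import pandas as pd

fn = r'c:/temp/data.csv'
reader = pd.read_csv(fn, sep='\t', usecols=[0], names=['Follower'],
                     chunksize=10**2)

# each chunk is a one-column DataFrame; .squeeze('columns') turns it into a Series
uniques = np.unique(np.concatenate([chunk.squeeze('columns').unique()
                                    for chunk in reader]))
df = pd.DataFrame(uniques, columns=['User'])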
Upvotes: 3
Reputation: 52317
Your way of splitting the huge file is extremely inefficient. When you use the skiprows parameter, the reader needs to go through the file line by line, counting line-ending characters, until the requested number of rows has passed. So in your final iteration the whole file is read again, after you have already crunched through most of it multiple times.
Note in passing that your usage of nrows looks wrong to me, too: I would expect nrows=math.ceil(284884514/30). I believe this is the explanation for your memory problem.
What you need instead is an iterative way of running through the file in a single pass, which is provided by the chunksize parameter. The use of chunksize is very well illustrated in the Pandas IO Tools docs.
Illustrative code example:
userRelation = pd.DataFrame(columns=['User'])
Location = 'file.txt'
chunksize = math.ceil(284884514/30)
reader = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                     low_memory=False, chunksize=chunksize,
                     names=['Follower', 'Friend'])
for network in reader:
    userRelationHelp = pd.DataFrame(columns=['User'])
    userRelationHelp['User'] = network['Follower'].unique()
    userRelation = userRelation.append(userRelationHelp)
    lst = [userRelationHelp, network]
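Two caveats worth adding to this sketch (my notes, not part of the original answer): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, and the per-chunk uniques can still overlap between chunks, so a final de-duplication pass is needed. A minimal equivalent on current pandas, under those assumptions:
import math
import pandas as pd

Location = 'file.txt'
chunksize = math.ceil(284884514 / 30)
reader = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                     chunksize=chunksize, names=['Follower', 'Friend'])

# collect each chunk's unique followers, then de-duplicate across chunks
parts = [pd.Series(chunk['Follower'].unique()) for chunk in reader]
userRelation = (pd.concat(parts, ignore_index=True)
                  .drop_duplicates()
                  .to_frame(name='User')
                  .reset_index(drop=True))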
Upvotes: 2