user3394131

Reputation: 199

Pandas Memory Management issue

I'm having issues with Pandas taking up way too much RAM. I've got a 5.5 GB file with 2 columns, and I simply want to save all the unique values of the first column, like so:

Main File
    Follower    Friend
0   12          260009730
1   12          17568791
2   12          22512883
3   12          15808761
4   12          10135072
5   12          988
6   12          22424855
7   13          9163182
8   14          22990962
9   15          7681662
10  15          17289517

to

Result File
     User
0    12
1    13
2    14
3    15

Because of RAM limitations, I'm importing the main file in 30 pieces, trying to purge the dataframes from memory and only appending to the result dataframe each time. After two iterations (out of thirty) the result file is 13.5 MB. But it consistently crashes after the 6th iteration, and I can see in my process manager that Python is taking up 4.5 GB of RAM. I'm trying to call the garbage collector, but apparently it's not working. Can you guys help me out? My code is as follows:

import math
import gc
import pandas as pd

i = 0
userRelation = pd.DataFrame(columns=['User'])
Location = 'file.txt'
while i < 30:
    userRelationHelp = pd.DataFrame(columns=['User'])
    print(str(i))
    network = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                          low_memory=False,
                          skiprows=(i * math.ceil(284884514/30)),
                          nrows=(((i+1) * math.ceil(284884514/30))),
                          names=['Follower', 'Friend'])

    userRelationHelp['User'] = network['Follower'].unique()
    userRelation = userRelation.append(userRelationHelp)
    lst = [userRelationHelp, network]
    del lst
    gc.collect()
    i += 1

From what I've read, the last 3 lines before i += 1 should serve to purge the larger dataframes from memory. After each iteration I can see that the RAM in use at the start of the cycle keeps increasing by ~200 MB, and during the cycle it increases by more with each run.

Base Python RAM usage before running the above code: 76 MB

Approximate Python RAM usage (MB) at the start of each cycle

0: 300
1: 800
2: 1000
3: 1300

Approximate Python RAM usage (MB) at the end of each cycle

0: 633
1: 2000
2: 2900
3: 3700

Can anyone point out what I'm doing or assuming incorrectly?

Upvotes: 1

Views: 761

Answers (2)

MaxU - stand with Ukraine

Reputation: 210812

@ypnos gave you a perfect description of how it should be done the Pandaic way.

Here is my attempt to code it:

Generate sample data and write it to CSV:

import numpy as np
import pandas as pd

fn = r'c:/temp/data.csv'

pd.DataFrame(np.random.randint(1001, 9999, (10**5, 2))) \
  .to_csv(fn, sep='\t', index=False, header=None)

Process only the columns from the CSV that we need:

chunksize=10**2  # you may want to use 10**7 as a chunk size
reader = pd.read_csv(fn, sep='\t', usecols=[0], names=['Follower'], 
                     chunksize=chunksize, squeeze=True)

df = pd.DataFrame(np.unique(np.concatenate([ser.unique() for ser in reader])),
                  columns=['User'])

Result:

In [62]: df
Out[62]:
      User
0     1001
1     1002
2     1003
3     1004
4     1005
5     1006
6     1007
7     1008
8     1009
9     1010
...    ...
8988  9989
8989  9990
8990  9991
8991  9992
8992  9993
8993  9994
8994  9995
8995  9996
8996  9997
8997  9998

[8998 rows x 1 columns]
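
Note: the squeeze keyword of read_csv is deprecated since pandas 1.4 and was removed in pandas 2.0. A roughly equivalent sketch for newer versions (same fn and chunksize as above) reads plain DataFrame chunks and picks the column inside the comprehension:

# Without squeeze each chunk is a one-column DataFrame, so select the
# 'Follower' column explicitly before taking its unique values.
reader = pd.read_csv(fn, sep='\t', usecols=[0], names=['Follower'],
                     chunksize=chunksize)

df = pd.DataFrame(np.unique(np.concatenate([chunk['Follower'].unique()
                                            for chunk in reader])),
                  columns=['User'])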

Upvotes: 3

ypnos

Reputation: 52317

Your way of splitting the huge file is extremely inefficient. When you use the skiprows parameter, the reader has to go through the file line by line, counting line-ending characters, until the requested number of rows has been skipped. So in your final iteration, the whole file is read again, after you have already crunched through most of it multiple times.

Please note in passing that your usage of nrows looks wrong to me, too. I would expect nrows=math.ceil(284884514/30). I believe this is the explanation for your memory problem.
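
For illustration, a hedged sketch of how the slicing in the loop would have to look to read one distinct piece per iteration (piece is a hypothetical name for the per-iteration row count; this would still be slow because of skiprows):

# nrows must be the size of ONE piece, not a running total; skiprows still
# forces the reader to scan all previously consumed lines on every iteration.
piece = math.ceil(284884514 / 30)
network = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                      skiprows=i * piece, nrows=piece,
                      names=['Follower', 'Friend'])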

What you need instead is an iterative way of running through the file in a single pass, which is provided by the chunksize parameter. The use of chunksize is very well illustrated in the Pandas IO Tools docs.

Illustrative code example:

import math
import pandas as pd

userRelation = pd.DataFrame(columns=['User'])
Location = 'file.txt'
chunksize = math.ceil(284884514/30)

reader = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                     low_memory=False, chunksize=chunksize,
                     names=['Follower', 'Friend'])
for network in reader:
    userRelationHelp = pd.DataFrame(columns=['User'])
    userRelationHelp['User'] = network['Follower'].unique()
    userRelation = userRelation.append(userRelationHelp)
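
Two caveats on the sketch above: uniqueness per chunk does not guarantee global uniqueness (the same Follower id can show up in several chunks), and DataFrame.append was removed in pandas 2.0. A minimal alternative, assuming the same reader, collects the per-chunk results and concatenates them once:

# Gather per-chunk unique ids, concatenate once, then drop duplicates that
# span chunk boundaries (pd.concat replaces the removed DataFrame.append).
parts = [pd.DataFrame({'User': chunk['Follower'].unique()}) for chunk in reader]
userRelation = (pd.concat(parts, ignore_index=True)
                  .drop_duplicates(subset='User')
                  .reset_index(drop=True))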

Upvotes: 2
