ay-ay-ron

Reputation: 302

pandas .drop() memory error large file

For reference, this is all on a 64-bit Windows 7 machine, in PyCharm Educational Edition 1.0.1, with Python 3.4.2 and pandas 0.16.1.

I have an ~791MB .csv file with ~3.04 million rows x 24 columns. The file contains liquor sales data for the state of Iowa from January 2014 to February 2015. If you are interested, the file can be found here: https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy.

One of the columns, titled STORE LOCATION, holds the store's address, including its latitude and longitude. The purpose of the program below is to pull the latitude and longitude out of each STORE LOCATION cell and place each in its own column. When the file is cut down to ~1.04 million rows, my program works properly.

import pandas as pd

#import the original file
sales = pd.read_csv('Iowa_Liquor_Sales.csv', header=0)

#transfer copies of the column
lat = sales['STORE LOCATION']
lon = sales['STORE LOCATION']

#separate the latitude and longitude from each cell into their own list
hold = [i.split('(', 1)[1] for i in lat]
lat2 = [i.split(',', 1)[0] for i in hold]
lon2 = [i.split(',', 1)[1] for i in hold]
lon2 = [i.split(')', 1)[0] for i in lon2]

#put the now separate latitude and longitude back into their own columns
sales['LATITUDE'] = lat2
sales['LONGITUDE'] = lon2

#drop the store location column
sales = sales.drop(['STORE LOCATION'], axis=1)

#export the new pandas DataFrame into a new file
sales.to_csv('liquor_data2.csv')

However, when I try to run the code on the full 3.04-million-row file, it gives me this error:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1595, in drop
    dropped = self.reindex(**{axis_name: new_axis})
  File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 2505, in reindex
    **kwargs)
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1751, in reindex
    self._consolidate_inplace()
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2132, in _consolidate_inplace
    self._data = self._protect_consolidate(f)
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2125, in _protect_consolidate
    result = f()
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 2131, in <lambda>
    f = lambda: self._data.consolidate()
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2833, in consolidate
    bm._consolidate_inplace()
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2838, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3817, in _consolidate
    _can_consolidate=_can_consolidate)
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3840, in _merge_blocks
    new_values = _vstack([b.values for b in blocks], dtype)
  File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 3870, in _vstack
    return np.vstack(to_stack)
  File "C:\Python34\lib\site-packages\numpy\core\shape_base.py", line 228, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError

I tried running the code line by line in the Python console and found that the error occurs when the sales = sales.drop(['STORE LOCATION'], axis=1) line runs.

I have searched for similar issues elsewhere and the only answer I have come up with is chunking the file as it is read by the program, like this:

#import the original file
df = pd.read_csv('Iowa_Liquor_Sales7.csv', header=0, chunksize=chunksize)
sales = pd.concat(df, ignore_index=True)

My only problem with that is then I get this error:

Traceback (most recent call last):
  File "C:/Users/Aaron/PycharmProjects/DATA/Liquor_Reasign_Pd.py", line 14, in <module>
    lat = sales['STORE LOCATION']
TypeError: 'TextFileReader' object is not subscriptable

My google-foo is all foo'd out. Anyone know what to do?

UPDATE: I should specify that with the chunking method, the error occurs when the program tries to duplicate the STORE LOCATION column.

Upvotes: 1

Views: 4711

Answers (1)

ay-ay-ron

Reputation: 302

So I found an answer to my issue: I ran the program under Python 2.7 instead of Python 3.4. The only change I made was deleting the unused lon = sales['STORE LOCATION'] line. I don't know whether 2.7 simply handles the memory differently, or whether I had improperly installed the pandas package in 3.4. I will reinstall pandas in 3.4 to see if that was the problem, but if anyone else has a similar issue, try your program in 2.7.
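For anyone who still wants the chunked route: the TypeError in the question happens because passing chunksize makes read_csv return a TextFileReader (an iterator of DataFrames), not a DataFrame, so it can't be indexed by column. A sketch of a chunked version of the same transformation (the tiny demo file just makes the sketch self-contained; file names and chunk size are placeholders, and the split logic assumes every cell contains a "(lat, lon)" pair like the question's data):

```python
import pandas as pd

# Tiny stand-in for the real 791 MB file so the sketch is runnable.
demo = pd.DataFrame({
    'STORE LOCATION': ['123 Main St, Des Moines (41.6, -93.6)',
                       '456 Oak Ave, Ames (42.0, -93.6)']
})
demo.to_csv('demo_sales.csv', index=False)

# chunksize makes read_csv return a TextFileReader -- an iterator of
# DataFrames -- so we loop over it instead of indexing it. Each chunk
# is transformed and appended to the output file, keeping only one
# chunk in memory at a time.
first = True
for chunk in pd.read_csv('demo_sales.csv', chunksize=1):
    # everything after the '(' in each cell, e.g. '41.6, -93.6)'
    hold = chunk['STORE LOCATION'].str.split('(', n=1).str[1]
    chunk['LATITUDE'] = hold.str.split(',', n=1).str[0]
    chunk['LONGITUDE'] = (hold.str.split(',', n=1).str[1]
                              .str.split(')', n=1).str[0]
                              .str.strip())
    chunk = chunk.drop(['STORE LOCATION'], axis=1)
    # write the header only once, then append
    chunk.to_csv('liquor_data2.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False
```

With the real file you would use the original file name and a chunk size in the tens or hundreds of thousands of rows.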

UPDATE: Realized that I was running 32-bit Python on a 64-bit machine. A 32-bit process only gets about 2 GB of addressable memory, which the consolidation copy inside drop() blew past. I upgraded to 64-bit Python and it runs without memory errors now.
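A lower-memory variant I did not try, but which should help either way: the list comprehensions build several full-length Python lists alongside the DataFrame. A single vectorized str.extract pass avoids those intermediates (sketch only; the regex is my own and assumes each cell ends with a "(lat, lon)" pair, with unmatched rows becoming NaN instead of raising IndexError):

```python
import pandas as pd

# Small stand-in frame; the real file has 24 columns and ~3 million rows.
sales = pd.DataFrame({
    'STORE LOCATION': ['123 Main St, Des Moines (41.6, -93.6)',
                       '456 Oak Ave, Ames (42.0, -93.6)']
})

# One regex pass: group 1 captures everything between '(' and the comma,
# group 2 everything up to ')'. str.extract returns a DataFrame with one
# column per capture group.
coords = sales['STORE LOCATION'].str.extract(r'\(([^,]+),\s*([^)]+)\)')
sales['LATITUDE'] = coords[0]
sales['LONGITUDE'] = coords[1]
sales = sales.drop(['STORE LOCATION'], axis=1)
```

The extracted values are still strings; a pd.to_numeric call on the two new columns would convert them if numeric types are needed downstream.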

Upvotes: 3
