Reputation: 3324
I have a large dataset of texts and their corresponding labels. I used to read CSV files with the csv module and then build numpy arrays from that data, until I found out that holding large text arrays in numpy is memory inefficient.
import csv
import numpy as np

with open('sample.csv', 'r') as f:
    data = csv.reader(f.readlines())
    texts = np.array([d[0] for d in data])  # fixed-width unicode array
This takes about 13 GB of memory. But when pandas reads the very same data, it's as if nothing were loaded at all. And I don't mean 50% less memory usage, or even 20% less: it takes just 300 MB.
import pandas as pd

data = pd.read_csv('sample.csv')
texts2 = np.array(data['text'])  # object array of Python strings
The only difference between the texts and texts2 arrays is the dtype:
>>> texts.dtype
dtype('<U92569')
>>> texts2.dtype
dtype('O')
Upvotes: 1
Views: 86
Reputation: 280564
Your first array is using a NumPy string dtype. Those are fixed-width, so every element of the array takes as much space as the longest string in the array, and one of the strings is 92569 characters long, driving up the space requirements for the shorter strings.
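To put numbers on that: with dtype <U92569, every element reserves 92569 * 4 = 370,276 bytes, so on the order of 35,000 rows is already enough to reach roughly 13 GB. A minimal sketch with made-up sample data shows the fixed-width blow-up:

import numpy as np

# One long string forces every element to reserve room for 92569 characters.
sample = [f"text {i}" for i in range(20)] + ["x" * 92569]
fixed = np.array(sample)

print(fixed.dtype)     # dtype('<U92569')
print(fixed.itemsize)  # 370276 bytes per element (92569 characters * 4 bytes)
print(fixed.nbytes)    # itemsize * 21 elements, about 7.8 MB for 21 strings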
Your second array is using object dtype. That just holds references to a bunch of regular Python objects, so each element is a regular Python string object. There's additional per-element object overhead, but each string only needs enough room to hold its own data, instead of enough room to hold the biggest string in the array.
Also, NumPy unicode dtypes always use 4 bytes per character, while Python string objects use less if the string doesn't contain any high code points.
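As a rough illustration (the exact getsizeof figures depend on the Python build), the object array itself only stores pointers, each string sizes itself independently, and you can get the same compact layout without pandas by passing dtype=object to np.array:

import sys
import numpy as np

sample = [f"text {i}" for i in range(20)] + ["x" * 92569]
obj = np.array(sample, dtype=object)

print(obj.dtype)   # dtype('O')
print(obj.nbytes)  # pointers only: 8 bytes per element on a 64-bit build

# Each Python string takes only what it needs (1 byte per character here,
# since the sample strings are plain ASCII), plus per-object overhead.
print(sum(sys.getsizeof(s) for s in obj))

That total comes to well under 100 KB for the same 21 strings that need about 7.8 MB as a fixed-width array.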
Upvotes: 5