Reputation: 784
print(ua_df)
# output:
    ID classification years  Gender
0  347         member    070  female
1  597         member    050    male
import sys

s2 = sys.getsizeof(ua_df)
print(s2)
# 6974117328 (about 6.5 GB)
# Original file size: 842.1 MB
# The in-memory size is much larger than the original file
print(uad_dff)
# output:
    ID  shopCD  distance
0  727      27     40.22
1  942      27     30.76
Under the same conditions, the second DataFrame gives:
s3 = sys.getsizeof(uad_dff)
print(s3)
# 12483776 (about 11.9 MB)
# Original file size: 11.9 MB
# The in-memory size equals the original file size
Why is the original file so much smaller than the data read into memory in the first example, while there is no difference in the second? Can anyone tell me why? Thank you very much!
Upvotes: 2
Views: 1562
Reputation: 95948
Consider item = ua_df.at[0, 'years'] (note: .at rather than .iat, since 'years' is a column label, not an integer position). item is a str object of length 3. In a csv file on disk, which is just text, it takes 3 bytes to represent. In memory, since pandas stores it with object dtype, it requires a pointer to a str object, and the pointer alone takes a machine word, 8 bytes on a 64-bit architecture. The Python str object itself, on my machine, takes sys.getsizeof(item) == 54 bytes. So in memory you need 62 bytes in total to represent the same data that was stored as text in 3 bytes.
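You can see both overheads directly. A minimal sketch, assuming the 'years' column was read as strings; ua_df here is a stand-in rebuilt from your printed output, and the exact getsizeof figure varies with the Python version:

import sys

import pandas as pd

# Hypothetical stand-in for ua_df, reconstructed from the question
ua_df = pd.DataFrame({"years": ["070", "050"]})

item = ua_df.at[0, "years"]
print(type(item))            # <class 'str'>
print(sys.getsizeof(item))   # ~50-54 bytes: str object header + 3-byte payload

# pandas can report the column cost with and without the pointed-to objects
print(ua_df["years"].memory_usage(index=False, deep=True))   # pointers + str objects
print(ua_df["years"].memory_usage(index=False, deep=False))  # pointers only, 8 bytes each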
The sort of size discrepancy you are seeing here is not unexpected.
Now consider storing numeric types. Pandas will likely use np.int64 or np.float64, both of which take 8 bytes per value. But what if all your numbers have only 2-3 digits? Then each one takes only 2-3 bytes to represent as text on disk. So the comparison depends on the average number of decimal digits needed to store each value as text, which could be more or less than the uniform 8 bytes per numeric value in memory.
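A quick sanity check of that trade-off, with a made-up column of 2-3 digit integers:

import numpy as np
import pandas as pd

# One million hypothetical 2-3 digit integers
s = pd.Series(np.random.randint(10, 1000, size=1_000_000), dtype=np.int64)

in_memory = s.memory_usage(index=False, deep=True)  # 8 bytes per value
# On disk as csv text: the digits plus one separator byte per value
as_text = int(s.astype(str).str.len().add(1).sum())

print(in_memory)  # 8000000
print(as_text)    # ~3900000, so here the text file is the smaller representation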
Upvotes: 4