lazy

Reputation: 784

The in-memory size of data loaded with read_csv differs from the size of the original file

import sys
import pandas as pd

ua_df = pd.read_csv(...)  # the large file; path omitted
print(ua_df)
# output:
ID      classification  years   Gender
347     member          070     female
597     member          050     male

s2 = sys.getsizeof(ua_df)
print(s2)
# 6974117328  (about 6.5G)

# Original file size: 842.1M
# The in-memory size is much larger than the original file.


uad_dff = pd.read_csv(...)  # the smaller file; path omitted
print(uad_dff)
# output:
ID  shopCD  distance
727     27      40.22
942     27      30.76

Under the same conditions:

s3 = sys.getsizeof(uad_dff)
print(s3)
# 12483776  (about 11.9M)

# Original file size: 11.9M
# Here the in-memory size is roughly equal to the original file size.
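
A minimal, self-contained version of the comparison being made (the file names here are placeholders standing in for the two files above, not the actual paths):

import os
import sys
import pandas as pd

# placeholder paths for the two files described above
ua_df = pd.read_csv("ua.csv")       # the 842.1M file with text columns
uad_dff = pd.read_csv("uad.csv")    # the 11.9M file with numeric columns

# compare on-disk size with in-memory size for each file
for path, df in [("ua.csv", ua_df), ("uad.csv", uad_dff)]:
    print(path, os.path.getsize(path), sys.getsizeof(df))

# per-column breakdown of the in-memory size; deep=True also counts
# the Python objects behind object-dtype columns
print(ua_df.memory_usage(deep=True))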

Why is the original file so much smaller than the data read into memory in the first case, while there is no difference in the second? Can anyone tell me why? Thank you very much!

Upvotes: 2

Views: 1562

Answers (1)

juanpa.arrivillaga

Reputation: 95948

Consider item = ua_df['years'].iat[0] (note that .iat takes integer positions only, so the column has to be selected first). item is a str object of length 3. In a csv file on disk, which is just text, it takes 3 bytes to represent. In memory, since pandas uses object dtype to store it, the DataFrame holds a pointer to a str object, and the pointer alone requires a machine word, i.e. 8 bytes on a 64-bit architecture. The Python str object itself, on my machine, takes sys.getsizeof(item) == 54 bytes. So in memory you need a total of 62 bytes to represent the same data that was stored as text in 3 bytes.
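
A quick way to see that overhead directly (a sketch; the exact byte counts vary with the Python version and platform):

import sys
import pandas as pd

item = "070"                 # the 3-character 'years' value
print(sys.getsizeof(item))   # ~52-54 bytes for the str object alone,
                             # on top of the 8-byte pointer in the column

# memory_usage(deep=True) follows the pointers and counts the str
# objects too, revealing the true cost of an object-dtype column
df = pd.DataFrame({"years": ["070", "050"]})
print(df["years"].dtype)             # object
print(df.memory_usage(deep=True))    # far more than 3 bytes per value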

The sort of size discrepancy you are seeing here is not unexpected.

Now consider storing numeric types. Pandas will likely use np.int64 or np.float64, both of which require 8 bytes per value. But what if all your numbers are only 2-3 digits? Then each one requires only 2-3 bytes to represent as text on disk. So which side is larger depends on the average number of decimal digits needed to store the values as text; it can be more or less than the uniform 8 bytes per numeric value in memory.
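
For illustration, a small sketch contrasting the fixed 8 bytes per int64 value in memory with the variable cost of the same numbers written as text:

import pandas as pd

s = pd.Series([12, 345, 67], dtype="int64")
print(s.memory_usage(index=False))   # 3 values * 8 bytes = 24 bytes

# the same numbers as CSV text: one byte per digit plus newline separators
text = "\n".join(str(v) for v in s)
print(len(text.encode()))            # "12\n345\n67" -> 9 bytes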

Upvotes: 4
