Why is getsizeof(pandas.DataFrame) different from the size of the file on disk?

Question

I have two CSV files: file1 is 594,8 MB and file2 is 1,0 GB.

But when I run

import sys
import pandas as pd

# file1 and file2 are the paths to the two CSV files
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
print(sys.getsizeof(df1))
print(sys.getsizeof(df2))

I get:

457048830
460467614

Why is the size of the DataFrame so different from the size of the CSV file?

And why is the ratio between the file sizes (594,8 MB and 1,0 GB) not the same as the ratio between 457048830 and 460467614? (Or, if the two are related, what is the relation?)

orlp · Accepted Answer

A CSV file stores numbers as text, separated by commas. That is, a 10-digit number takes up 10 bytes of data, plus one byte for the separator. Depending on how many digits the numbers have, n numbers can therefore take up anywhere from 2n bytes to an arbitrarily large amount.

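A DataFrame parses that text into binary numeric types, which are (generally) stored more compactly and with a fixed width per value. A 32-bit floating-point number, for example, always takes 4 bytes. By default pandas uses 64-bit integers and floats, which take 8 bytes each, regardless of how many digits the value had in the file.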
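As a small sketch of the fixed-width point, the snippet below reads a tiny made-up CSV (the column names and values are purely illustrative, not taken from the question's files) and checks how many bytes each column occupies in memory:

```python
import io
import pandas as pd

# Hypothetical two-row CSV: one short and one long integer, plus two floats.
csv_text = "a,b\n1,1.5\n1234567890,2.25\n"

df = pd.read_csv(io.StringIO(csv_text))

# read_csv infers fixed-width numeric dtypes for purely numeric columns.
print(df.dtypes)

# Bytes used by each column's data (the index is excluded).
per_column = df.memory_usage(index=False)
print(per_column)

# Every value in column "a" occupies the same number of bytes,
# whether it was written as "1" or "1234567890" in the text.
bytes_per_value = df["a"].dtype.itemsize
print(bytes_per_value)
```

In the text, the two values in column a differ by nine characters, but once loaded they take the same space.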

From the above I would expect file1 and file2 to contain roughly the same number of values, with file2 containing numbers that (generally) require more text to represent.

E.g. two files containing 1, 2, 3, 4, ..., 100 and 1.0001, 1.0002, 1.0003, ..., 1.0100 both contain 100 numbers, and take up roughly the same amount of memory once loaded in Python. However, when saved as a textual CSV, the latter will be much bigger.
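The example above can be tried directly. This is a minimal sketch: the helper names csv_bytes and memory_bytes are made up for illustration, and the dtypes are set explicitly so the comparison does not depend on platform defaults.

```python
import pandas as pd

# 1, 2, ..., 100 stored as 64-bit integers.
ints = pd.Series(range(1, 101), dtype="int64")

# 1.0001, 1.0002, ..., 1.0100 stored as 64-bit floats.
floats = pd.Series([round(1 + i / 10000, 4) for i in range(1, 101)], dtype="float64")

def csv_bytes(series):
    """Size in bytes of the series when written out as CSV text."""
    return len(series.to_csv(index=False, header=False).encode("utf-8"))

def memory_bytes(series):
    """Size in bytes of the series' values in memory (index excluded)."""
    return series.memory_usage(index=False)

print("in memory:", memory_bytes(ints), memory_bytes(floats))
print("as CSV:   ", csv_bytes(ints), csv_bytes(floats))
```

Both series take the same space in memory, while the CSV text for the second is noticeably larger. That is the same pattern as two DataFrames of nearly equal size coming from files of very different size.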
