Reputation: 91
I have two CSV files: file1 is 594,8 MB and file2 is 1,0 GB.
But when I write
import sys
import pandas as pd

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
print(sys.getsizeof(df1))
print(sys.getsizeof(df2))
I get:
457048830
460467614
Why is the size of the DataFrame so different from the size of the CSV file?
And why is the ratio between the file sizes (594,8 MB and 1,0 GB) not the same as the ratio between 457048830 and 460467614? (Or, if it is the same, what is the relationship?)
Upvotes: 1
Views: 88
Reputation: 117771
A CSV file encodes numbers as text, separated by commas. That is, a 10-digit number takes up 10 bytes of data, plus one byte for its separator. This means that, depending on how many digits the numbers have, n numbers could take up anywhere from 2n bytes (one digit and one separator each) to an arbitrarily large amount.
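To make the byte counting concrete, here is a minimal sketch using only the standard library. The helper name csv_row_bytes and the sample values are made up for illustration; it simply measures how many bytes a list of numbers occupies when written as one comma-separated line.

```python
def csv_row_bytes(numbers):
    """Bytes needed to store the numbers as one comma-separated text line."""
    line = ",".join(str(x) for x in numbers) + "\n"
    return len(line.encode("utf-8"))

small = [7] * 1000            # 1000 one-digit values
large = [1234567890] * 1000   # 1000 ten-digit values

small_bytes = csv_row_bytes(small)   # 1000 digits + 999 commas + 1 newline
large_bytes = csv_row_bytes(large)   # 10000 digits + 999 commas + 1 newline
print(small_bytes, large_bytes)      # 2000 11000
```

The count of numbers is identical in both cases; only the amount of text per number differs.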
A DataFrame parses the text into fixed-width binary numbers, which are (generally) stored more compactly. pandas defaults to 64-bit integers and floats, so every number takes 8 bytes however many digits it had in the file; with a 32-bit type every number would take 4 bytes.
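As a rough check of the fixed-width storage, the sketch below loads a tiny in-memory CSV (a made-up stand-in for file1 and file2) and inspects the column types and per-column memory. Note that read_csv gives numeric columns 64-bit types (8 bytes per value) unless told otherwise; converting to a 32-bit type gives the 4 bytes per value mentioned above, assuming the values fit.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the real files are not available here.
csv_text = "a,b\n1,0.5\n1234567890,1234.56789\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)  # column a is parsed as int64, column b as float64

# Bytes used by each column's values (index excluded).
per_column = df.memory_usage(index=False)
print(per_column / len(df))  # 8 bytes per value in both columns

# A 32-bit type halves the storage, at the cost of range/precision.
a32 = df["a"].astype("int32")
print(a32.memory_usage(index=False))  # 2 values * 4 bytes
```

The one-digit value and the ten-digit value cost the same number of bytes once loaded, which is why the in-memory size tracks the count of numbers rather than the length of the text.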
From the above I would expect that file1 and file2 contain roughly the same number of values, but that the values in file2 (generally) require more text to represent.
E.g. two files containing 1, 2, 3, 4, ..., 100 and 1.0001, 1.0002, 1.0003, ..., 1.0100 both contain 100 numbers, and will be roughly the same size in memory once loaded. However, when saved as a textual CSV, the latter will be much bigger.
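The two-file example can be tried directly. This is only a sketch: the helper csv_and_memory_bytes and the column name x are invented for illustration, and the CSV is kept in memory rather than written to disk.

```python
import io
import pandas as pd

def csv_and_memory_bytes(values):
    """Return (size as CSV text, size of the column data once loaded), in bytes."""
    text = pd.DataFrame({"x": list(values)}).to_csv(index=False)
    loaded = pd.read_csv(io.StringIO(text))
    return len(text.encode("utf-8")), int(loaded["x"].memory_usage(index=False))

# 1, 2, ..., 100
int_csv, int_mem = csv_and_memory_bytes(range(1, 101))
# 1.0001, 1.0002, ..., 1.0100
dec_csv, dec_mem = csv_and_memory_bytes(round(1 + i / 10000, 4) for i in range(1, 101))

print("CSV bytes:   ", int_csv, dec_csv)   # the decimal file is noticeably larger
print("memory bytes:", int_mem, dec_mem)   # both are 100 values * 8 bytes
```

So the ratio between CSV sizes says more about how many characters each value needs than about how many values there are, while the DataFrame size mostly reflects the number of values and their dtypes.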
Upvotes: 1