Reputation: 361
"We have the 2015 Yellow Cab NYC Taxi data as 12 CSV files on S3... This data is about 20GB on disk or 60GB in RAM."
I came across this observation while trying out Dask, a Python framework for handling out-of-memory datasets.
Can someone explain why there is a 3x difference? I imagine it has to do with Python objects, but I'm not 100% sure.
Thanks!
Upvotes: 0
Views: 54
Reputation: 1779
You are reading a CSV from disk into a structured data frame object in memory. The two things are not analogous at all. The CSV on disk is a single flat string of text; the data in memory is a complex data structure with multiple data types, internal pointers, indexes, and so on.
The CSV itself is not taking up any RAM. What takes up RAM is a complex data structure that was populated using data sourced from the CSV on disk, and each parsed value can cost far more than its text representation: a five-character field is 5 bytes in the file, but as a Python string object in an object-dtype column it carries roughly 50 extra bytes of interpreter overhead.
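A quick way to see that per-value overhead (exact byte counts vary by CPython version and build, so treat these as ballpark figures):

import sys

# Five characters cost 5 bytes in the CSV file...
print(len("12345"))            # 5

# ...but much more once materialized as Python objects
print(sys.getsizeof("12345"))  # ~54 bytes for a str (CPython)
print(sys.getsizeof(12345))    # ~28 bytes for an int (CPython)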
To see this at the scale of the whole file, you could read the CSV into a single string variable and check how much memory that consumes. That would effectively be the entire CSV, as-is, in memory:
# Read the whole CSV file into one Python string
with open('data.csv', 'r') as csvFile:
    data = csvFile.read()
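If the file is small enough to hold in memory, you could go a step further and compare that string's footprint against the parsed data frame (a minimal sketch; memory_usage(deep=True) tells pandas to count the Python string objects inside object-dtype columns):

import sys
import pandas as pd

with open('data.csv', 'r') as csvFile:
    data = csvFile.read()
# Roughly the on-disk size of the file
print(sys.getsizeof(data))

df = pd.read_csv('data.csv')
# Typically several times larger than the raw string above
print(df.memory_usage(deep=True).sum())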
Upvotes: 1