Devesh Shukla

Reputation: 43

Golang: reading a CSV consumes more than 2x the space in memory that it does on disk

I am loading a lot of CSV files into a struct using Golang. The struct is

type csvData struct {
    Index   []time.Time
    Columns map[string][]float64
}    

I have a parser that uses:

csv.NewReader(file).ReadAll()

Then I iterate over the rows, and convert the values into their types: time.Time or float64.

The problem is that on disk these files consume 5GB of space. Once I load them into memory, they consume 12GB!

I used ioutil.ReadFile(path) and found that this was, as expected, almost exactly the on-disk size.

Here is the code for my parser, with error handling omitted for readability. Could you help me troubleshoot?

var inMemoryRepo = make([]csvData, 0)

func LoadCSVIntoMemory(path string) {
    parsedData := csvData{make([]time.Time, 0), make(map[string][]float64)}
    file, _ := os.Open(path)
    defer file.Close()
    reader := csv.NewReader(file)
    columnNames, _ := reader.Read()   // header row with the column names
    columnData, _ := reader.ReadAll() // all remaining rows, loaded at once
    for _, row := range columnData {
        parsedData.Index = append(parsedData.Index, parseTime(row[0])) // parseTime is a simple wrapper for time.Parse
        for i := range row[1:] {                                       // parse non-index numeric columns
            parsedData.Columns[columnNames[i]] = append(parsedData.Columns[columnNames[i]], parseFloat(row[i+1])) // parseFloat is a wrapper for strconv.ParseFloat
        }
    }
    inMemoryRepo = append(inMemoryRepo, parsedData)
}

I tried troubleshooting by setting columnData and reader to nil at the end of the function, but that made no difference.

Upvotes: 1

Views: 1788

Answers (2)

Fulldump

Reputation: 883

Reading an entire file into memory is rarely a good idea.

What if your CSV is 100 GiB?

If your transformation does not need to look at several records at once, you could apply the following algorithm (a Go sketch follows the pseudocode):

open csv_reader (source file)
open csv_writer (destination file)
for row in csv_reader
    transform row
    write row into csv_writer
close csv_reader and csv_writer
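
A minimal Go sketch of that streaming approach, assuming a hypothetical transformRow helper and hypothetical file names; adapt it to whatever conversion your data actually needs:

package main

import (
    "encoding/csv"
    "io"
    "log"
    "os"
)

// transformRow is a hypothetical per-record transformation; replace it with
// the conversion you actually need.
func transformRow(row []string) []string {
    return row
}

func main() {
    src, err := os.Open("source.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer src.Close()

    dst, err := os.Create("destination.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer dst.Close()

    reader := csv.NewReader(src)
    writer := csv.NewWriter(dst)
    defer writer.Flush()

    // Stream one record at a time instead of calling ReadAll,
    // so memory usage stays bounded regardless of the file size.
    for {
        row, err := reader.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        if err := writer.Write(transformRow(row)); err != nil {
            log.Fatal(err)
        }
    }
}

Only one record is held in memory at any moment, so the process uses roughly constant memory no matter how large the CSV is.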

Upvotes: 0

icza

Reputation: 417787

There is nothing surprising in this. On your disk there are just the characters (bytes) of your CSV text. When you load them into memory, you create data structures from your text.

For example, a float64 value requires 64 bits in memory, that is: 8 bytes. If your input text is "1", that is a single byte. Yet if you create a float64 value equal to 1, it will still consume 8 bytes.

Further, strings are stored with a string header (reflect.StringHeader), which is 2 integer values (16 bytes on 64-bit architectures); this header points to the actual string data. See String memory usage in Golang for details.

Slices are backed by a similar data structure: reflect.SliceHeader. The header consists of 3 integer values, which is 24 bytes on 64-bit architectures, even if the slice has no elements.
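
A quick way to see those sizes for yourself (the numbers below assume a 64-bit architecture):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    var f float64
    var s string
    var sl []float64

    fmt.Println(unsafe.Sizeof(f))  // 8: a float64 always takes 8 bytes, even if its text form was just "1"
    fmt.Println(unsafe.Sizeof(s))  // 16: the string header (pointer + length), excluding the string data itself
    fmt.Println(unsafe.Sizeof(sl)) // 24: the slice header (pointer + length + capacity), excluding the elements
}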

Structs on top of this may have padding (fields must be aligned to certain values), which again adds overhead. For details, see Spec: Size and alignment guarantees.
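
For example, this hypothetical struct holds 9 bytes of data but occupies 16 because of padding:

package main

import (
    "fmt"
    "unsafe"
)

type padded struct {
    Flag  bool    // 1 byte, followed by 7 bytes of padding so Value is 8-byte aligned
    Value float64 // 8 bytes
}

func main() {
    fmt.Println(unsafe.Sizeof(padded{})) // 16 on 64-bit architectures
}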

Go maps are hashmaps, which again have quite some overhead; for details see why slice values can sometimes go stale but never map values?, and for memory usage see How much memory do golang maps reserve?
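
If you want a rough feel for that overhead (not an exact accounting), one way is to compare heap usage before and after building a map similar in shape to your Columns field:

package main

import (
    "fmt"
    "runtime"
)

// heapAlloc returns the live heap size after forcing a garbage collection.
func heapAlloc() uint64 {
    runtime.GC()
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)
    return ms.HeapAlloc
}

func main() {
    before := heapAlloc()

    // 1000 entries whose float64 data totals only ~8 KB;
    // the string keys, slice headers and map buckets add considerably more.
    m := make(map[string][]float64)
    for i := 0; i < 1000; i++ {
        m[fmt.Sprintf("col%d", i)] = []float64{1}
    }

    after := heapAlloc()
    fmt.Println("approx bytes used:", after-before)

    runtime.KeepAlive(m) // keep the map live until after the measurement
}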

Upvotes: 5
