Reputation: 43
I am loading a lot of CSV files into a struct in Go. The struct is:
type csvData struct {
Index []time.Time
Columns map[string][]float64
}
I have a parser that uses csv.NewReader(file).ReadAll(), then iterates over the rows and converts each value into its type: time.Time or float64.
The problem is that on disk these files consume 5GB of space, but once I load them into memory they consume 12GB!
I used ioutil.ReadFile(path) and found that the raw file contents were, as expected, almost exactly the on-disk size.
Here is the code for my parser, with error handling omitted for readability. Could you help me troubleshoot?
var inMemoryRepo = make([]csvData, 0)
func LoadCSVIntoMemory(path string) {
parsedData := csvData{make([]time.Time, 0), make(map[string][]float64)}
file, _ := os.Open(path)
reader := csv.NewReader(file)
columnNames, _ := reader.Read()
columnData, _ := reader.ReadAll()
for _, row := range columnData {
parsedData.Index = append(parsedData.Index, parseTime(row[0])) //parseTime is a simple wrapper for time.Parse
for i := range row[1:] { //parse non-index numeric columns
parsedData.Columns[columnNames[i+1]] = append(parsedData.Columns[columnNames[i+1]], parseFloat(row[i+1])) //parseFloat is a simple wrapper for strconv.ParseFloat
}
}
inMemoryRepo = append(inMemoryRepo, parsedData)
}
I tried troubleshooting by setting columnData and reader to nil at the end of the function, but there was no change.
Upvotes: 1
Views: 1788
Reputation: 883
Reading an entire file into memory is rarely a good idea.
What if your csv is 100GiB?
If your transformation does not involve several records, maybe you could apply the following algorithm:
open csv_reader (source file)
open csv_writer (destination file)
for row in csv_reader
    transform row
    write row into csv_writer
close csv_reader and csv_writer
Upvotes: 0
Reputation: 417787
There is nothing surprising in this. On your disk there are just the characters (bytes) of your CSV text. When you load them into memory, you create data structures from your text.
For example, a float64 value requires 64 bits in memory, that is: 8 bytes. If you have an input text "1", that is 1 single byte. Yet, if you create a float64 value equal to 1, that will still consume 8 bytes.
Further, strings are stored with a string header (reflect.StringHeader) which is 2 integer values (16 bytes on 64-bit architectures), and this header points to the actual string data. See String memory usage in Golang for details.
Slices are similar data structures: reflect.SliceHeader. The header consists of 3 integer values, which again is 24 bytes on 64-bit architectures, even if there are no elements in the slice.
Structs on top of this may have padding (fields must be aligned to certain values), which again adds overhead. For details, see Spec: Size and alignment guarantees.
Go maps are hashmaps, which again have quite some overhead; for details see why slice values can sometimes go stale but never map values?, and for memory usage see How much memory do golang maps reserve?
Upvotes: 5