murphy60

Reputation: 13

What is a memory efficient way to read a CSV file?

My program reads CSV files using the csv crate into a Vec<Vec<String>>, where the outer vector represents rows and each inner vector holds that row's columns.

use std::{time, thread::{sleep, park}};
use csv;

fn main() {
    different_scope();

    println!("Parked");
    park();
}

fn different_scope() {
    println!("Reading csv");
    let _data = read_csv("data.csv");

    println!("Sleeping");
    sleep(time::Duration::from_secs(4));

    println!("Going out of scope");
}

fn read_csv(path: &str) -> Vec<Vec<String>> {
    let mut rdr = csv::Reader::from_path(path).unwrap();

    rdr.records()
        .map(|row| {
            row.unwrap()
                .iter()
                .map(|column| column.to_string())
                .collect()
        })
        .collect()
}

I'm looking at RAM usage with htop and this uses 2.5GB of memory to read a 250MB CSV file.

Here are the contents of cat /proc/<my pid>/status:

Name:   (name)
Umask:  0002
State:  S (sleeping)
Tgid:   18349
Ngid:   0
Pid:    18349
PPid:   18311
TracerPid:  0
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
FDSize: 256
Groups: 4 24 27 30 46 118 128 133 1000 
NStgid: 18349
NSpid:  18349
NSpgid: 18349
NSsid:  18311
VmPeak:  2748152 kB
VmSize:  2354932 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   2580156 kB
VmRSS:   2345944 kB
RssAnon:     2343900 kB
RssFile:        2044 kB
RssShmem:          0 kB
VmData:  2343884 kB
VmStk:       136 kB
VmExe:       304 kB
VmLib:      2332 kB
VmPTE:      4648 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
Threads:    1
SigQ:   0/127783
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180000440
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp:    0
Speculation_Store_Bypass:   thread vulnerable
Cpus_allowed:   ffffffff
Cpus_allowed_list:  0-31
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    9
nonvoluntary_ctxt_switches: 293

When I drop the variable, it frees the correct amount (approx. 250MB), but there's still 2.2GB left. I'm unable to read more than 2-3GB before all my memory is used and the process is killed (cargo prints "Killed").

How do I free the excess memory while the CSV is being read?

I need to process every line. In this particular case I don't need to hold all of the data at once, but what if I did?

I asked a related question and was pointed to "What is Rust strategy to uncommit and return memory to the operating system?", which was helpful in understanding the problem, but I still don't know how to solve it.

My understanding is that I should switch my program to a different memory allocator, but brute-forcing through every allocator I can find feels like an ignorant approach.

Upvotes: 1

Views: 1573

Answers (1)

BurntSushi5

Reputation: 15344

For questions about memory, it's good to develop a technique for quantifying your memory usage. You can do this by examining your representation. In this case, that's Vec<Vec<String>>. In particular, if you have a 250MB CSV file represented as a sequence of sequences of fields, then it is not necessarily the case that you'll only use 250MB of memory. You need to account for the overhead of your representation.

For a Vec<Vec<String>>, we can dismiss the overhead of the outer Vec<...> itself: there is only one of it, and its header will (in your program) be on the stack. It is the inner Vec<String>s, and the Strings inside them, that add up, and those are on the heap.

So if your CSV file has M records and each record has N fields, then there will be M instances of Vec<String> and M * N instances of String. The overhead of both a Vec<T> and a String is 3 * sizeof(word), with one word being the pointer to the data, another word being the length and yet another being the capacity. (That's 24 bytes for a 64-bit target.) So your total overhead for a 64-bit target is (M * 24) + (M * N * 24).
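If you want to check those header sizes yourself, std::mem::size_of makes it a one-liner. (The M and N below are made-up numbers, purely to show the formula at work.)

use std::mem::size_of;

fn main() {
    // Both headers are three words: pointer, length, capacity.
    // That's 24 bytes each on a 64-bit target.
    println!("Vec<String> header: {} bytes", size_of::<Vec<String>>());
    println!("String header:      {} bytes", size_of::<String>());

    // Made-up M records with N fields each, to show the formula.
    let (m, n): (u64, u64) = (1_000_000, 10);
    println!("overhead: {} bytes", m * 24 + m * n * 24);
}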

Let's test this experimentally. Since you didn't share your CSV input (you really should in the future), I'll bring my own. It's 145MB, has M=3,173,958 records with N=7 fields per record. So the total overhead for your representation is (3173958 * 24) + (3173958 * 7 * 24) = 609,399,936 bytes, or 609 MB. Let's test that with a real program:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let rdr = csv::Reader::from_path(input_path)?;
    // Materialize the question's representation: one Vec<String>
    // per row, one separately allocated String per field.
    let mut records: Vec<Vec<String>> = vec![];
    for result in rdr.into_records() {
        let mut record: Vec<String> = vec![];
        for column in result?.iter() {
            record.push(column.to_string());
        }
        records.push(record);
    }
    println!("{}", records.len());
    Ok(())
}

(I've added some unnecessary type annotations in a couple places to make the code a little clearer, particularly with respect to our representation.) So let's run this program (whose only dependency is csv = "1" in my Cargo.toml):

$ echo $TIMEFMT
real %*E user %*U sys %*S maxmem %M MB faults %F
$ cargo b --release
$ time ./target/release/csvmem /m/sets/csv/pop/worldcitiespop-nice.csv
3173958

real    1.542
user    1.236
sys     0.296
maxmem  1287 MB
faults  0

The time utility here reports peak memory usage, which is actually a bit higher than what we might expect: 609 + 145 = 754MB. I don't quite know enough about allocators to reason through the difference completely. It could be that the system allocator I'm using allocates bigger chunks than what is actually needed.

Let's make our representation a bit more efficient by using a Box<str> instead of String. We sacrifice the ability to grow the string, but in exchange, we save 8 bytes of overhead (the capacity word) per field. So our new overhead calculation is (3173958 * 24) + (3173958 * 7 * 16) = 431,658,288 bytes, or 431MB, for a difference of 609 - 431 = 178MB. So let's test our new representation and see what our delta is:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let rdr = csv::Reader::from_path(input_path)?;
    // Box<str> has no capacity word: 16 bytes of header per field
    // instead of 24.
    let mut records: Vec<Vec<Box<str>>> = vec![];
    for result in rdr.into_records() {
        let mut record: Vec<Box<str>> = vec![];
        for column in result?.iter() {
            record.push(column.to_string().into());
        }
        records.push(record);
    }
    println!("{}", records.len());
    Ok(())
}

And to compile and run:

$ cargo b --release
$ time ./target/release/csvmem /m/sets/csv/pop/worldcitiespop-nice.csv
3173958

real    1.459
user    1.183
sys     0.266
maxmem  1093 MB
faults  0

for a total delta of 194MB, which is pretty close to our 178MB guess.

We can optimize the representation even further by using a Vec<Box<[Box<str>]>>:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let rdr = csv::Reader::from_path(input_path)?;
    // Boxing the record slice drops the inner Vec's capacity word
    // as well: 8 fewer bytes of header per record.
    let mut records: Vec<Box<[Box<str>]>> = vec![];
    for result in rdr.into_records() {
        let mut record: Vec<Box<str>> = vec![];
        for column in result?.iter() {
            record.push(column.to_string().into());
        }
        records.push(record.into());
    }
    println!("{}", records.len());
    Ok(())
}

That gives a peak memory usage of 1069 MB. So not much of a savings, which makes sense: all we shaved off is the inner Vec's 8-byte capacity word, about 3173958 * 8 = 25MB in total.

However, the best thing we can do is use a csv::StringRecord:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let rdr = csv::Reader::from_path(input_path)?;
    // A StringRecord keeps all of a row's field data in one buffer,
    // rather than one allocation per field.
    let mut records = vec![];
    for result in rdr.into_records() {
        let record = result?;
        records.push(record);
    }
    println!("{}", records.len());
    Ok(())
}

And that gives a peak memory usage of 727MB. The secret is that a StringRecord stores fields inline without that second layer of indirection. It ends up saving quite a bit!
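To see why that's cheaper, here's a minimal sketch of the inline idea (this is not the csv crate's actual internals, just the shape of the technique): all of a record's fields live back to back in a single buffer, and we only remember where each field ends.

// A hypothetical inline record: one heap allocation for the text
// plus one for the field boundaries, instead of one String per field.
struct InlineRecord {
    buf: String,      // every field's bytes, back to back
    ends: Vec<usize>, // end offset of each field within buf
}

impl InlineRecord {
    fn from_fields<'a>(fields: impl IntoIterator<Item = &'a str>) -> InlineRecord {
        let mut rec = InlineRecord { buf: String::new(), ends: Vec::new() };
        for field in fields {
            rec.buf.push_str(field);
            rec.ends.push(rec.buf.len());
        }
        rec
    }

    // Recover field i by slicing between the previous end and its end.
    fn get(&self, i: usize) -> Option<&str> {
        let start = if i == 0 { 0 } else { *self.ends.get(i - 1)? };
        let end = *self.ends.get(i)?;
        Some(&self.buf[start..end])
    }
}

fn main() {
    let rec = InlineRecord::from_fields(["a", "bb", "ccc"]);
    assert_eq!(rec.get(0), Some("a"));
    assert_eq!(rec.get(1), Some("bb"));
    assert_eq!(rec.get(3), None);
    println!("two heap allocations, not one per field");
}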

Of course, if you don't need to store all of the records in memory at once, then you shouldn't. And the CSV crate supports that just fine:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let mut count = 0;
    let rdr = csv::Reader::from_path(input_path)?;
    // Each record is parsed, validated and immediately dropped, so
    // memory stays flat no matter how large the file is.
    for result in rdr.into_records() {
        let _ = result?;
        count += 1;
    }
    println!("{}", count);
    Ok(())
}

And that program's peak memory usage is only 9MB, as you'd expect of a streaming implementation. (Technically, you can use no heap memory at all if you drop down and use the csv-core crate.)
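One more trick worth knowing while staying on the regular csv crate: Reader::read_record lets you reuse a single StringRecord for every row, so even the per-record allocations get amortized across the whole file. A sketch of that pattern:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input_path = match std::env::args_os().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("Usage: csvmem <path>");
            std::process::exit(1);
        }
    };
    let mut rdr = csv::Reader::from_path(input_path)?;
    // One record buffer, reused for every row.
    let mut record = csv::StringRecord::new();
    let mut count = 0u64;
    while rdr.read_record(&mut record)? {
        count += 1;
    }
    println!("{}", count);
    Ok(())
}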

Upvotes: 10
