songroom

Reputation: 21

How to read a CSV that includes Chinese characters in Rust?

When I read a CSV file that includes Chinese characters using the csv crate, I get an error.

extern crate csv;

use std::thread;
use std::time::Duration;

fn main() {
    let mut rdr =
        csv::Reader::from_file("C:\\Users\\Desktop\\test.csv").unwrap().has_headers(false);
    for record in rdr.decode() {
        let (a, b): (String, String) = record.unwrap();
        println!("a:{},b:{}", a, b);
    }
    // Pause before exiting.
    thread::sleep(Duration::from_millis(500000));
}

The error:

Running `target\release\rust_Work.exe`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Decode("Could not convert bytes \'FromUtf8Error { bytes: [208, 213, 195, 251], error: Utf8Error { valid_up_to: 0 } }\' to UTF-8.")', ../src/libcore\result.rs:788
note: Run with `RUST_BACKTRACE=1` for a backtrace.
error: Process didn't exit successfully: `target\release\rust_Work.exe` (exit code: 101)

test.csv:

姓名   性别    年纪    分数     等级
小二    男     12      88      良好
小三    男     13      89      良好
小四    男     14      91      优秀


Upvotes: 1

Views: 1426

Answers (3)

Shepmaster

Reputation: 431429

I found a way to solve it: decode the file from GB18030 into a String first, then hand that String to the csv crate. Thanks, all.

extern crate csv;
extern crate encoding;
extern crate rustc_serialize;

use encoding::all::GB18030;
use encoding::{DecoderTrap, Encoding};
use std::fs::File;
use std::io::prelude::*;

fn main() {
    let path = "C:\\Users\\Desktop\\test.csv";
    let mut f = File::open(path).expect("cannot open file");

    // Read the raw bytes; the file is not UTF-8, so it cannot be read into a String directly.
    let mut bytes: Vec<u8> = Vec::new();
    f.read_to_end(&mut bytes).expect("cannot read file");

    // Decode GB18030 into a UTF-8 String. DecoderTrap::Ignore skips invalid sequences.
    let mut chars = String::new();
    GB18030
        .decode_to(&bytes, DecoderTrap::Ignore, &mut chars)
        .expect("cannot decode file as GB18030");

    let mut rdr = csv::Reader::from_string(chars).has_headers(true);
    for row in rdr.decode() {
        let (x, y, r): (String, String, String) = row.unwrap();
        println!("({}, {}): {:?}", x, y, r);
    }
}


Upvotes: 0

Shepmaster

Reputation: 431429

I'm not sure what could be done to make the error message more clear:

Decode("Could not convert bytes 'FromUtf8Error { bytes: [208, 213, 195, 251], error: Utf8Error { valid_up_to: 0 } }' to UTF-8.")

FromUtf8Error is documented in the standard library, and the text of the error says "Could not convert bytes to UTF-8" (although there's some extra detail in the middle).

Simply put, your data isn't in UTF-8 and it must be. UTF-8 is all that the Rust standard library (and thus most libraries) really deals with. You will need to figure out what encoding the file is in and then find some way of converting from that to UTF-8. There may be a crate to help with either of those steps.

Perhaps even better, you can save the file as UTF-8 from the beginning. Sadly, it's relatively common for people to hit this issue when using Excel, because Excel does not have a way to easily export UTF-8 CSV files. It always writes a CSV file in the system locale encoding.
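
As a small illustration of the error above, the four bytes from the panic message can be checked with the standard library alone. This is only a sketch: `check_utf8` is a hypothetical helper name, and the guess that the bytes come from a GBK/GB18030-encoded file is an assumption based on the file contents, not something the error itself states.

```rust
use std::str;

/// Returns Ok(text) when `bytes` is valid UTF-8, or Err(offset) with the
/// number of leading bytes that were valid (the `valid_up_to` in the error).
/// `check_utf8` is a hypothetical helper, not part of any crate.
fn check_utf8(bytes: &[u8]) -> Result<&str, usize> {
    str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}

fn main() {
    // The four bytes reported in the panic message.
    let from_error = [208u8, 213, 195, 251];
    match check_utf8(&from_error) {
        Ok(text) => println!("valid UTF-8: {}", text),
        Err(offset) => println!("invalid UTF-8 after {} valid byte(s)", offset),
    }

    // Chinese text that is already encoded as UTF-8 passes the same check.
    let utf8 = "姓名".as_bytes();
    println!("{:?}", check_utf8(utf8));
}
```

The first check fails at offset 0, matching `valid_up_to: 0` in the message: the very first byte sequence is already not UTF-8, so the whole field has to be converted from its original encoding.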

Upvotes: -1

freinn

Reputation: 1079

Part 1: Read Unicode (Chinese or not) characters:

The easiest way to achieve your goal is to use the read_to_string function, which appends the contents of your file to the String you pass to it. The contents must be valid UTF-8; otherwise the call returns an error:

use std::fs::File;
use std::io::prelude::*;

fn main() {
    let mut f = File::open("file.txt").unwrap();
    let mut buffer = String::new();

    // Appends the file's contents to `buffer`; fails if they are not valid UTF-8.
    f.read_to_string(&mut buffer).unwrap();

    println!("{}", buffer);
}
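
To see how read_to_string behaves on input like the file in the question, here is a minimal sketch that reads from an in-memory byte slice instead of a real file. `read_text` is a hypothetical helper name, and the slice simply reuses the four bytes from the question's error message.

```rust
use std::io::{self, Read};

/// Reads everything from `src` into a String.
/// Fails if the bytes are not valid UTF-8. (`read_text` is a hypothetical helper.)
fn read_text<R: Read>(mut src: R) -> io::Result<String> {
    let mut buffer = String::new();
    src.read_to_string(&mut buffer)?;
    Ok(buffer)
}

fn main() {
    // An in-memory slice stands in for the file; these are the bytes
    // reported in the question's error message.
    let not_utf8: &[u8] = &[208, 213, 195, 251];
    match read_text(not_utf8) {
        Ok(text) => println!("{}", text),
        Err(e) => println!("read failed: {:?}", e.kind()),
    }
}
```

So this approach only works as-is when the file was saved as UTF-8; a file in another encoding has to be converted first.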

Part 2: Parse a CSV file whose delimiter is ',':

extern crate regex;
use regex::Regex;

use std::fs::File;
use std::io::prelude::*;

fn main() {
    let mut f = File::open("file.txt").unwrap();
    let mut buffer = String::new();
    let delimiter = ",";

    f.read_to_string(&mut buffer).unwrap();
    // Join all lines into one delimiter-separated stream of fields.
    let modified_buffer = buffer.replace("\n", delimiter);

    // One field: a run of non-delimiter characters, optionally followed by the delimiter.
    let field = format!("([^{0}]+){0}?", delimiter);
    // Each record consists of three fields.
    let re = Regex::new(&field.repeat(3)).unwrap();

    for cap in re.captures_iter(&modified_buffer) {
        let (s1, s2, dist): (String, String, usize) =
            (cap[1].to_string(), cap[2].to_string(), cap[3].parse::<usize>().unwrap());
        println!("({}, {}): {}", s1, s2, dist);
    }
}
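
For comparison, the same kind of simple, unquoted input can also be split without a regex. This is a different technique from the one above, shown only as a sketch: `parse_rows` is a hypothetical helper, the sample text is made up for illustration, and quoted or escaped fields are not handled.

```rust
/// Splits each non-empty line on `delimiter` and trims every field.
/// Quoting and escaping are not handled. (`parse_rows` is a hypothetical helper.)
fn parse_rows(text: &str, delimiter: char) -> Vec<Vec<String>> {
    text.lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| {
            line.split(delimiter)
                .map(|field| field.trim().to_string())
                .collect()
        })
        .collect()
}

fn main() {
    // Made-up sample input: two records with three fields each.
    let text = "小二,男,12\n小三,男,13\n";
    for row in parse_rows(text, ',') {
        println!("{:?}", row);
    }
}
```

Unlike the regex version, this keeps the line structure, so a record with a missing field does not shift the fields of the records after it.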


Upvotes: -3
