ideasman42

Reputation: 48248

How to get a '&str' from a NUL-terminated byte slice if the NUL terminator isn't at the end of the slice?

While CStr is typically used for FFI, I am reading from a &[u8] which is NUL-terminated and is ensured to be valid UTF-8 so no checks are needed.

However the NUL terminator isn't necessarily at the end of the slice. What's a good way to get this as a &str?

It was suggested to use CStr::from_bytes_with_nul, but this fails on an interior \0 character (when the \0 isn't the last character): it returns an error, which panics if unwrapped.
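For reference, a minimal sketch of that behaviour: `CStr::from_bytes_with_nul` accepts a slice only when its single NUL is the final byte, and reports anything else through its `Result`. The helper name `parse_exact` is hypothetical and exists only for illustration.

```rust
use std::ffi::CStr;

/// Hypothetical helper: succeeds only when the one NUL byte is the last byte
/// of the slice and the bytes before it are valid UTF-8.
fn parse_exact(bytes: &[u8]) -> Option<&str> {
    // An interior NUL or a missing NUL yields Err here, which `.ok()?`
    // turns into None instead of a panic.
    CStr::from_bytes_with_nul(bytes).ok()?.to_str().ok()
}
```

With this helper, `b"abc\0"` is accepted, while `b"abc\0def"` and `b"abc"` are both rejected.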

Upvotes: 14

Views: 9901

Answers (3)

David Wood

Reputation: 385

Three possible other ways of doing this, mostly using only functions from std.

use std::ffi::CStr;
use std::str;

fn str_from_null_terminated_utf8_safe(s: &[u8]) -> &str {
    if s.iter().any(|&x| x == 0) {
        unsafe { str_from_null_terminated_utf8(s) }
    } else {
        str::from_utf8(s).unwrap()
    }
}

// unsafe: s must contain a null byte
unsafe fn str_from_null_terminated_utf8(s: &[u8]) -> &str {
    CStr::from_ptr(s.as_ptr() as *const _).to_str().unwrap()
}

// unsafe: s must contain a null byte, and be valid utf-8
unsafe fn str_from_null_terminated_utf8_unchecked(s: &[u8]) -> &str {
    str::from_utf8_unchecked(CStr::from_ptr(s.as_ptr() as *const _).to_bytes())
}
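On newer toolchains there is also a fully safe, std-only route: `CStr::from_bytes_until_nul` (stable since Rust 1.69, so not available when this answer was written) stops at the first NUL and never reads past the end of the slice. A minimal sketch follows. The function name `str_until_nul` is hypothetical, and falling back to the whole slice when no NUL is present is an assumption that mirrors the `_safe` variant above, except that it returns an error instead of panicking on invalid UTF-8.

```rust
use std::ffi::CStr;
use std::str::{self, Utf8Error};

/// Hypothetical helper: the text before the first NUL, or the whole slice
/// if it contains no NUL. UTF-8 is validated in both cases.
fn str_until_nul(s: &[u8]) -> Result<&str, Utf8Error> {
    match CStr::from_bytes_until_nul(s) {
        // A NUL was found: the CStr covers the bytes before it.
        Ok(c) => c.to_str(),
        // No NUL anywhere in the slice: treat the whole slice as the string.
        Err(_) => str::from_utf8(s),
    }
}
```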

As a slight aside: benchmark results for all the options in this thread:

With s = b"\0"

test dtwood::bench_str_from_null_terminated_utf8           ... bench:           9 ns/iter (+/- 0)
test dtwood::bench_str_from_null_terminated_utf8_safe      ... bench:          10 ns/iter (+/- 3)
test dtwood::bench_str_from_null_terminated_utf8_unchecked ... bench:           5 ns/iter (+/- 1)
test ideasman42::bench_str_from_u8_nul_utf8_unchecked      ... bench:           1 ns/iter (+/- 0)
test ker::bench_str_from_u8_nul_utf8                       ... bench:           4 ns/iter (+/- 0)
test ker::bench_str_from_u8_nul_utf8_unchecked             ... bench:           1 ns/iter (+/- 0)

with s = b"abcdefghij\0klmnop"

test dtwood::bench_str_from_null_terminated_utf8           ... bench:          15 ns/iter (+/- 2)
test dtwood::bench_str_from_null_terminated_utf8_safe      ... bench:          20 ns/iter (+/- 2)
test dtwood::bench_str_from_null_terminated_utf8_unchecked ... bench:           6 ns/iter (+/- 0)
test ideasman42::bench_str_from_u8_nul_utf8_unchecked      ... bench:           7 ns/iter (+/- 0)
test ker::bench_str_from_u8_nul_utf8                       ... bench:          15 ns/iter (+/- 2)
test ker::bench_str_from_u8_nul_utf8_unchecked             ... bench:           5 ns/iter (+/- 0)

with s = b"abcdefghij" * 512 + "\0klmnopqrs"

test dtwood::bench_str_from_null_terminated_utf8           ... bench:         351 ns/iter (+/- 35)
test dtwood::bench_str_from_null_terminated_utf8_safe      ... bench:       1,987 ns/iter (+/- 274)
test dtwood::bench_str_from_null_terminated_utf8_unchecked ... bench:         170 ns/iter (+/- 18)
test ideasman42::bench_str_from_u8_nul_utf8_unchecked      ... bench:       2,466 ns/iter (+/- 292)
test ker::bench_str_from_u8_nul_utf8                       ... bench:       1,971 ns/iter (+/- 209)
test ker::bench_str_from_u8_nul_utf8_unchecked             ... bench:       1,828 ns/iter (+/- 205)

So if you're super concerned about performance, it's probably best to benchmark with your particular data set - dtwood::bench_str_from_null_terminated_utf8_unchecked seems to perform better with longer strings, but ker::bench_str_from_u8_nul_utf8_unchecked does better on small (< 20 character) strings.
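If you want to repeat such a measurement on your own data without a benchmark harness, a rough sketch using only `std::time::Instant` and `std::hint::black_box` could look like the following. The helpers `ns_per_iter` and `long_input` are hypothetical names, and this is far cruder than a real benchmark framework (no warm-up, no variance estimate), so treat its numbers as indicative only.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical helper: average wall-clock nanoseconds per call of `f(input)`.
fn ns_per_iter<'a, R>(input: &'a [u8], iters: u32, f: impl Fn(&'a [u8]) -> R) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the call.
        black_box(f(black_box(input)));
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

/// Input shaped like the third case above: "abcdefghij" * 512 + "\0klmnopqrs".
fn long_input() -> Vec<u8> {
    let mut v = b"abcdefghij".repeat(512);
    v.extend_from_slice(b"\0klmnopqrs");
    v
}
```

Calling `ns_per_iter(&long_input(), 100_000, |s| s.iter().position(|&b| b == 0))` times the NUL search alone. Pass a closure that wraps one of the conversion functions to time the full conversion.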

Upvotes: 4

ideasman42

Reputation: 48248

This example finds the first NUL byte using a simple for loop, then uses Rust's standard library to return the slice as a &str (referencing the original data - zero copy).

There may well be a better way to find the first NUL byte using closures:

pub unsafe fn str_from_u8_nul_utf8_unchecked(utf8_src: &[u8]) -> &str {
    // does Rust have a built-in 'memchr' equivalent?
    // Count the bytes before the first NUL. Starting at 0 excludes the NUL
    // itself and keeps the range in bounds when the slice has no NUL.
    let mut nul_range_end = 0_usize;
    for b in utf8_src {
        if *b == 0 {
            break;
        }
        nul_range_end += 1;
    }
    ::std::str::from_utf8_unchecked(&utf8_src[0..nul_range_end])
}

While utf8_src.iter().position(|&c| c == b'\0').unwrap_or(utf8_src.len()); returns the index of the first NUL byte (or the total length if there is none), Rust 1.15 does not optimize it into something like memchr, so a for loop might not be such a bad option for now.

Upvotes: 2

oli_obk

Reputation: 31283

I would use iterator adaptors to find the index of the first zero byte:

pub unsafe fn str_from_u8_nul_utf8_unchecked(utf8_src: &[u8]) -> &str {
    let nul_range_end = utf8_src.iter()
        .position(|&c| c == b'\0')
        .unwrap_or(utf8_src.len()); // default to length if no `\0` present
    ::std::str::from_utf8_unchecked(&utf8_src[0..nul_range_end])
}

This has the major advantage of requiring one to handle all cases (like no 0 in the array), because position returns an Option that has to be dealt with before the index can be used.

If you want the version that checks for well-formed UTF-8:

pub fn str_from_u8_nul_utf8(utf8_src: &[u8]) -> Result<&str, std::str::Utf8Error> {
    let nul_range_end = utf8_src.iter()
        .position(|&c| c == b'\0')
        .unwrap_or(utf8_src.len()); // default to length if no `\0` present
    ::std::str::from_utf8(&utf8_src[0..nul_range_end])
}
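A short usage sketch of the checked variant, with the function repeated so the snippet compiles on its own. The `field_name` caller, its fixed 8-byte field, and the choice to fall back to an empty string on invalid UTF-8 are all illustrative assumptions.

```rust
use std::str::{self, Utf8Error};

pub fn str_from_u8_nul_utf8(utf8_src: &[u8]) -> Result<&str, Utf8Error> {
    let nul_range_end = utf8_src.iter()
        .position(|&c| c == b'\0')
        .unwrap_or(utf8_src.len()); // default to length if no `\0` present
    str::from_utf8(&utf8_src[..nul_range_end])
}

/// Hypothetical caller: decode a NUL-padded, fixed-size name field,
/// returning "" if the bytes before the first NUL are not valid UTF-8.
fn field_name(field: &[u8; 8]) -> &str {
    str_from_u8_nul_utf8(field).unwrap_or("")
}
```

Everything after the first NUL is ignored, so padding bytes or trailing data never reach the UTF-8 check.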

Upvotes: 10
