How to get char range from byte range

Question

I have an external library whose string representation is equivalent to &[char].

Some of its editing interfaces accept a range of type CharRange = Range<usize>, whose offsets count chars.

On the other hand, some other Rust libraries I use take a ByteRange = Range<usize>, whose offsets count bytes (u8).
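
To make the two kinds of offset concrete, here is a small sketch that locates a substring and reports both ranges for it. The helper name `find_both_ranges` is made up for this illustration and does not come from either library.

```rust
use std::ops::Range;

/// Hypothetical helper for illustration only: returns the
/// (byte range, char range) of the first occurrence of `needle` in `text`.
pub fn find_both_ranges(text: &str, needle: &str) -> Option<(Range<usize>, Range<usize>)> {
    // `str::find` reports a byte offset.
    let byte_start = text.find(needle)?;
    let byte_end = byte_start + needle.len();
    // The char offset is the number of chars before that byte offset.
    let char_start = text[..byte_start].chars().count();
    let char_end = char_start + needle.chars().count();
    Some((byte_start..byte_end, char_start..char_end))
}
```

For "héllo wörld" and the needle "wörld", this gives the byte range 7..13 but the char range 6..11, because 'é' and 'ö' each take two bytes in UTF-8.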

Currently I am using an O(n) algorithm, and it has become a performance bottleneck.

Is there an efficient data structure for converting between the two?

use std::ops::Range;

type CharRange = Range<usize>;
type ByteRange = Range<usize>;

fn byte_range_to_char_range(text: &str, byte_range: ByteRange) -> CharRange {
    let start = text[..byte_range.start].chars().count();
    let end = text[..byte_range.end].chars().count();
    start..end
}

fn char_range_to_byte_range(text: &str, char_range: CharRange) -> ByteRange {
    let start = text.char_indices().nth(char_range.start).map(|(i, _)| i).unwrap_or(0);
    let end = text.char_indices().nth(char_range.end).map(|(i, _)| i).unwrap_or(text.len());
    start..end
}

cafce25 · Accepted Answer

You can improve it slightly by not iterating from the very start again, but it's probably not worth it unless your texts are very long:

use std::ops::Range;
type CharRange = Range<usize>;
type ByteRange = Range<usize>;

pub fn byte_range_to_char_range(text: &str, byte_range: ByteRange) -> CharRange {
    let start = text[..byte_range.start].chars().count();
    let size = text[byte_range.start..byte_range.end].chars().count();
    start..start + size
}

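pub fn char_range_to_byte_range(text: &str, char_range: CharRange) -> ByteRange {
    let mut iter = text.char_indices();
    // Byte offset of the start char; a start at or past the end maps to text.len().
    let start = iter.nth(char_range.start).map_or(text.len(), |(i, _)| i);
    // nth() above consumed the start char, so skip len - 1 more to reach the end char.
    let end = match char_range.end.saturating_sub(char_range.start) {
        0 => start, // empty range: avoids the usize underflow in `len - 1`
        len => iter.nth(len - 1).map_or(text.len(), |(i, _)| i),
    };
    start..end
}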

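But because UTF-8 is a variable-width encoding, a char offset cannot be computed from a byte offset (or vice versa) without scanning the text, so on a plain &str each conversion stays O(n).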
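
If the same text is converted many times between edits, one option that goes beyond the answer above is to pay for the scan once and remember the byte offset of every N-th char. The following is only a sketch under that assumption: `OffsetIndex` and `STRIDE` are hypothetical names, the index has to be rebuilt whenever the text changes, and whether it helps depends on how often the text is edited compared with how often ranges are converted.

```rust
/// Hypothetical helper, not part of any library: remembers the byte offset
/// of every STRIDE-th char so that a lookup scans at most one stride of text.
pub struct OffsetIndex {
    checkpoints: Vec<usize>,
}

/// Assumed spacing between checkpoints, counted in chars; tune for your data.
const STRIDE: usize = 64;

impl OffsetIndex {
    /// One O(n) pass over `text`. Must be rebuilt whenever `text` changes.
    pub fn new(text: &str) -> Self {
        let checkpoints = text
            .char_indices()
            .step_by(STRIDE)
            .map(|(byte, _)| byte)
            .collect();
        Self { checkpoints }
    }

    /// Byte offset of the char at `char_idx`, or `text.len()` if past the end.
    pub fn char_to_byte(&self, text: &str, char_idx: usize) -> usize {
        match self.checkpoints.get(char_idx / STRIDE) {
            // Jump to the nearest checkpoint, then scan fewer than STRIDE chars.
            Some(&base) => text[base..]
                .char_indices()
                .nth(char_idx % STRIDE)
                .map_or(text.len(), |(i, _)| base + i),
            None => text.len(),
        }
    }

    /// Char offset of `byte_idx`, which must lie on a char boundary of `text`.
    pub fn byte_to_char(&self, text: &str, byte_idx: usize) -> usize {
        // Number of checkpoints at or before `byte_idx`, found by binary search.
        let n = self.checkpoints.partition_point(|&b| b <= byte_idx);
        match n.checked_sub(1) {
            Some(k) => k * STRIDE + text[self.checkpoints[k]..byte_idx].chars().count(),
            None => 0, // empty text has no checkpoints
        }
    }
}
```

Converting a range then means calling the matching method on both ends, for example `index.char_to_byte(text, r.start)..index.char_to_byte(text, r.end)`. Each lookup scans at most STRIDE chars (plus a binary search for the byte-to-char direction) instead of the whole prefix, at the cost of one usize per STRIDE chars of text.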
