Reputation: 4277
How can I consume a list of tokens that may or may not be separated by a space?
I'm trying to parse Chinese romanization (pinyin) in the cedict format with nom (6.1.2). For example "ni3 hao3 ma5", which is, due to human error in transcription, sometimes written as "ni3hao3ma5" or "ni3hao3 ma5" (note the variable spacing).
I have written a parser that handles individual syllables, e.g. ["ni3", "hao3", "ma5"], and I'm trying to use nom::multi::separated_list0 to parse the whole string like so:
nom::multi::separated_list0(
nom::character::complete::space0,
syllable,
)(i)?;
However, I get an Err(Error(Error { input: "", code: SeparatedList })) after all the tokens have been consumed.
Upvotes: 0
Views: 1362
Reputation: 4277
The problem with using
nom::multi::separated_list0(
nom::character::complete::space0,
syllable,
)(i)?;
is that the space0 delimiter also matches the empty string. Once the end of the input is reached, the delimiter still succeeds without consuming anything, and separated_list0 reports a separator that makes no progress as an error instead of looping forever, hence the Err(Error(Error { input: "", code: SeparatedList })).
The solution in my case was to use nom::multi::many1 instead of nom::multi::separated_list0, and to handle the optional spaces in the inner parser, like so:
use nom::{character, multi, sequence, IResult};

// `Syllable` is my own type, defined elsewhere.
fn syllables(i: &str) -> IResult<&str, Vec<Syllable>> {
    // many1 👇 instead of separated_list0
    multi::many1(syllable)(i)
}

fn syllable(i: &str) -> IResult<&str, Syllable> {
    let (rest, (_, pronunciation, tone)) = sequence::tuple((
        // and handle the optional space
        // here 👇
        character::complete::space0,
        character::complete::alpha1,
        character::complete::digit0,
    ))(i)?;
    Ok((rest, Syllable::new(pronunciation, tone)))
}
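To see why this works for all three spellings, here is a rough dependency-free sketch of the same logic written by hand, without nom. It is only an illustration of the idea (skip optional spaces, then require letters, then accept optional digits, and repeat at least once), not how nom implements it. The `Syllable` struct and its fields are a hypothetical stand-in for the real type, which isn't shown above.

```rust
/// Hypothetical stand-in for the `Syllable` type used in the answer.
#[derive(Debug, PartialEq)]
struct Syllable {
    pronunciation: String,
    tone: String,
}

/// Parse one syllable: optional spaces, one or more ASCII letters,
/// then zero or more ASCII digits. Returns the remaining input and the
/// syllable, or `None` if no letters are found.
fn syllable(input: &str) -> Option<(&str, Syllable)> {
    // Skip optional leading spaces or tabs (the role `space0` plays above).
    let s = input.trim_start_matches(|c: char| c == ' ' || c == '\t');

    // Require at least one letter (the role of `alpha1`).
    let letters_end = s
        .find(|c: char| !c.is_ascii_alphabetic())
        .unwrap_or(s.len());
    if letters_end == 0 {
        return None;
    }
    let (pronunciation, rest) = s.split_at(letters_end);

    // Accept zero or more digits (the role of `digit0`).
    let digits_end = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    let (tone, rest) = rest.split_at(digits_end);

    Some((
        rest,
        Syllable {
            pronunciation: pronunciation.to_string(),
            tone: tone.to_string(),
        },
    ))
}

/// Parse one or more syllables (the role of `many1`). Every successful
/// call to `syllable` consumes at least one letter, so the loop always
/// makes progress and stops cleanly at the end of the input.
fn syllables(mut input: &str) -> Option<(&str, Vec<Syllable>)> {
    let mut out = Vec::new();
    while let Some((rest, syl)) = syllable(input) {
        out.push(syl);
        input = rest;
    }
    if out.is_empty() {
        None
    } else {
        Some((input, out))
    }
}
```

With this sketch, calling `syllables` on "ni3 hao3 ma5", "ni3hao3ma5" or "ni3hao3 ma5" yields the same three syllables with nothing left over. Because the spaces are optional and belong to the item parser, nothing in the loop can succeed on an empty match, which is the property the separated_list0 version lacked.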
Upvotes: 1