noconst

Filter strings with regex

I need to filter (select) strings that follow certain rules, print them, and count the number of filtered strings. The input is one big string, and I need to apply the following rules to each line:

  1. line must not contain any of ab, cd, pq, or xy
  2. line must contain any of the vowels
  3. line must contain a letter that repeats itself, like aa, ff, yy etc

I'm using the regex crate, which provides regex::RegexSet for combining multiple rules. The rules I added are as follows:

    let regexp = regex::RegexSet::new(&[
        r"^((?!ab|cd|pq|xy).)*",         // rule 1
        r"((.)\1{9,}).*",                // rule 3
        r"(\b[aeiyou]+\b).*",            // rule 2
    ])

But I don't know how to use these rules to filter the lines and iterate over them.

pub fn p1(lines: &str) -> u32 {
    lines
      .split_whitespace().filter(|line| { /* regex filter goes here */ })
      .map(|line| println!("{}", line))
      .count() as u32
}

Also the compiler says that the crate doesn't support look-around, including look-ahead and look-behind.

Answers (1)

BurntSushi5

If you're looking to use a single regex, then doing this via the regex crate (which, by design, and as documented, does not support look-around or backreferences) is probably not possible. You could use a RegexSet, but implementing your third rule would require using a regex that lists every repetition of a Unicode letter. This would not be as bad if you were okay limiting this to ASCII, but your comments suggest this isn't acceptable.
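To make the ASCII-only variant concrete, here is a minimal sketch that builds the alternation aa|bb|...|zz using only the standard library. The function name ascii_doubles_pattern is made up for illustration, and the sketch assumes only lowercase ASCII letters matter. The resulting string could then be passed to Regex::new or used as one entry of a RegexSet.

```rust
/// Builds the pattern "aa|bb|...|zz": one alternative per doubled
/// lowercase ASCII letter. (Hypothetical helper, not part of the regex crate.)
fn ascii_doubles_pattern() -> String {
    ('a'..='z')
        .map(|c| format!("{}{}", c, c))
        .collect::<Vec<_>>()
        .join("|")
}

fn main() {
    let pattern = ascii_doubles_pattern();
    // The pattern uses plain alternation only (no look-around, no
    // backreferences), so it stays within the syntax the regex crate accepts.
    println!("{}", pattern);
}
```

Covering every Unicode letter the same way would need a far longer list, which is why this only looks reasonable for ASCII.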

So I think your practical options here are either to use a library that supports fancier regex features (such as fancy-regex for a pure Rust library, or pcre2 if you're okay using a C library), or to write just a bit more code:

use regex::Regex;

fn main() {
    let corpus = "\
baz
ab
cwm
foobar
quux
foo pq bar
";

    let blacklist = Regex::new(r"ab|cd|pq|xy").unwrap();
    let vowels = Regex::new(r"[aeiouy]").unwrap();
    let it = corpus
        .lines()
        .filter(|line| !blacklist.is_match(line))
        .filter(|line| vowels.is_match(line))
        .filter(|line| repeated_letter(line));
    for line in it {
        println!("{}", line);
    }
}

fn repeated_letter(line: &str) -> bool {
    let mut prev = None;
    for ch in line.chars() {
        if prev.map_or(false, |prev| prev == ch) {
            return true;
        }
        prev = Some(ch);
    }
    false
}

Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c0928793474af1f9c0180c1ac8fd2d47
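The question also asks for the number of matching lines, which the code above does not compute. Below is a minimal sketch of one way to get it. To keep it free of external crates, it swaps the two regexes for plain contains/any checks, so it is a variation for illustration rather than the code above. The names count_nice and BLACKLIST are assumptions; the vowel set aeiouy and the "any doubled character" test follow the code above.

```rust
// Substrings that disqualify a line (rule 1).
const BLACKLIST: [&str; 4] = ["ab", "cd", "pq", "xy"];

// Rule 3: true if any character is immediately followed by itself.
fn repeated_letter(line: &str) -> bool {
    let mut prev: Option<char> = None;
    for ch in line.chars() {
        if prev == Some(ch) {
            return true;
        }
        prev = Some(ch);
    }
    false
}

// Prints every line that passes all three rules and returns how many did.
// (Hypothetical helper name.)
fn count_nice(corpus: &str) -> usize {
    corpus
        .lines()
        .filter(|line| !BLACKLIST.iter().any(|bad| line.contains(*bad)))
        .filter(|line| line.chars().any(|c| "aeiouy".contains(c)))
        .filter(|line| repeated_letter(line))
        // inspect runs a side effect per item without consuming the iterator.
        .inspect(|line| println!("{}", line))
        .count()
}

fn main() {
    let corpus = "baz\nab\ncwm\nfoobar\nquux\nfoo pq bar\n";
    println!("{} matching line(s)", count_nice(corpus));
}
```

With the same corpus as above, only foobar and quux pass all three rules, so the count is 2. Using inspect instead of map for the printing keeps the side effect separate from the count.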
