DrewHoo
DrewHoo

Reputation: 27

What pattern in a grep command will match each character at most once?

I want to find words in a document using only the letters in a given pattern, but those letters can appear at most once. Suppose document.txt consists of "abcd abbcd" What pattern (and what concepts are involved in writing such a pattern) will return "abcd" and not "abbcd"?

Upvotes: 0

Views: 420

Answers (4)

JJoao
JJoao

Reputation: 5347

grep -Pwo '[abc]+' | grep -Pv '([abc]).*\1' | awk 'length==3'

where:

  • first grep: a word composed by the pattern letters...
  • second grep: ... with no repeated letters ...
  • awk: ...whose length is the number of letters

Upvotes: 0

mgrant
mgrant

Reputation: 1031

There were already some good ideas here, but I wanted to offer an example implementation in python. This isn't necessarily optimal, but it should work. Usage would be:

$ python find.py -p abcd < file.txt

And the implementation of find.py is:

import argparse
import sys
from itertools import cycle

parser = argparse.ArgumentParser()
parser.add_argument('-p', required=True)
args = parser.parse_args()

for line in sys.stdin:
    for candidate in line.split():
        present = dict(zip(args.p, cycle((0,)))) # initialize dict of letter:count
        for ch in candidate:
            if ch in present:
                present[ch] += 1 
        if all(x <= 1 for x in present.values()):
            print(candidate)

This handles your requirement of matching each character in the pattern at most once, i.e. it allows for zero matches. If you wanted to match each character exactly once, you'd change the second-to-last line to:

        if all(x == 1 for x in present.values()):

Upvotes: 1

Aivean
Aivean

Reputation: 10882

Melpomene is right, regexps are not the best instrument to solve this task. Regexp is essentially a finite state machine. In your case current state can be defined as the combination of presence flags for each of the letters from your alphabet. Thus the total number of internal states in regex will be 2^N where N is the number of allowed letters.

The easiest way to define such regex will be list all possible permutations of available letters (and use ? to eliminate necessity to list shorter sequences). For three letters (a,b,c) regex looks like:

a?b?c?|a?c?b?|b?a?c?|b?c?a?|c?a?b?|c?b?a?

For the four letters (a,b,c,d) it becomes much longer:

a?b?c?d?|a?b?d?c?|a?c?b?d?|a?c?d?b?|a?d?b?c?|a?d?c?b?|b?a?c?d?|b?a?d?c?|b?c?a?d?|b?c?d?a?|b?d?a?c?|b?d?c?a?|c?a?b?d?|c?a?d?b?|c?b?a?d?|c?b?d?a?|c?d?a?b?|c?d?b?a?|d?a?b?c?|d?a?c?b?|d?b?a?c?|d?b?c?a?|d?c?a?b?|d?c?b?a?

As you can see, not that convenient.

The solution without regexps depends on your toolset. I would write a simple program that processes input text word by word. At the start of the word BitSet is created, where each bit represents the presence of the corresponding letter of the desired alphabet. While traversing the word if bit that corresponds to the current letter is zero it becomes one. If already marked bit occurs or letter is not in alphabet, word is skipped. If word is completely evaluated, then it's "valid".

Upvotes: 0

CaHa
CaHa

Reputation: 1166

You could check if a character appears more than once and then negate the result (in your source code):

  • split your document into words
  • check each word with ([a-z])[a-z]*\1 (that matches abbcd, but not abcd)
  • negate the result

Explanation:

  • ([a-z]) matches any single character
  • [a-z]* allows none or more characters after the one matched above
  • \1 is a back reference to the character found at ([a-z])

Upvotes: 1

Related Questions