Reputation: 27
I want to find words in a document using only the letters in a given pattern, but those letters can appear at most once. Suppose document.txt consists of "abcd abbcd" What pattern (and what concepts are involved in writing such a pattern) will return "abcd" and not "abbcd"?
Upvotes: 0
Views: 420
Reputation: 5347
grep -Pwo '[abc]+' | grep -Pv '([abc]).*\1' | awk 'length==3'
where:
grep
: a word composed by the pattern letters...grep
: ... with no repeated letters ...awk
: ...whose length is the number of lettersUpvotes: 0
Reputation: 1031
There were already some good ideas here, but I wanted to offer an example implementation in python. This isn't necessarily optimal, but it should work. Usage would be:
$ python find.py -p abcd < file.txt
And the implementation of find.py is:
import argparse
import sys
from itertools import cycle
parser = argparse.ArgumentParser()
parser.add_argument('-p', required=True)
args = parser.parse_args()
for line in sys.stdin:
for candidate in line.split():
present = dict(zip(args.p, cycle((0,)))) # initialize dict of letter:count
for ch in candidate:
if ch in present:
present[ch] += 1
if all(x <= 1 for x in present.values()):
print(candidate)
This handles your requirement of matching each character in the pattern at most once, i.e. it allows for zero matches. If you wanted to match each character exactly once, you'd change the second-to-last line to:
if all(x == 1 for x in present.values()):
Upvotes: 1
Reputation: 10882
Melpomene is right, regexps are not the best instrument to solve this task. Regexp is essentially a finite state machine. In your case current state can be defined as the combination of presence flags for each of the letters from your alphabet. Thus the total number of internal states in regex will be 2^N where N is the number of allowed letters.
The easiest way to define such regex will be list all possible permutations of available letters (and use ?
to eliminate necessity to list shorter sequences). For three letters (a,b,c) regex looks like:
a?b?c?|a?c?b?|b?a?c?|b?c?a?|c?a?b?|c?b?a?
For the four letters (a,b,c,d) it becomes much longer:
a?b?c?d?|a?b?d?c?|a?c?b?d?|a?c?d?b?|a?d?b?c?|a?d?c?b?|b?a?c?d?|b?a?d?c?|b?c?a?d?|b?c?d?a?|b?d?a?c?|b?d?c?a?|c?a?b?d?|c?a?d?b?|c?b?a?d?|c?b?d?a?|c?d?a?b?|c?d?b?a?|d?a?b?c?|d?a?c?b?|d?b?a?c?|d?b?c?a?|d?c?a?b?|d?c?b?a?
As you can see, not that convenient.
The solution without regexps depends on your toolset. I would write a simple program that processes input text word by word. At the start of the word BitSet
is created, where each bit represents the presence of the corresponding letter of the desired alphabet. While traversing the word if bit that corresponds to the current letter is zero it becomes one. If already marked bit occurs or letter is not in alphabet, word is skipped. If word is completely evaluated, then it's "valid".
Upvotes: 0
Reputation: 1166
You could check if a character appears more than once and then negate the result (in your source code):
([a-z])[a-z]*\1
(that matches abbcd
, but not abcd
)Explanation:
([a-z])
matches any single character [a-z]*
allows none or more characters after the one matched above\1
is a back reference to the character found at ([a-z])
Upvotes: 1