Reputation: 230
I wish to search a text file for a given word that may optionally be hyphenated at an unknown position within the word and split across consecutive lines.
For example, match "hyphenated" within:
This sentence contains a hyphena-
ted word.
Closest (unattractive) solution:
"h\(-\s*\n\s*\)\?y\(-\s*\n\s*\)\?p\(-\s*\n\s*\)\?h\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?n\(-\s*\n\s*\)\?a\(-\s*\n\s*\)\?t\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?d"
I'm hoping that someone with stronger regex-fu than mine can come up with a regex that clearly includes the word being searched for, i.e. I'd like to see "hyphenated" in there. I haven't found a way to encode something like the following (which would be buggy anyway, since it would match "hy-ted"):
"{prefix-of:hyphenated}{hyphen/linebreak}{suffix-of:hyphenated}"
I realize that pre-processing the document to collapse such words would make the search simpler but I'm looking for a regex that I can use in a context where this won't be possible due to the tools involved.
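If the pattern can be built by a small helper before it is handed to the search tool, the word stays visible in the source even though the final regex does not. The following is only a sketch under that assumption: `BREAK` and `hyphenation_pattern` are names made up for illustration, and it emits Python `re` syntax rather than the backslash-escaped groups used in the pattern above.

```python
import re

# Optional "hyphen, line break, leading indentation" allowed between two letters.
BREAK = r"(?:-\s*\n\s*)?"

def hyphenation_pattern(word):
    """Return a compiled regex matching `word` with optional line-break hyphens."""
    # Interleave the optional break between every pair of adjacent letters.
    return re.compile(BREAK.join(re.escape(ch) for ch in word))

text = "This sentence contains a hyphena-\nted word."
match = hyphenation_pattern("hyphenated").search(text)
print(repr(match.group(0)))
```

Because every letter of the word is still required, "hy-ted" is not matched.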
Upvotes: 4
Views: 866
Reputation: 14116
Another way to approach this, right off the bat, is to "slide" the hyphenation like this:
hyphenated|h(-\s*\n\s*)yphenated|hy(-\s*\n\s*)phenated|hyp(-\s*\n\s*)henated|hyph(-\s*\n\s*)enated|hyphe(-\s*\n\s*)nated|hyphen(-\s*\n\s*)ated|hyphena(-\s*\n\s*)ted|hyphenat(-\s*\n\s*)ed|hyphenate(-\s*\n\s*)d
It reads better, but I don't really know how it compares performance-wise with your original pattern.
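Writing out every alternative by hand gets tedious for longer words, so the alternation could be generated. This is a rough sketch, not a tested tool: `sliding_pattern` is a hypothetical helper name, and it leaves out the capturing parentheses around the break since nothing reads those groups.

```python
import re

# Hyphen at the end of a line, followed by the next line's indentation.
BREAK = r"-\s*\n\s*"

def sliding_pattern(word):
    """Build 'word|w<break>ord|wo<break>rd|...' as a single regex string."""
    alternatives = [re.escape(word)]
    # One alternative per possible split position inside the word.
    for i in range(1, len(word)):
        alternatives.append(re.escape(word[:i]) + BREAK + re.escape(word[i:]))
    # Non-capturing group so the result can be embedded in a larger pattern.
    return "(?:" + "|".join(alternatives) + ")"

pattern = sliding_pattern("hyphenated")
found = re.search(pattern, "This sentence contains a hyphena-\nted word.")
print(repr(found.group(0)))
```

Unlike the letter-by-letter pattern, each alternative allows at most one break, so a word split across three lines would not be found.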
Yet another idea is to first narrow the search with a pattern along these lines:
h[hypenatd]{0,9}(-\s*\n*\s)?[hypenatd]{0,9}
and then match within the results of this one.
In fact, if I'm not mistaken, if you match with groups like this:
(h[hypenatd]{0,9})(?:-\s*\n*\s)?([hypenatd]{0,9})
then the occurrences of the word hyphenated are all the matches where, in pseudocode:
(match.group1 + match.group2) == "hyphenated"
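In Python, that check might look roughly like the sketch below. The candidate regex is copied from above; `find_hyphenated` is just an illustrative name, and the greedy character classes mean unusual neighbours (for instance another run of the same letters directly before the word) could hide a real occurrence, so treat it as a starting point.

```python
import re

# Narrowing pattern from above: two runs of the word's letters,
# optionally separated by a hyphen plus whitespace.
CANDIDATE = re.compile(r"(h[hypenatd]{0,9})(?:-\s*\n*\s)?([hypenatd]{0,9})")

def find_hyphenated(text, word="hyphenated"):
    """Return the matched text of every candidate whose two halves spell `word`."""
    return [
        m.group(0)
        for m in CANDIDATE.finditer(text)
        if m.group(1) + m.group(2) == word
    ]

hits = find_hyphenated("This sentence contains a hyphena-\nted word.")
print(hits)
```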
Upvotes: 0
Reputation: 6571
I think this would work. If you have many words to search for, you would probably want to create a script to generate the search pattern for you.
[h\-]+\s*[y\-\s]+[p\-\s]+[h\-\s]+[e\-\s]+[n\-\s]+[a\-\s]+[t\-\s]+[e\-\s]+d\b
I don't think you mentioned which language you are using, but I tested this with .NET.
Here's a simple Python script that will generate search patterns:
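# patterngen.py
# Usage: python patterngen.py <word>
# Example: python patterngen.py hyphenated
import sys

word = sys.argv[1]
# First letter: one or more of the letter or a hyphen, then optional whitespace.
pattern = '[' + word[0] + r'\-]+\s*'
# Middle letters: one or more of the letter, a hyphen, or whitespace.
for i in range(1, len(word) - 1):
    pattern = pattern + r'[' + word[i]
    pattern = pattern + r'\-\s]+'
# Last letter, followed by a word boundary.
pattern = pattern + word[-1] + r'\b'
print(pattern)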
Upvotes: 0
Reputation: 425043
Considering that hy-phen-ated should also match, I think this is a case where regex alone isn't the right way to go.
I would do this (not knowing your language, I've used pseudocode):
.*hyphenated.*
All languages can achieve step 1 trivially, and the code would be so much more readable.
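One way to read this approach is as "normalise first, then search for the plain word". The sketch below assumes that the normalising step removes every hyphen together with any whitespace after it; that is an interpretation, not something spelled out above, and `contains_word` is a hypothetical name.

```python
import re

# Assumption: a hyphen plus any following whitespace (including a line break)
# is treated as a soft break and removed.
SOFT_BREAK = re.compile(r"-\s*")

def contains_word(text, word):
    """Strip soft breaks, then do a plain substring search for `word`."""
    joined = SOFT_BREAK.sub("", text)
    return word in joined

print(contains_word("This sentence contains a hyphena-\nted word.", "hyphenated"))
print(contains_word("hy-phen-ated", "hyphenated"))
```

The trade-off is that genuinely hyphenated compounds such as "well-known" are joined too, which is harmless for a yes/no search but loses the original offsets.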
Upvotes: 1