user1775138

Reputation: 230

regex to match specific words hyphenated at arbitrary positions and split across two lines

I wish to search a text file for a given word that may optionally be hyphenated at an unknown position within the word and split across consecutive lines.

e.g., match "hyphenated" within:

This sentence contains a hyphena-
ted word.

Closest (unattractive) solution:

"h\(-\s*\n\s*\)\?y\(-\s*\n\s*\)\?p\(-\s*\n\s*\)\?h\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?n\(-\s*\n\s*\)\?a\(-\s*\n\s*\)\?t\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?d"

I'm hoping that someone with stronger regex-fu than mine can come up with a regex that clearly includes the word being searched for, i.e., I'd like to see "hyphenated" in there. I haven't found a way to encode something like the following (which would be buggy anyway, since it would match "hy-ted"):

"{prefix-of:hyphenated}{hyphen/linebreak}{suffix-of:hyphenated}"

I realize that pre-processing the document to collapse such words would make the search simpler, but I'm looking for a regex that I can use in a context where this won't be possible due to the tools involved.

Upvotes: 4

Views: 866

Answers (3)

famousgarkin

Reputation: 14116

Another way to approach this, right off the bat, is to "slide" the hyphenation like this:

hyphenated|h(-\s*\n\s*)yphenated|hy(-\s*\n\s*)phenated|hyp(-\s*\n\s*)henated|hyph(-\s*\n\s*)enated|hyphe(-\s*\n\s*)nated|hyphen(-\s*\n\s*)ated|hyphena(-\s*\n\s*)ted|hyphenat(-\s*\n\s*)ed|hyphenate(-\s*\n\s*)d

This reads better, but I don't know how it compares performance-wise with your original pattern.
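Writing that alternation by hand gets tedious for longer words, so it could be generated instead. The following is only a sketch, assuming Python's `re` module; the helper name `sliding_pattern` and the constant `BREAK` are made up for illustration, and the break is wrapped in a non-capturing group rather than the capturing one shown above.

```python
import re

# Hypothetical names; BREAK mirrors the hyphen/line-break piece used above.
BREAK = r"-\s*\n\s*"

def sliding_pattern(word):
    """Build an alternation with the break slid to every interior position."""
    alternatives = [re.escape(word)]  # the unbroken word comes first
    for i in range(1, len(word)):
        # Split the word after i letters and put the break in between.
        alternatives.append(
            re.escape(word[:i]) + "(?:" + BREAK + ")" + re.escape(word[i:])
        )
    return "|".join(alternatives)

pattern = re.compile(sliding_pattern("hyphenated"))
text = "This sentence contains a hyphena-\nted word."
match = pattern.search(text)
print(repr(match.group(0)))
```

Because every alternative contains all the letters of the word, something like "hy-ted" is not matched.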


Yet another idea is to first narrow the search with a pattern along these lines:

h[hypenatd]{0,9}(-\s*\n*\s)?[hypenatd]{0,9}

and then match within the results of this one.

In fact, if I'm not mistaken, if you match with groups like this:

(h[hypenatd]{0,9})(?:-\s*\n*\s)?([hypenatd]{0,9})

then the occurrences of the word hyphenated are all the matches where, in pseudocode:

(match.group1 + match.group2) == "hyphenated"
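As a rough sketch of that two-step check, assuming Python's `re` module (the function name `find_hyphenated` is hypothetical), the candidate pattern is copied from above and each candidate is kept only if its two groups join to the word:

```python
import re

# Candidate pattern copied from above: group 1 is the text before an
# optional hyphen-plus-whitespace break, group 2 is the text after it.
CANDIDATE = re.compile(r"(h[hypenatd]{0,9})(?:-\s*\n*\s)?([hypenatd]{0,9})")

def find_hyphenated(text, word="hyphenated"):
    """Return (start, end) spans of candidates whose groups join to `word`."""
    spans = []
    for m in CANDIDATE.finditer(text):
        # Keep the candidate only if the two halves spell the whole word.
        if m.group(1) + m.group(2) == word:
            spans.append(m.span())
    return spans

text = "This sentence contains a hyphena-\nted word, and a hyphenated one."
for start, end in find_hyphenated(text):
    print(repr(text[start:end]))
```

One caveat: the groups are greedy, so a longer run of the same letters directly in front of the word (say "hhyphenated") would be consumed as a single candidate and rejected, and that occurrence would be missed. Note also that the break in this pattern accepts a hyphen followed by any whitespace, not only a line break.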

Upvotes: 0

David

Reputation: 6571

I think this would work. If you have many words to search for, you would probably want to create a script to generate the search pattern for you.

[h\-]+\s*[y\-\s]+[p\-\s]+[h\-\s]+[e\-\s]+[n\-\s]+[a\-\s]+[t\-\s]+[e\-\s]+d\b

I don't think you mentioned which language you are using, but I tested this with .NET.

Here's a simple Python script that will generate search patterns:

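# patterngen.py
# Usage:  python patterngen.py <word>
# Example:  python patterngen.py hyphenated

import sys

word = sys.argv[1]

# First letter: one or more of the letter or a hyphen, then optional whitespace.
pattern = '[' + word[0] + r'\-]+\s*'

# Middle letters: one or more of the letter, a hyphen, or whitespace.
for i in range(1, len(word) - 1):
    pattern = pattern + '[' + word[i] + r'\-\s]+'

# Last letter, followed by a word boundary.
pattern = pattern + word[-1] + r'\b'
print(pattern)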

Upvotes: 0

Bohemian

Reputation: 425043

Considering that hy-phen-ated should also match, I think this is a case where regex alone isn't the right way to go.

I would do this (not knowing your language, I've used pseudocode):

  1. remove hyphens and newlines from input
  2. match cleaned input with .*hyphenated.*

All languages can achieve step 1 trivially, and the code would be so much more readable.
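The two steps could look roughly like this in Python. This is a sketch with one deliberate narrowing of step 1, which is an assumption on my part: it removes every hyphen together with any line break that directly follows it, but leaves ordinary newlines alone so that separate words on adjacent lines are not glued together. The function name `contains_word` is hypothetical.

```python
import re

def contains_word(text, word):
    """Strip hyphens (and a line break right after one), then search plainly."""
    # Step 1 (narrowed, see the note above): drop each hyphen and, if present,
    # the line break and surrounding whitespace that follows it.
    cleaned = re.sub(r"-(?:\s*\n\s*)?", "", text)
    # Step 2: a plain substring test replaces the .*hyphenated.* match.
    return word in cleaned

text = "This sentence contains a hyphena-\nted word."
print(contains_word(text, "hyphenated"))
```

A side effect is that legitimately hyphenated compounds such as "well-known" are collapsed too, which only matters if the search word itself could collide with one of them.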

Upvotes: 1
