DyingIsFun

Reputation: 1238

Finding Arbitrarily Long Word Patterns using Regular Expression in Python

I am using Python 3.6 to find all occurrences of "as" + words + "as" in a text, with a context of three words on either side.

For example, if I run my program on

"The dog was as wildly energetic as the old one. It was as bright as it has ever been."

the ideal output would be

"The dog was as wildly energetic as the old one"
"one. It was as bright as it has ever"

This should be an easy task, but I can't figure it out. (I'm pretty new to programming.) At first I tried to do this on word-tokenized versions of the text, but I now think it may be easier to use a regular expression on the raw string.

The best I could come up with is:

# FINDING __ AS __ AS __ PATTERNS

import re

raw = "The dog was as wildly energetic as the old one. It was as bright as it has ever been."

pattern_find = re.compile(r'\w* as \w* as \w*')    # Compile the regex.

results = pattern_find.findall(raw)    # Run the search and collect all matches in a list.

print(results)

which outputs

['was as bright as it']

completely ignoring the case where there are two words between the two occurrences of "as". This surprised me, since I thought that putting the asterisk * on \w would capture arbitrarily long sequences of words. (What seems to be happening is that \w* captures arbitrarily long strings of consecutive characters, rather than words.)

My questions are:

  1. How can I use a regular expression to get what I want?
  2. Is there a better way to achieve my desired result?

NOTE: I know that I can use concordance() of NLTK to find single words with context, but it doesn't allow users to specify patterns of words. An alternative to using a regular expression might involve building a function off of concordance().

Upvotes: 0

Views: 328

Answers (2)

yellowblood

Reputation: 1631

\w is a single word character, not an entire word. \w* will indeed match a single word (i.e. consecutive word characters), but it can also match nothing at all. It is better to use \w+, which matches one or more word characters rather than zero or more.
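As a minimal sketch of the difference between the two quantifiers (the sample strings here are made up for illustration):

```python
import re

# \w* may match the empty string; \w+ needs at least one word character.
star_matches_empty = re.fullmatch(r'\w*', '') is not None
plus_matches_empty = re.fullmatch(r'\w+', '') is not None
print(star_matches_empty, plus_matches_empty)

# Neither quantifier crosses whitespace, so each one covers at most one word.
first_word = re.match(r'\w+', 'wildly energetic').group()
print(first_word)

# Because \w* can be empty, the question's pattern can succeed
# with no words at all around the two "as" tokens.
empty_context = re.search(r'\w* as \w* as \w*', ' as  as ')
print(repr(empty_context.group()))
```

This is why \w* between the two "as" tokens never covers "wildly energetic": the space between the two words is not a word character.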

So you may try to match more than a single word:

\w+ \w+ \w+ as \w+ as \w+ \w+ \w+

Or with an explicit repetition count:

(\w+ ){3}as \w+ as (\w+ ){3}

If you don't care how many words there are between the "as", you may match any number of occurrences:

(\w+ ){3}as (\w+ )+as (\w+ ){3}

A more advanced way to do this would be something like:

(?:(?:\w+\s+)+as\s+){2}(?:\w+\s+)+
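Two things to watch for when trying the patterns above in Python, shown as a rough sketch. The punctuation-free `text` string is an assumption made up for the demo, and `raw` is the sample from the question. With capturing groups, `re.findall` returns the groups rather than the whole match, and `\w+ ` cannot step over a word that ends in punctuation such as "one.".

```python
import re

# Made-up sample without punctuation, so that the counted pattern can match.
text = "the dog was as wildly energetic as the old one was"
# The sample string from the question.
raw = ("The dog was as wildly energetic as the old one. "
       "It was as bright as it has ever been.")

# With capturing groups, findall returns the last capture of each group.
counted = r'(\w+ ){3}as (\w+ )+as (\w+ ){3}'
groups_only = re.findall(counted, text)
print(groups_only)

# Non-capturing groups (?:...) make findall return the whole match instead.
non_capturing = r'(?:\w+ ){3}as (?:\w+ )+as (?:\w+ ){3}'
whole = re.findall(non_capturing, text)
print(whole)

# On the question's sample, "one." is not followed directly by a space,
# and only two words precede the second "as" in its sentence,
# so the fixed count of three context words cannot be satisfied.
no_match = re.search(non_capturing, raw)
print(no_match)
```

Note that the whole match keeps a trailing space, because each `\w+ ` unit includes the space after the word.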

Upvotes: 1

Aran-Fey

Reputation: 43246

Regex is the right tool for the job, though there are a few pitfalls. You have to make a pattern that captures 3 words of context at most, but fewer if there aren't 3 words.

This regex should do the trick:

(?:\S+\s+){,3}\b[aA]s(?:\s+\S+)+?\s+as\b(?:\s+\S+){,3}

Explanation:

(?:\S+\s+){,3}  # match a word, followed by space(s). Up to 3 times.
\b[aA]s         # assert word boundary and match "as"
(?:\s+\S+)+?    # match any number of words, but as few as possible
\s+             # followed by space(s)
as\b            # and another "as"
(?:\s+\S+){,3}  # match up to 3 more words
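For reference, here is a sketch of running this pattern on the sample string from the question. Since `findall` returns non-overlapping matches, the second result cannot reuse "one." as left-hand context. The helper `as_as_with_context` below is a hypothetical addition, not part of the answer's regex. It is one possible way to get overlapping context, by matching only the "as ... as" core and slicing the context words out of the surrounding text.

```python
import re

raw = ("The dog was as wildly energetic as the old one. "
       "It was as bright as it has ever been.")

# The pattern from this answer, unchanged.
pattern = re.compile(r'(?:\S+\s+){,3}\b[aA]s(?:\s+\S+)+?\s+as\b(?:\s+\S+){,3}')
matches = pattern.findall(raw)
print(matches)
# ['The dog was as wildly energetic as the old one.',
#  'It was as bright as it has ever']


def as_as_with_context(text, width=3):
    """Hypothetical helper: match the 'as ... as' core, then add up to
    `width` whitespace-separated words of context on each side."""
    core = re.compile(r'\b[aA]s\b(?:\s+\S+)+?\s+as\b')
    results = []
    for m in core.finditer(text):
        before = text[:m.start()].split()
        after = text[m.end():].split()
        left = before[-width:] if width else []  # before[-0:] would be the whole list
        results.append(' '.join(left + [m.group()] + after[:width]))
    return results


with_context = as_as_with_context(raw)
print(with_context)
# ['The dog was as wildly energetic as the old one.',
#  'one. It was as bright as it has ever']
```

Both versions keep trailing punctuation ("one.") because \S+ and split() treat any run of non-whitespace as a word; stripping it would be a separate step.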

Upvotes: 1
