tadm123
tadm123

Reputation: 8787

Matching words with Regex (Python 3)

I had been staring at this problem for hours, I don't know what regex format to use to solve this problem.

Problem:

Given the following input strings, find all possible output words 5 characters or longer.

  1. qwertyuytresdftyuioknn
  2. gijakjthoijerjidsdfnokg

Your program should find all possible words (5+ characters) that can be derived from the strings supplied. Use http://norvig.com/ngrams/enable1.txt as your search dictionary. The order of the output words doesn't matter.

  1. queen question
  2. gaeing garring gathering gating geeing gieing going goring

Assumptions about the input strings:

Attempted solution:

First I downloaded the the words from that webpage and store them in a file in my computer ('words.txt'):

import requests
res = requests.get('http://norvig.com/ngrams/enable1.txt')
res.raise_for_status()
fp = open('words.txt', 'wb')
for chunk in res.iter_content(100000):
    fp.write(chunk)
fp.close()

I'm then trying to match the words I need using regex. The problem is that I don't know how to format my re.compile() to achieve this.

import re
input = 'qwertyuytresdftyuioknn'         #example
fp= open('words.txt')
string = fp.read()

regex = re.compile(input[0]+'\w{3,}'+input[-1])   #wrong need help here
regex.findall(string)

As it's obvious, it's wrong since I need to match letters from my input string going form left to right, not any letters which I'm mistakenly doing with \w{3,}. Any help into this would be greatly appreciated.

Upvotes: 1

Views: 596

Answers (1)

Peter
Peter

Reputation: 553

This feels a bit like a homework problem. Thus, I won't post the full answer, but will try to give some hints: Character groups to match are given between square brackets [adfg] will match any of the letters a, d, f or g. [adfg]{3,} will match any part with at least 3 of these letters. Looking at your list of words, you only want to match whole lines. If you pass re.MULTILINE as the second argument to re.compile, ^ will match the beginning and $ the end of a line.

Addition:

If the characters can only appear in the order given and assuming that each character can appear any number of times: 'qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*n'. However, we will also have to have at least 5 characters in total. A positive lookbehind assertion (?<=\w{5}) added to the end will ensure that.

Upvotes: 1

Related Questions