Reputation: 8787
I had been staring at this problem for hours, I don't know what regex format to use to solve this problem.
Problem:
Given the following input strings, find all possible output words 5 characters or longer.
Your program should find all possible words (5+ characters) that can be derived from the strings supplied. Use http://norvig.com/ngrams/enable1.txt as your search dictionary. The order of the output words doesn't matter.
Assumptions about the input strings:
Attempted solution:
First I downloaded the the words from that webpage and store them in a file in my computer ('words.txt'):
import requests
res = requests.get('http://norvig.com/ngrams/enable1.txt')
res.raise_for_status()
fp = open('words.txt', 'wb')
for chunk in res.iter_content(100000):
fp.write(chunk)
fp.close()
I'm then trying to match the words I need using regex. The problem is that I don't know how to format my re.compile()
to achieve this.
import re
input = 'qwertyuytresdftyuioknn' #example
fp= open('words.txt')
string = fp.read()
regex = re.compile(input[0]+'\w{3,}'+input[-1]) #wrong need help here
regex.findall(string)
As it's obvious, it's wrong since I need to match letters from my input string going form left to right, not any letters which I'm mistakenly doing with \w{3,}
. Any help into this would be greatly appreciated.
Upvotes: 1
Views: 596
Reputation: 553
This feels a bit like a homework problem. Thus, I won't post the full answer, but will try to give some hints: Character groups to match are given between square brackets [adfg]
will match any of the letters a, d, f or g. [adfg]{3,}
will match any part with at least 3 of these letters. Looking at your list of words, you only want to match whole lines. If you pass re.MULTILINE
as the second argument to re.compile
, ^
will match the beginning and $
the end of a line.
Addition:
If the characters can only appear in the order given and assuming that each character can appear any number of times: 'qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*n'
. However, we will also have to have at least 5 characters in total. A positive lookbehind assertion (?<=\w{5})
added to the end will ensure that.
Upvotes: 1