Reputation: 298
A string is given as input (e.g. "What is your name?"). The input always contains a question that I want to extract, but it also always contains unwanted text around that question.
So the input could be (but is not limited to) the following:
1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"
2- "What is your\nlastname and email?\ndasf?lkjas"
3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
(Notice that in the third input, the question starts with the word "Given" and ends with "yourself?")
The above input examples are generated by the pytesseract OCR library when scanning an image and converting it into text.
I only want to extract the question from the garbage input and nothing else.
I tried to use the string method find('?', 1) to get the index of the last character of the question (assuming for now that the first question mark is always the end of the question and not part of the input that I don't want). But I can't figure out how to get the index of the first letter of the question. I tried to loop in reverse and stop at the first \n I find, but the question doesn't always have a \n right before its first letter.
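def extractQuestion(q):
    # Index of the first '?', assumed to be the end of the question.
    index_end_q = q.find('?', 1)
    index_first_letter_of_q = 0  # TODO: find where the question starts
    # Slice up to and including the question mark.
    question = q[index_first_letter_of_q:index_end_q + 1]
    return question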
Upvotes: 13
Views: 1347
Reputation: 1290
One way to find the index of the question's first word is to search for the first word that has an actual meaning (I suppose you're interested in English words). This can be done with pyenchant:
#!/usr/bin/env python
import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
    return True if GLOSSARY.check(word) else False

sentences = [
    "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
    "What is your\nlastname and email?\ndasf?lkjas",
    "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

for sentence in sentences:
    for i, w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break
The above piece of code gives as a result:
index: 3 => What
index: 0 => What
index: 0 => Given
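To go from the word index to the question text itself, one option is to combine this dictionary check with the asker's assumption that the first question mark after the start ends the question. The following is only a sketch: `extract_question`, `is_known_word` and the tiny `KNOWN_WORDS` set are hypothetical names introduced here so that the example runs without pyenchant. In practice the `check` method of `enchant.Dict("en_US")` could be passed as `is_word` instead.

```python
import re

# Hypothetical stand-in vocabulary so the sketch runs without pyenchant.
KNOWN_WORDS = {"what", "is", "your", "name", "given", "skills", "how",
               "would", "you", "rate", "yourself", "lastname", "and", "email"}


def is_known_word(word):
    """Stand-in for a real dictionary lookup such as enchant's check()."""
    return word.strip(".,?!").lower() in KNOWN_WORDS


def extract_question(text, is_word=is_known_word):
    """Return the text from the first recognised word up to the next '?'."""
    # Iterate over whitespace-separated tokens together with their offsets.
    for token in re.finditer(r"\S+", text):
        if is_word(token.group()):
            start = token.start()
            end = text.find("?", start)
            if end == -1:
                return None  # no question mark after the first real word
            return text[start:end + 1]
    return None  # no recognised word at all


print(repr(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn")))
```

Note that a real dictionary may accept short garbage tokens as valid words, so the start index can still be wrong on noisy OCR output.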
Upvotes: 11
Reputation: 82899
You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:
a word boundary \b
with an upper case letter [A-Z]
followed by a lower case letter [a-z],
then one or more characters that are not a question mark [^?]+,
followed by a literal question mark \?.
This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.
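For readers who prefer the pattern annotated in place, the same expression can be written with `re.VERBOSE`, which ignores whitespace in the pattern and allows comments. This is only an alternative spelling of the regular expression above; the name `QUESTION_RE` is made up for this sketch.

```python
import re

QUESTION_RE = re.compile(r"""
    \b        # word boundary: the match has to start at the beginning of a word
    [A-Z]     # one upper case ASCII letter
    [a-z]     # one lower case ASCII letter
    [^?]+     # one or more characters other than a question mark (newlines allowed)
    \?        # a literal question mark
""", re.VERBOSE)

match = QUESTION_RE.search("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn")
print(match.group())
```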
>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
...          "What is your\nlastname and email?\ndasf?lkjas",
...          "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
If that's one blob of text, you can use findall instead of search:
>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
Actually, this also seems to work reasonably well for questions with names in them:
>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
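One practical detail: `re.search` returns `None` when nothing matches, so calling `.group()` directly raises `AttributeError` on input without a matching question. A small wrapper can guard against that. This is a sketch; `extract_question` and `PATTERN` are names chosen here, not part of the answer. The last two calls illustrate the limitations mentioned earlier: a question starting with an acronym is missed, and capitalised garbage before the question gets pulled into the match.

```python
import re
from typing import Optional

PATTERN = re.compile(r"\b[A-Z][a-z][^?]+\?")


def extract_question(text: str) -> Optional[str]:
    """Return the first question-like span in text, or None if there is none."""
    match = PATTERN.search(text)
    return match.group() if match else None


print(repr(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn")))
print(repr(extract_question("no question mark here")))
# Miss: the question starts with an acronym, so no word begins with [A-Z][a-z].
print(repr(extract_question("junk\nHTML is what exactly?")))
# False positive: the capitalised garbage word becomes the start of the match.
print(repr(extract_question("Lorem junk\nWhat is your name?")))
```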
Upvotes: 6