Reputation: 689
I am trying to find a string within a long text extracted from a PDF file, and get the string's position in the text, and then return 100 words before the string and 100 after. The problem is that the extraction is not perfect, so I am having a problem like this:
The query string is "test text"
The text may look like:
This is atest textwith a problem
As you can see, the word "test" is joined with the letter "a", and the word "text" is joined with the word "with".
So the only function that works for me is __contains__, which doesn't return the position of the word.
Any ideas on how to find all the occurrences of a word in such a text, along with their positions?
Thank you very much
Upvotes: 1
Views: 313
Reputation: 46789
You could take the following kind of approach. This first attempts to split the whole text into words, and keeps note of the index of each word.
Next it iterates through the text looking for test text with zero or more whitespace characters between the two words. For each match it notes the start and end, then builds lists of the words found before and after that point, using Python's bisect library to locate the required entries in the words list.
import bisect
import re

test = "aa bb cc dd test text ee ff gg testtextwith hh ii jj"
# (start index, word) for every word in the text, in order of appearance
words = [(w.start(), w.group(0)) for w in re.finditer(r'(\b\w+?\b)', test)]
adjacent_words = 2

for match in re.finditer(r'(test\s*?text)', test):
    start, end = match.span()
    # Locate the words list entries on either side of the match
    words_start = bisect.bisect_left(words, (start, ''))
    words_end = bisect.bisect_right(words, (end, ''))
    words_before = [w for i, w in words[words_start-adjacent_words : words_start]]
    words_after = [w for i, w in words[words_end : words_end + adjacent_words]]
    # Adjacent words as a list
    print(words_before, match.group(0), words_after)
    # Or, surrounding text as is.
    print(test[words[words_start-adjacent_words][0] : words[words_end+adjacent_words][0]])
    print()
So for this example with 2 adjacent words, you would get the following output:
['cc', 'dd'] test text ['ee', 'ff']
cc dd test text ee ff
['ff', 'gg'] testtext ['hh', 'ii']
ff gg testtextwith hh ii
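The slices above assume that every match has at least adjacent_words words on each side. As a rough sketch of the same bisect idea with the indices clamped, so that a match near the start or end of the text does not raise IndexError or wrap around through a negative index, something like the following could work. The function name words_around and its parameters are made up for illustration and are not part of the answer's code.

```python
import bisect
import re

def words_around(text, pattern, n):
    """Yield (words_before, matched_text, words_after) for each match.

    Hypothetical helper: at most n words are returned on each side,
    fewer if the match sits near the start or end of the text.
    """
    # (start index, word) for every word, in order of appearance
    words = [(m.start(), m.group(0)) for m in re.finditer(r'\w+', text)]
    starts = [s for s, _ in words]
    for match in re.finditer(pattern, text):
        start, end = match.span()
        # Index of the first word starting at or after the match start
        first = bisect.bisect_left(starts, start)
        # Index of the first word starting at or after the match end
        last = bisect.bisect_left(starts, end)
        before = [w for _, w in words[max(0, first - n):first]]
        after = [w for _, w in words[last:last + n]]
        yield before, match.group(0), after

for before, hit, after in words_around("test text ee ff", r'test\s*text', 2):
    print(before, hit, after)
```

For the original use case, n would be 100. Note that a word glued to the front of the match (such as "atest") starts before the match and is therefore counted among the words before it.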
Upvotes: 3
Reputation: 104102
You might have a look at the regex module which allows for 'fuzzy' matching:
>>> import regex
>>> s='This is atest textwith a problem'
>>> regex.search(r'(?:text with){e<2}', s)
<regex.Match object; span=(14, 22), match='textwith', fuzzy_counts=(0, 0, 1)>
>>> regex.search(r'(?:test text){e<2}', s)
<regex.Match object; span=(8, 18), match='atest text', fuzzy_counts=(0, 1, 0)>
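You can match text that has insertions, deletions, and substitutions. The match object returned carries the span, that is, the start and end index of the match.
You can use regex.findall to find all the potential target matches, or regex.finditer if you also need their positions.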
Perfect for what you are describing.
Upvotes: 1
Reputation: 723
You did not specify all your requirements, but this works for your current problem. The program prints 9 and 42, which are the start positions of the two occurrences of test text.
import re

filt = re.compile("test text")
for match in filt.finditer('This is atest textwith a problem. another test text'):
    # start() is the index of the first character of the match
    print(match.start())
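The literal pattern above only matches when the space between the two words survived the PDF extraction. A possible variation, assuming the extraction may also drop that space entirely, is to build the pattern from the query's words and allow zero or more whitespace characters between them. The helper name find_loose is hypothetical.

```python
import re

def find_loose(query, text):
    """Return the start offsets where the words of query appear in order,
    separated by zero or more whitespace characters."""
    parts = query.split()
    if not parts:
        # An empty pattern would match at every position, so bail out early
        return []
    # re.escape keeps characters such as '.' or '(' in the query literal
    pattern = r'\s*'.join(re.escape(word) for word in parts)
    return [m.start() for m in re.finditer(pattern, text)]

text = 'This is atest textwith a problem. another testtext'
print(find_loose('test text', text))  # prints [9, 42]
```

Each offset can then be used to slice out the surrounding text or to look up the neighbouring words.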
Upvotes: 4
Reputation: 43544
If you're looking for the position of the text within the string, you can use str.find(), which returns the index of the first occurrence, or -1 if there is none.
>>> query_string = 'test text'
>>> text = 'This is atest textwith a problem'
>>> if query_string in text:
...     print(text.find(query_string))
...
9
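Since find() only reports the first occurrence, getting every position means calling it repeatedly with its optional start argument. A minimal sketch follows; the function name find_all is made up for this example.

```python
def find_all(text, query):
    """Return the start index of every occurrence of query in text,
    including overlapping ones."""
    positions = []
    start = text.find(query)
    while start != -1:
        positions.append(start)
        # Resume the search one character past the previous hit
        start = text.find(query, start + 1)
    return positions

text = 'This is atest textwith a problem. another test text'
print(find_all(text, 'test text'))  # prints [9, 42]
```

Like the in check, this only finds exact matches, so it works when the space inside the query survived the extraction.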
Upvotes: 2