The Maestro

Reputation: 689

Find all the occurrences of a string in an imperfect text

I am trying to find a string within a long text extracted from a PDF file, get the string's position in the text, and then return the 100 words before the string and the 100 words after it. The problem is that the extraction is not perfect, so I run into cases like this:

The query string is "test text"

The text may look like:

This is atest textwith a problem

As you can see, the word "test" is joined to the letter "a", and the word "text" is joined to the word "with".

So the only function that works for me is __contains__, which doesn't return the position of the word.

Any ideas on how to find all the occurrences of a word in such a text, along with their positions?

Thank you very much

Upvotes: 1

Views: 313

Answers (4)

Martin Evans

Reputation: 46789

You could take the following kind of approach. This first attempts to split the whole text into words, and keeps note of the index of each word.

Next it iterates through the text looking for "test" followed by "text" with zero or more whitespace characters in between. For each match it notes the start and end, then builds lists of the words found before and after the match, using Python's bisect module to locate the required entries in the words list.

import bisect
import re

test = "aa bb cc dd test text ee ff gg testtextwith hh ii jj"

words = [(w.start(), w.group(0)) for w in re.finditer(r'(\b\w+?\b)', test)]

adjacent_words = 2

for match in re.finditer(r'(test\s*?text)', test):
    start, end = match.span()

    words_start = bisect.bisect_left(words, (start, ''))
    words_end = bisect.bisect_right(words, (end, ''))

    words_before = [w for i, w in words[words_start-adjacent_words : words_start]]
    words_after = [w for i, w in words[words_end : words_end + adjacent_words]]

    # Adjacent words as a list
    print(words_before, match.group(0), words_after)

    # Or, surrounding text as is.
    print(test[words[words_start-adjacent_words][0] : words[words_end+adjacent_words][0]])

    print()

So for this example with 2 adjacent words, you would get the following output:

['cc', 'dd'] test text ['ee', 'ff']
cc dd test text ee ff 

['ff', 'gg'] testtext ['hh', 'ii']
ff gg testtextwith hh ii
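
One caveat: the listing above indexes the words list at words_start-adjacent_words and words_end+adjacent_words, so a match within adjacent_words words of either end of the text can wrap around or raise an IndexError. A minimal Python 3 sketch of the same bisect idea with clamped bounds might look like this. The function name find_with_context and its parameters are made up for illustration, and the pattern is passed in as a regular expression:

```python
import bisect
import re


def find_with_context(text, pattern, adjacent_words=2):
    """Return (start, words_before, matched_text, words_after) per match."""
    # Start offset and text of every word, in document order.
    words = [(m.start(), m.group(0)) for m in re.finditer(r'\w+', text)]
    starts = [start for start, _ in words]

    results = []
    for match in re.finditer(pattern, text):
        start, end = match.span()
        # Index of the first word beginning at or after the match start,
        # and of the first word beginning at or after the match end.
        first = bisect.bisect_left(starts, start)
        last = bisect.bisect_left(starts, end)
        # Clamp the lower bound so a match near the start does not wrap around;
        # slicing past the end of a list is already safe in Python.
        before = [w for _, w in words[max(0, first - adjacent_words):first]]
        after = [w for _, w in words[last:last + adjacent_words]]
        results.append((start, before, match.group(0), after))
    return results


sample = "aa bb cc dd test text ee ff gg testtextwith hh ii jj"
for start, before, matched, after in find_with_context(sample, r'test\s*text'):
    print(start, before, matched, after)
```

For the 100 words requested in the question, adjacent_words=100 would be passed instead. Note that a word glued onto the match (such as "testtextwith") is skipped rather than counted among the words after it.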

Upvotes: 3

dawg

Reputation: 104102

You might have a look at the third-party regex module (installable from PyPI), which allows for 'fuzzy' matching:

>>> import regex
>>> s='This is atest textwith a problem'
>>> regex.search(r'(?:text with){e<2}', s)
<regex.Match object; span=(14, 22), match='textwith', fuzzy_counts=(0, 0, 1)>
>>> regex.search(r'(?:test text){e<2}', s)
<regex.Match object; span=(8, 18), match='atest text', fuzzy_counts=(0, 1, 0)>

You can match text that has insertions, deletions, and substitutions. The returned match object carries the span (start and end positions) of the match.

You can use regex.findall to find all the potential target matches, or regex.finditer if you also need their positions.

Perfect for what you are describing.

Upvotes: 1

Prophecies

Reputation: 723

You did not specify all your requirements, but this works for your current problem. The program prints 9 and 42, which are the start positions of the two occurrences of "test text".

import re
filt = re.compile("test text")

for match in filt.finditer('This is atest textwith a problem. another test text'):
    print(match.start())
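
The literal pattern above requires exactly one space between the two words. If the extraction may also drop or duplicate that space, one option is to build the pattern from the query, as in the sketch below. The helper name find_positions is hypothetical, and treating any run of whitespace (including none) as a valid separator is an assumption about how the PDF text is damaged:

```python
import re


def find_positions(query, text):
    """Return the start offset of every match of `query` in `text`."""
    # Escape each word so regex metacharacters in the query stay literal,
    # then allow zero or more whitespace characters between the words.
    pattern = re.compile(r'\s*'.join(re.escape(word) for word in query.split()))
    return [match.start() for match in pattern.finditer(text)]


# Made-up sample: one normal, one glued, and one double-spaced occurrence.
text = 'This is atest textwith a problem. another testtext and test  text'
print(find_positions('test text', text))
```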

Upvotes: 4

pault

Reputation: 43544

If you're looking for the position of the text within the string, you can use str.find().

>>> query_string = 'test text'
>>> text = 'This is atest textwith a problem'
>>> if query_string in text:
...     print(text.find(query_string))
...
9
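
Note that str.find() returns only the first occurrence. To collect every position, the call can be repeated with its optional start argument, roughly as sketched here (find_all is a name chosen for illustration, and the sample sentence is made up):

```python
def find_all(text, query):
    """Return the start index of every occurrence of `query` in `text`."""
    positions = []
    index = text.find(query)
    while index != -1:
        positions.append(index)
        # Resume one character later so overlapping occurrences are found too.
        index = text.find(query, index + 1)
    return positions


text = 'This is atest textwith a problem. another test text'
print(find_all(text, 'test text'))
```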

Upvotes: 2
