Correct offset of characters with extra white spaces, newlines, etc

Question

I am trying to implement a simple solution for offset correction due to 'junk' characters, for example:

string_1 = "London is the capital of the UK"

-> char locations: ("capital", 14, 21) and ("UK", 29, 31)

However, in the presence of newlines etc., the character locations change:

string_2 = "London       is the\n\ncapital of the UK"

-> char locations: ("capital", 21, 28) and ("UK", 36, 38)

Moreover, the string may contain any number of newlines and other artefacts, as well as any number of keywords.

My question is: given a text with extra characters (ASCII, newlines, etc.) and some cleaning function, how can I adjust the locations of certain keywords in the cleaned text?

string_2 = cleaning_txt(string_1)

-> ("capital", 14, 21) --> ("capital", 21, 28)
-> ("UK", 29, 31) --> ("UK", 36, 38)

yatu · Accepted Answer

str.find works fine for this purpose:

string_1 = "London is the capital of the UK"
# seven spaces after "London" and two newlines before "capital"
string_2 = "London       is the\n\ncapital of the UK"

def find_pos(s, match):
    pos = s.find(match)
    return (pos, pos+len(match))

match = 'capital'

find_pos(string_1, match)
# (14, 21)

find_pos(string_2, match)
# (21, 28)
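
Note that str.find only returns the first occurrence of the keyword, so it does not help if the same word appears more than once or if you already have offsets in the clean text and need to translate them. One possible alternative, sketched below, is to record where every kept character came from while cleaning. The helper names clean_with_offsets and to_raw_span are made up for this illustration, and the sketch assumes that "cleaning" only means collapsing each run of whitespace into a single space (leading and trailing whitespace is not stripped) and that spans are non-empty.

```python
def clean_with_offsets(raw):
    """Collapse each whitespace run to one space.

    Returns the cleaned string and a list index_map where
    index_map[i] is the index in raw of the i-th cleaned character.
    (Hypothetical helper; assumes cleaning == whitespace collapsing.)
    """
    cleaned = []
    index_map = []
    prev_space = False
    for i, ch in enumerate(raw):
        if ch.isspace():
            if prev_space:
                continue  # drop the extra whitespace character
            cleaned.append(" ")
            prev_space = True
        else:
            cleaned.append(ch)
            prev_space = False
        index_map.append(i)
    return "".join(cleaned), index_map


def to_raw_span(span, index_map):
    """Translate a non-empty (start, end) span, end exclusive,
    from cleaned-text offsets to raw-text offsets."""
    start, end = span
    # Map the last character of the span, then add one for the exclusive end.
    return index_map[start], index_map[end - 1] + 1


raw = "London       is the\n\ncapital of the UK"
cleaned, index_map = clean_with_offsets(raw)

print(cleaned)                              # London is the capital of the UK
print(to_raw_span((14, 21), index_map))     # (21, 28)
print(to_raw_span((29, 31), index_map))     # (36, 38)
```

The same index_map can be reused for any number of keywords, and it works for repeated words because it translates positions rather than searching for text.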
