Arnold Klein

Reputation: 3086

Correcting character offsets in text with extra whitespace, newlines, etc.

I am trying to implement a simple way to correct keyword offsets that are shifted by 'junk' characters. For example:

string_1 = "London is the capital of the UK"

-> character locations: ("capital", 14, 21) and ("UK", 29, 31)

However, in the presence of newlines etc., the character locations change:

string_2 = "London       is the\n\ncapital of the\nUK"

-> character locations: ("capital", 21, 28) and ("UK", 36, 38)

Moreover, the string may contain any number of newlines and other artefacts, as well as any number of keywords.

My question is: given a text with extra characters (ASCII, newlines, etc.) and some cleaning function, how can I adjust the locations of certain keywords found in the cleaned text?

string_1 = cleaning_txt(string_2)

-> ("capital", 14, 21) --> ("capital", 21, 28)
-> ("UK", 29, 31) --> ("UK", 36, 38)

Upvotes: 1

Views: 198

Answers (2)

yatu

Reputation: 88266

str.find works fine for this purpose:

string_1 = "London is the capital of the UK"
string_2 = "London       is the\n\ncapital of the UK"

def find_pos(s, match):
    # str.find returns the index of the first occurrence, or -1 if absent
    pos = s.find(match)
    if pos == -1:
        return None
    return (pos, pos + len(match))

match = 'capital'

find_pos(string_1, match)
# (14, 21)

find_pos(string_2, match)
# (21, 28)
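
Note that str.find only reports the first occurrence of a single keyword. If the text can contain several keywords, or the same keyword more than once, something along the following lines might work. This is only a sketch: find_all_pos is a hypothetical helper name, and it assumes the keywords begin and end with word characters, so that \b word boundaries make sense.

```python
import re

def find_all_pos(s, keywords):
    """Return (keyword, start, end) for every occurrence of every keyword in s."""
    # re.escape keeps the keywords literal; \b avoids matches inside longer words
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b')
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(s)]

string_2 = "London       is the\n\ncapital of the UK"
print(find_all_pos(string_2, ['capital', 'UK']))
# [('capital', 21, 28), ('UK', 36, 38)]
```

The spans are half-open, so string_2[start:end] gives back the keyword.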

Upvotes: 2

Thomas Schillaci

Reputation: 2453

You can clean your string this way:

x =  "London       is the\n\ncapital of the UK"
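x = x.replace('\n', ' ')  # replace newlines with a space so adjacent words are not fused

# collapse runs of spaces into a single space
while '  ' in x:
    x = x.replace('  ', ' ')

print(x)

Gives:

London is the capital of the UK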
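
Cleaning on its own discards the original positions. One possible way to keep them is to build the cleaned string character by character and record, for each character kept, the index it came from. The sketch below assumes that "cleaning" just means collapsing every run of whitespace into a single space; clean_with_map and to_original_span are hypothetical helper names, and a different cleaning function would need its own bookkeeping.

```python
def clean_with_map(text):
    """Collapse each whitespace run to one space; return (cleaned, index_map).

    index_map[i] is the index in `text` of the character cleaned[i].
    """
    cleaned = []
    index_map = []
    prev_space = False
    for i, ch in enumerate(text):
        if ch.isspace():
            if prev_space:
                continue  # skip the extra whitespace inside a run
            ch = ' '      # normalise newlines, tabs, etc. to a plain space
            prev_space = True
        else:
            prev_space = False
        cleaned.append(ch)
        index_map.append(i)
    return ''.join(cleaned), index_map


def to_original_span(index_map, start, end):
    """Map a half-open span [start, end) in the cleaned text back to the original."""
    return index_map[start], index_map[end - 1] + 1


original = "London       is the\n\ncapital of the UK"
cleaned, index_map = clean_with_map(original)
print(cleaned)                              # London is the capital of the UK
print(to_original_span(index_map, 14, 21))  # (21, 28)
print(to_original_span(index_map, 29, 31))  # (36, 38)
```

Keyword spans found in the cleaned text can then be translated with to_original_span, whatever the number of keywords or junk characters.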

Upvotes: 0
