Reputation: 3086
I am trying to implement a simple way to correct character offsets that shift because of 'junk' characters. For example:
string_1 = "London is the capital of the UK"
-> character locations: ("capital", 14, 21) and ("UK", 29, 31)
However, in the presence of newlines and similar characters, the character locations change:
string_2 = "London is the\n\ncapital of the\n UK"
-> character locations: ("capital", 21, 28) and ("UK", 36, 38)
Moreover, the string may contain any number of newlines and other artefacts, as well as any number of keywords.
My question: given a text with extra characters (ASCII, newlines, etc.) and some cleaning function, how can I adjust the locations of certain keywords in the cleaned text?
string_2 = cleaning_txt(string_1)
-> ("capital", 14, 21) --> ("capital", 21, 28)
-> ("UK", 29, 31) --> ("UK", 36, 38)
Upvotes: 1
Views: 198
Reputation: 88266
str.find works fine for this purpose:
string_1 = "London is the capital of the UK"
string_2 = "London is the\n\n      capital of the UK"
def find_pos(s, match):
    # start index of the first occurrence, and the index just past its end
    pos = s.find(match)
    return (pos, pos+len(match))
match = 'capital'
find_pos(string_1, match)
# (14, 21)
find_pos(string_2, match)
# (21, 28)
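Note that str.find only reports the first occurrence of a single keyword. Since the question mentions any number of keywords, here is a rough sketch of one way to collect every occurrence using re.finditer. The helper name find_all_pos is made up for illustration and is not part of the answer above.

```python
import re

def find_all_pos(s, keywords):
    """Return (keyword, start, end) for every occurrence of each keyword in s."""
    spans = []
    for kw in keywords:
        # re.escape makes the keyword match literally,
        # even if it contains regex metacharacters
        for m in re.finditer(re.escape(kw), s):
            spans.append((kw, m.start(), m.end()))
    # order the results by where they start in the text
    return sorted(spans, key=lambda t: t[1])

string_1 = "London is the capital of the UK"
print(find_all_pos(string_1, ["capital", "UK"]))
# [('capital', 14, 21), ('UK', 29, 31)]
```

Running the same function on the uncleaned string gives the shifted offsets directly, so no separate correction step is needed if you can search the raw text.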
Upvotes: 2
Reputation: 2453
You can clean your string this way:
x = "London is the\n\ncapital of the UK"
x = x.replace('\n','')
# collapse runs of spaces into a single space
while '  ' in x:
    x = x.replace('  ', ' ')
print(x)
Gives:
London is thecapital of the UK
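Cleaning alone loses the link between positions in the cleaned text and positions in the original. If you need to translate offsets between the two, one possible approach is to record, for every character kept, where it came from. The sketch below is only an illustration: it uses a different cleaning rule from the one above (each run of whitespace becomes one space, rather than newlines being deleted), and the names clean_with_map and to_raw_span are hypothetical.

```python
def clean_with_map(raw):
    """Collapse each whitespace run to one space. Also return, for every
    character kept, its index in the raw string."""
    cleaned = []
    index_map = []  # index_map[i] = position in raw of cleaned[i]
    prev_space = False
    for i, ch in enumerate(raw):
        if ch.isspace():
            if prev_space:
                continue  # drop the extra whitespace character
            ch = ' '
            prev_space = True
        else:
            prev_space = False
        cleaned.append(ch)
        index_map.append(i)
    return ''.join(cleaned), index_map

def to_raw_span(index_map, start, end):
    """Map a half-open, non-empty span in the cleaned text to the raw text."""
    return index_map[start], index_map[end - 1] + 1

raw = "London is the\n\ncapital of the\n UK"
cleaned, index_map = clean_with_map(raw)
print(cleaned)

for kw in ("capital", "UK"):
    start = cleaned.find(kw)
    raw_start, raw_end = to_raw_span(index_map, start, start + len(kw))
    # slicing the raw text with the mapped span recovers the keyword
    print(kw, (start, start + len(kw)), "->", (raw_start, raw_end), raw[raw_start:raw_end])
```

The same index_map works for any number of keywords, since it depends only on the cleaning step and not on what you search for afterwards.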
Upvotes: 0