Hien.Ha

Reputation: 51

re.search for a substring with commas/delimiters removed

I have a text and used a function to extract part of it. However, the returned value has its delimiters (e.g. ',', '-') removed. I need to locate the extracted part in the original text, getting both the matching substring and its position. For example:

original_text = "xyz, 19900 Praha 9, Letnany"
(or original_text = "xyz, 19900 Praha 9 - Letnany")
extracted_text = "praha 9 letnany" (lower case, delimiters are removed)

I expect the same output as re.search('praha 9, letnany', original_text, re.I), i.e. the substring 'Praha 9, Letnany' and the start of the match, 11.
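
To make the target concrete, here is a minimal sketch of that expected result. It assumes the delimiters are still known (which is exactly what the extracted text has lost) and uses re.I because the extracted text is lower case; the variable names are only for illustration.

```python
import re

original_text = "xyz, 19900 Praha 9, Letnany"

# The pattern still contains the comma here; re.I makes the
# comparison case-insensitive so "praha" matches "Praha".
mat = re.search("praha 9, letnany", original_text, re.I)

matched_substring = mat.group()  # text as it appears in the original
match_start = mat.start()        # 0-based index where the match begins
print(matched_substring, match_start)
```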

Is there any regular expression to locate extracted text in the original text?

Upvotes: 1

Views: 125

Answers (2)

Hien.Ha

Reputation: 51

Same idea as @ScottHunter's answer, but working at the word level instead of the character level:

import re

ori_txt  = '19900, Praha 7, Letnany'
extr_txt = 'praha 7 letnany'

delimiters = [',', r'\s', '-']

# Split the extracted text into words on any delimiter.
deli = '|'.join(delimiters)
extr_arr = re.split(deli, extr_txt)

# Character class matching any run of delimiters: [,\s-]*
ins_c = '[' + ''.join(delimiters) + ']*'

# Join the words with the delimiter class and search case-insensitively.
pat = ins_c.join(extr_arr)
mat = re.search(pat, ori_txt, re.I)
if mat:
    print(mat.group())
else:
    print('not found')

I first wanted to find a single regular expression that searches for the extracted text directly in the original text, but there seems to be no such expression. Here is another way to solve my problem. Thank you.
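
Since the question asks for both the substring and its position, the word-level idea could be wrapped roughly as follows. This is only a sketch: the function name locate_extracted and its delimiters parameter are made up for illustration, and it assumes the extracted text contains nothing but words separated by whitespace.

```python
import re

def locate_extracted(extracted, original, delimiters=r",\s-"):
    """Return (substring, start) of `extracted` in `original`, or None."""
    # Assumption: the extraction leaves only whitespace between words.
    words = extracted.split()
    if not words:
        return None
    # Allow any run of delimiter characters (possibly empty) between words.
    gap = "[" + delimiters + "]*"
    # re.escape guards against regex metacharacters inside a word.
    pattern = gap.join(re.escape(word) for word in words)
    mat = re.search(pattern, original, re.I)
    return (mat.group(), mat.start()) if mat else None

result_comma = locate_extracted("praha 9 letnany", "xyz, 19900 Praha 9, Letnany")
result_dash = locate_extracted("praha 9 letnany", "xyz, 19900 Praha 9 - Letnany")
print(result_comma)
print(result_dash)
```

Because the gap may be empty, two words that run together in the original would also match; if that is not wanted, the trailing * in the gap can be changed to +.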

Upvotes: 0

Scott Hunter

Reputation: 49813

This will locate a span in the original text that matches the extracted text, ignoring case and allowing delimiters (in this case, comma or dash) to be inserted anywhere:

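import re

original_text = "xyz, 19900 Praha 9 - Letnany"
extracted_text = "praha 9 letnany"

# Allow optional commas/dashes between any two characters, and let each
# space in the extracted text match a run of whitespace, commas, or dashes
# (a bare \s would fail on " - ", which has a space on both sides of the dash).
pat = "[,-]*".join(extracted_text).replace(" ", r"[\s,-]+")

mat = re.search(pat, original_text, re.I)
if mat:
    print(mat.span())
else:
    print("No match")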
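
To get the substring as well as the position, the span can be used to slice the original text. The following is a rough sketch rather than a definitive version: find_span is a hypothetical helper name, each space is allowed to match a run of whitespace, commas, or dashes so that the " - " variant from the question is covered, and the extracted text is assumed to contain only letters, digits, and spaces (no regex metacharacters).

```python
import re

def find_span(extracted_text, original_text):
    """Return (substring, start) for a character-level match, or None."""
    # Optional commas/dashes between any two characters; each space in
    # the extracted text matches one or more whitespace/comma/dash chars.
    pat = "[,-]*".join(extracted_text).replace(" ", r"[\s,-]+")
    mat = re.search(pat, original_text, re.I)
    if mat is None:
        return None
    start, end = mat.span()
    # Slice the original so the result keeps its case and delimiters.
    return original_text[start:end], start

comma_hit = find_span("praha 9 letnany", "xyz, 19900 Praha 9, Letnany")
dash_hit = find_span("praha 9 letnany", "xyz, 19900 Praha 9 - Letnany")
print(comma_hit)
print(dash_hit)
```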

Upvotes: 2
