Dataholic

Reputation: 123

Extract text between a string and new line character (\n) using regex

I have a text file and want to extract the text between two strings ("StartString" and "EndString" in the example below) whenever a given substring occurs between them. There may be multiple such instances in the file. After that, I want to extract the first occurrence of "someID" in each instance. For example:

Example text file (text_data)

 
ghguy  hja  StartString I want this text (1) if substring 1 lies in between the two strings someID: abcd_efgh
ghsjgsajhgj someID: dgfshgj
EndString bhghk [jhbn] xxzh StartString I want this text (2) as a different variable if substring 2 lies in between the two strings ghdsdjsagdsh someID: fhcb7hkhb
ghjxcgsydgsdycgsjxcskcsal someID: ghyoet_fstj
EndString ghjyjgu   

Output:

first_variable = I want this text (1) if substring 1 lies in between the two strings someID: abcd_efgh ghsjgsajhgj someID: dgfshgj

second_variable = I want this text (2) as a different variable if substring 2 lies in between the two strings ghdsdjsagdsh someID: fhcb7hkhb ghjxcgsydgsdycgsjxcskcsal someID: ghyoet_fstj

first occurrence of someID in first_variable = abcd_efgh

first occurrence of someID in second_variable = fhcb7hkhb

I tried extracting the first variable as:

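import re

target1 = 'StartString'
target2 = 'EndString'
# Lazy match between the two markers; DOTALL lets '.' cross newlines
pat1 = '{}(.+?){}'.format(target1, target2)
pattern = re.compile(pat1, flags=re.DOTALL)
# findall returns a list with one string per StartString...EndString block
first_variable = pattern.findall(text_data)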

I have no clue how to extract the first occurrence of someID in each instance. Can anyone help me out with this?

Upvotes: 1

Views: 1398

Answers (1)

CinCout

Reputation: 9619

Do this:

StartString[\s\S]*?someID:\s*(\S*)\b[\s\S]*?EndString

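This looks for the first someID: between StartString and EndString and captures the token that follows it, up to a word boundary. The lazy [\s\S]*? matches any characters, newlines included, so the capture group holds the first ID in each block.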
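As a rough sketch of how this could be used from Python, assuming the sample text from the question is held in text_data: the first part applies the pattern above with re.findall, and the second part is a two-step variant that is my own assumption rather than part of the answer. The names block_pattern, some_id and extract_blocks are hypothetical.

```python
import re

# Sample text reproduced from the question.
text_data = (
    "ghguy  hja  StartString I want this text (1) if substring 1 lies in "
    "between the two strings someID: abcd_efgh\n"
    "ghsjgsajhgj someID: dgfshgj\n"
    "EndString bhghk [jhbn] xxzh StartString I want this text (2) as a "
    "different variable if substring 2 lies in between the two strings "
    "ghdsdjsagdsh someID: fhcb7hkhb\n"
    "ghjxcgsydgsdycgsjxcskcsal someID: ghyoet_fstj\n"
    "EndString ghjyjgu   \n"
)

# Pattern from the answer: the single capture group holds the first someID
# found after each StartString.
id_pattern = re.compile(r"StartString[\s\S]*?someID:\s*(\S*)\b[\s\S]*?EndString")
first_ids = id_pattern.findall(text_data)
print(first_ids)  # ['abcd_efgh', 'fhcb7hkhb']

# Assumed two-step variant (hypothetical names, not from the answer):
# capture each block first, then search for someID inside that block only.
block_pattern = re.compile(r"StartString([\s\S]*?)EndString")
some_id = re.compile(r"someID:\s*(\S+)")


def extract_blocks(text):
    """Return (block_text, first_id_or_None) for each StartString...EndString span."""
    results = []
    for match in block_pattern.finditer(text):
        # Collapse newlines and repeated spaces so the block reads as one line.
        block = " ".join(match.group(1).split())
        found = some_id.search(block)
        results.append((block, found.group(1) if found else None))
    return results


blocks = extract_blocks(text_data)
for block, first_id in blocks:
    print(first_id, "|", block)
```

One caveat with the single pattern: if a block contains no someID: at all, the lazy match can run past that block's EndString and pick up an ID from a later block. The two-step variant avoids this by searching within each captured block, at the cost of a second pass.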


Upvotes: 1
