Learner

Reputation: 691

extract newline text from string

I have a string like the one below:

string=" (2021-07-04 11:58:43 PM BST)
--- len (Tradition ) says to sen Hi yohan

(2021-07-05 12:04:42 AM BST)
--- len (Tradition) says to yohan okay -5 / 0 .

(2021-07-04 11:47:14 PM BST)
--- Ke Ch says to Hano hello

(2021-07-05 12:09:41 AM BST)
--- len says to yohan sen yes -5 / 0 TN -- / +2.5



Processed by wokl Archive for son malab | 2021-07-05 12:26:44 AM
BST  
---"

I want to extract the text that comes after "says to" and before the next timestamp.

Expected output:

text=['yohan sen Hi yohan','yohan sen okay -5 / 0 ','Han Cho hello','sen yes -5 / 0 TN -- / +2.5']

What I have tried:

text=re.findall(r'\bsays to (.*(?:\n(?!\(\d|---).*?)*?)\s*\n(?:\(\d|---)', string)

Upvotes: 3

Views: 564

Answers (3)

viv3k

Reputation: 623

(?<=says\sto)[\s\S]*?(?=\(\d{4}-\d{2}-\d{2}\s(\d\d:){2}\d{2}\s\w{2}\s\w{3}\))

You can use look-ahead and look-behind assertions for this. To solve your problem, you need one look-behind, which is 'says to', and one look-ahead, which is the date pattern.

  • Syntax for look-behind is (?<=fixed_width_regex). In Python's re module the look-behind pattern must have a fixed width.
  • Syntax for look-ahead is (?=regex). The look-ahead pattern does not need a fixed width.

So essentially what you are looking for would look something like this:

   look-behind  |        pattern          |  look-ahead
________________|_________________________|__________________
                |                         |
(?<=(says\sto)) |  match_everything_here  | (?=date_pattern)

This is equivalent to the regex at the top of this answer.

You can play around with the solution in regex101 here: https://regex101.com/r/rPFDo9/1/
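As a rough Python sketch of this approach (the `sample` string and the variable names are made up for illustration, not taken from the question): the only change from the pattern above is that the time group is written as a non-capturing group, because `re.findall` returns the group contents instead of the whole match when the pattern contains a capturing group.

```python
import re

# Hypothetical, shortened sample in the same shape as the question's data.
sample = (
    "(2021-07-04 11:58:43 PM BST)\n"
    "--- len (Tradition ) says to sen Hi yohan\n"
    "\n"
    "(2021-07-05 12:04:42 AM BST)\n"
    "--- len (Tradition) says to yohan okay -5 / 0 .\n"
    "\n"
    "(2021-07-04 11:47:14 PM BST)\n"
    "--- Ke Ch says to Hano hello\n"
)

pattern = re.compile(
    r"(?<=says\sto)"   # look-behind: fixed width, required by Python's re
    r"[\s\S]*?"        # lazily match anything, newlines included
    # look-ahead: the next "(YYYY-MM-DD HH:MM:SS AM BST)" style timestamp
    r"(?=\(\d{4}-\d{2}-\d{2}\s(?:\d\d:){2}\d{2}\s\w{2}\s\w{3}\))"
)

# Strip the leading space and trailing newlines left around each match.
lookaround_matches = [m.strip() for m in pattern.findall(sample)]
print(lookaround_matches)
```

Note that a message is only returned when a parenthesised timestamp follows it, so in this sample the final message ("Hano hello") is not matched.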

Upvotes: 1

RavinderSingh13

Reputation: 133458

With your shown samples, please try the following Python code. It was written and tested in Python 3.

import re
# Define the variable `string` with the question's text first; it is omitted here because it is long.
var1 = re.findall(r'says\s+[^(]*',string,re.M)

The above creates a list named var1 whose elements each end with newlines. To remove them, use Python's strip function:

var1 = list(map(lambda s: s.strip(), var1))

Now print all the elements of the var1 list:

for element in var1:
    print (element)

Explanation: re.findall is used with the regex says\s+[^(]*, which matches from says, followed by one or more whitespace characters, up to (but not including) the next occurrence of (. Note that each match starts with the words "says to" themselves.
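If the leading "says to" should not appear in the results, one possible variation (not part of the code above) is to put the wanted text in a capturing group. This is only a sketch: the `sample` string and the names `captured` and `messages` are invented for illustration.

```python
import re

# Hypothetical, shortened sample in the same shape as the question's data.
sample = (
    "(2021-07-04 11:58:43 PM BST)\n"
    "--- len (Tradition ) says to sen Hi yohan\n"
    "\n"
    "(2021-07-05 12:04:42 AM BST)\n"
    "--- len (Tradition) says to yohan okay -5 / 0 .\n"
    "\n"
    "(2021-07-04 11:47:14 PM BST)\n"
    "--- Ke Ch says to Hano hello\n"
)

# With one capturing group, findall returns only the group:
# everything after "says to" up to the next "(" or the end of the string.
captured = re.findall(r"says\s+to\s+([^(]*)", sample)

# Remove the trailing newlines left at the end of each capture.
messages = [text.strip() for text in captured]
print(messages)
```

Because the capture stops at the first "(", a message that itself contains an opening parenthesis would be cut short at that point.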

Upvotes: 2

anubhava

Reputation: 784998

You may use this regex:

says\s+to\s+((?:.+\n)+)

RegEx Demo

RegEx Details:

  • says\s+to\s+: Matches says to followed by one or more whitespace characters
  • ((?:.+\n)+): Matches one or more non-empty lines, each ending in a newline, and captures them in group #1

Python Code:

matches = re.findall(r'says\s+to\s+((?:.+\n)+)', string)
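A small self-contained run of this regex might look like the following sketch. The `sample` string is invented for illustration (including a two-line message to show the multi-line capture), and the final `strip` step is an addition to drop the trailing newline that each capture keeps.

```python
import re

# Hypothetical sample; the second message deliberately spans two lines.
sample = (
    "(2021-07-04 11:58:43 PM BST)\n"
    "--- len (Tradition ) says to sen Hi yohan\n"
    "\n"
    "(2021-07-04 11:47:14 PM BST)\n"
    "--- Ke Ch says to Hano hello\n"
    "second line\n"
    "\n"
    "(2021-07-05 12:09:41 AM BST)\n"
)

# Group #1 collects consecutive non-empty lines after "says to";
# the first blank line ends the capture.
raw = re.findall(r"says\s+to\s+((?:.+\n)+)", sample)

# Each capture ends with "\n", so strip it off.
line_matches = [text.strip() for text in raw]
print(line_matches)
```

This relies on a blank line separating each message from the next timestamp, and on the last captured line ending with a newline; text on a final line without a trailing newline would not be included.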

Upvotes: 1
