Reputation: 77
I have a string like this:
s = """(2021-07-29 01:00:00 AM BST)
---
peter.j.matthew has joined the conversation
(2021-07-29 01:00:00 AM BST)
---
john cheung has joined the conversation
(2021-07-29 01:11:19 AM BST)
---
allen.p.jonas
Hi, james
(2021-07-30 12:51:16 AM BST)
---
karren wenda
how're you ?
---
* * *"""
I want to extract the names as:
names_list= ['allen.p.jonas','karren wenda']
What I have tried:
names_list=re.findall(r'--- [\S\n](\D+ [\S\n])',s)
Upvotes: 2
Views: 492
Reputation: 163207
If you only want to match ['allen.p.jonas', 'karren wenda'], you can require a non-whitespace character at the start of the line that follows the name:
^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S
The pattern matches:
^
Start of a line (with re.M)
---
Match --- literally
[^\S\n]*\n
Match optional whitespace other than a newline, then a newline
(\S.*?)
Capture group 1 (the value returned by re.findall): a non-whitespace char followed by as few chars as possible
[^\S\r\n]*
Match optional whitespace chars other than newlines
\n\S
Match a newline and a non-whitespace char
For example:
print(re.findall(r"^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S", s, re.M))
Output
['allen.p.jonas', 'karren wenda']
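As a rough illustration of why the trailing \n\S matters, here is a small sketch on made-up data (the sample and compact strings are assumptions, not the asker's exact input). It suggests the pattern only skips the join notices when they are followed by a blank line; when the next timestamp follows immediately, those lines are matched as well.

```python
import re

# Hypothetical sample: the join notice is followed by a blank line,
# while the name is followed directly by a message line.
sample = (
    "(2021-07-29 01:00:00 AM BST)\n"
    "---\n"
    "peter.j.matthew has joined the conversation\n"
    "\n"
    "(2021-07-29 01:11:19 AM BST)\n"
    "---\n"
    "allen.p.jonas\n"
    "Hi, james\n"
)

pattern = re.compile(r"^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S", re.M)

# With the blank line, \n\S cannot match after the join notice.
names = pattern.findall(sample)
print(names)

# Same text with the blank line removed: the "(" of the next timestamp
# now satisfies \S, so the join notice is captured too.
compact = sample.replace("\n\n", "\n")
names_compact = pattern.findall(compact)
print(names_compact)
```

If the input may or may not contain those blank lines, excluding the phrase explicitly is probably the safer route.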
To explicitly exclude lines that contain "has joined the conversation", you can use a negative lookahead:
^---[^\S\n]*\n(?!.*\bhas joined the conversation\b)(\S.*?)[^\S\r]*$
For example:
print(re.findall(r"^---[^\S\n]*\n(?!.*\bhas joined the conversation\b)(\S.*?)[^\S\r]*$", s, re.M))
Output
['allen.p.jonas', 'karren wenda']
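A minimal sketch of how the lookahead version could be wrapped for reuse, assuming made-up sample data (NAME_RE, extract_names and the sample string are hypothetical names introduced here for illustration):

```python
import re

# Same pattern as above, compiled once. The negative lookahead rejects
# the line after "---" if it contains the join phrase anywhere.
NAME_RE = re.compile(
    r"^---[^\S\n]*\n(?!.*\bhas joined the conversation\b)(\S.*?)[^\S\r]*$",
    re.M,
)


def extract_names(text):
    """Return the lines that directly follow '---' and are not join notices."""
    return NAME_RE.findall(text)


# Hypothetical sample: no blank lines, and trailing spaces after the name.
sample = (
    "(2021-07-29 01:00:00 AM BST)\n"
    "---\n"
    "john cheung has joined the conversation\n"
    "(2021-07-30 12:51:16 AM BST)\n"
    "---\n"
    "karren wenda  \n"
    "how're you ?\n"
)

names = extract_names(sample)
print(names)  # ['karren wenda']
```

Note that, as written, the pattern captures any non-blank line that directly follows a --- line and lacks the phrase, so a separator such as * * * placed right after --- would be returned as well.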
Upvotes: 1
Reputation: 520918
This answer assumes that you want to find the names on lines that do not contain the text "has joined the conversation":
import re
names = re.findall(r'\(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [AP]M [A-Z]{3}\)\s+---\s*\r?\n((?:(?!\bhas joined the conversation).)+?)[ ]*\r?\n', s)
print(names) # ['allen.p.jonas', 'karren wenda']
The salient portion of the regex is this:
((?:(?!\bhas joined the conversation).)+?)[ ]*\r?\n
This captures a name without matching "has joined the conversation" by using a tempered dot trick. It matches one character at a time on the line containing the name, making sure that the "has joined the conversation" text does not start at any of those positions, until reaching the CR?LF at the end of the line.
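To see the tempered dot in isolation, here is a small sketch that applies just that part of the regex to single lines (line_re and name_or_none are hypothetical helpers, not part of the answer above):

```python
import re

# Tempered dot: consume one character at a time, but only while the
# forbidden phrase does not start at the current position.
line_re = re.compile(r"((?:(?!\bhas joined the conversation).)+?)[ ]*")


def name_or_none(line):
    """Return the name on a line, or None if the line holds the join phrase."""
    m = line_re.fullmatch(line)
    return m.group(1) if m else None


print(name_or_none("allen.p.jonas"))                            # allen.p.jonas
print(name_or_none("karren wenda   "))                          # karren wenda
print(name_or_none("john cheung has joined the conversation"))  # None
```

Because the repetition is lazy and followed by [ ]*, trailing spaces end up outside the captured group.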
Upvotes: 1
Reputation: 12701
Supposing you want to match names that are not followed by "has joined the conversation":
name_pattern = re.compile(r'---\s*\n(\w(?:[\w\. ](?!has joined the conversation))*?)\s*\n', re.MULTILINE)
print(re.findall(name_pattern, s))
Explanation:
---\s*\n
matches the dashes, possibly followed by whitespace, and a required newline
Then comes our capturing group, composed of:
\w
starts with a 'word' character (a-z, A-Z, 0-9 or _)
(?:[\w\. ](?!has joined the conversation))*?
a non-capturing group that repeats \w, . or a space, each not followed by "has joined the conversation". The capture goes on until the trailing whitespace and newline that end the line (*? makes the repetition lazy instead of greedy).
Output:
['allen.p.jonas', 'karren wenda']
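One thing to keep in mind with this approach: the character class [\w\. ] only allows word characters, dots and spaces, so a name containing anything else (a hyphen, for instance) would be skipped. A quick sketch on made-up data (the sample string and the name mary-ann are hypothetical):

```python
import re

name_pattern = re.compile(
    r'---\s*\n(\w(?:[\w\. ](?!has joined the conversation))*?)\s*\n'
)

# Hypothetical sample: one plain name and one name with a hyphen.
sample = (
    "---\n"
    "allen.p.jonas\n"
    "Hi\n"
    "---\n"
    "mary-ann\n"
    "hello\n"
)

found = name_pattern.findall(sample)
print(found)  # ['allen.p.jonas']; 'mary-ann' is not matched
```

If such names can occur, the class would need to be widened accordingly, for example to [\w\.\- ].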
Upvotes: 0