new_learner

Reputation: 77

extract names from string

I have a string like this:

s="(2021-07-29 01:00:00 AM BST)  
---  
peter.j.matthew has joined the conversation  
  
  

(2021-07-29 01:00:00 AM BST)  
---  
john cheung has joined the conversation  
  
  


(2021-07-29 01:11:19 AM BST)  
---  
allen.p.jonas  
Hi, james  
  
  
(2021-07-30 12:51:16 AM BST)  
---  
karren wenda  
how're you ? 
  
  
  
---  
  
* * *"

I want to extract only the names of people who posted a message:

names_list= ['allen.p.jonas','karren wenda']

What I have tried:

names_list=re.findall(r'--- [\S\n](\D+ [\S\n])',s)

Upvotes: 2

Views: 492

Answers (3)

The fourth bird

Reputation: 163207

If you only want to match ['allen.p.jonas','karren wenda'], you can require that the line after the name starts with a non-whitespace character:

^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S

The pattern matches:

  • ^ Start of a line (the pattern is used with re.M)
  • --- Match ---
  • [^\S\n]*\n Match optional spaces and a newline
  • (\S.*?) Capture group 1 (returned by re.findall): match a non-whitespace char followed by as few chars as possible
  • [^\S\r\n]* Match optional whitespace chars without a newline
  • \n\S Match a newline and a non whitespace char


For example

print(re.findall(r"^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S", s, re.M))

Output

['allen.p.jonas', 'karren wenda']
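
As a rough sketch, the same pattern can be written in verbose mode so that each part of the breakdown sits next to its comment. The string below is a shortened stand-in for the one in the question (an assumption for illustration), and the name NAME_LINE is made up here:

```python
import re

# Shortened stand-in for the question's string (assumed sample, not the full input).
s = (
    "(2021-07-29 01:00:00 AM BST)  \n"
    "---  \n"
    "peter.j.matthew has joined the conversation  \n"
    "  \n"
    "(2021-07-29 01:11:19 AM BST)  \n"
    "---  \n"
    "allen.p.jonas  \n"
    "Hi, james  \n"
    "  \n"
    "(2021-07-30 12:51:16 AM BST)  \n"
    "---  \n"
    "karren wenda  \n"
    "how're you ? \n"
    "  \n"
    "---  \n"
    "  \n"
    "* * *"
)

# re.X ignores whitespace and comments in the pattern; re.M lets ^ match at every line start.
NAME_LINE = re.compile(
    r"""
    ^---            # three dashes at the start of a line
    [^\S\n]*\n      # optional trailing spaces/tabs, then the line break
    (\S.*?)         # group 1: the name, starting with a non-whitespace char
    [^\S\r\n]*\n    # optional trailing spaces/tabs, then the line break
    \S              # the following line must start with non-whitespace
    """,
    re.M | re.X,
)

names = NAME_LINE.findall(s)
print(names)  # ['allen.p.jonas', 'karren wenda']
```

The "has joined the conversation" lines are skipped only because the line after them is blank or starts with a space, so this variant depends on that layout.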

To explicitly exclude lines that contain has joined the conversation you can use a negative lookahead:

^---[^\S\n]*\n(?!.*\bhas joined the conversation\b)(\S.*?)[^\S\r]*$


For example:

print(re.findall(r"^---[^\S\n]*\n(?!.*\bhas joined the conversation\b)(\S.*?)[^\S\r]*$", s, re.M))

Output

['allen.p.jonas', 'karren wenda']

Upvotes: 1

Tim Biegeleisen

Reputation: 520918

This answer assumes that you want to find names on lines that do not contain the text has joined the conversation:

names = re.findall(r'\(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [AP]M [A-Z]{3}\)\s+---\s+\r?\n((?:(?!\bhas joined the conversation).)+?)[ ]*\r?\n', s)
print(names)  # ['allen.p.jonas', 'karren wenda']

The salient portion of the regex is this:

((?:(?!\bhas joined the conversation).)+?)[ ]*\r?\n

This captures a name without matching has joined the conversation by using a tempered dot trick. It matches one character at a time on the line containing the name, making sure that the conversation text does not appear anywhere, until reaching the CR?LF at the end of the line.
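
To see the tempered dot in isolation, here is a minimal sketch that applies only that part of the regex to single lines. The list of lines and the variable names are assumptions made for the demo, and `$` stands in for the line ending:

```python
import re

# Tempered dot: consume one character at a time, but only while the
# phrase does not start at the current position.
tempered = re.compile(r"^((?:(?!\bhas joined the conversation).)+?)[ ]*$")

lines = [
    "allen.p.jonas  ",
    "karren wenda",
    "john cheung has joined the conversation  ",
]

kept = []
for line in lines:
    m = tempered.match(line)
    if m:
        # group(1) is the name without trailing spaces
        kept.append(m.group(1))

print(kept)  # ['allen.p.jonas', 'karren wenda']
```

The third line is rejected because the dot is not allowed to step onto the "h" of "has joined the conversation", so the end of the line can never be reached.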

Upvotes: 1

Tranbi

Reputation: 12701

Supposing you want to match names that are not followed by "has joined the conversation":

name_pattern = re.compile(r'---\s*\n(\w(?:[\w\. ](?!has joined the conversation))*?)\s*\n', re.MULTILINE)
print(re.findall(name_pattern, s))

Explanation:

  • ---\s*\n matches the dashes, possibly followed by whitespace, and a required newline

  • Then comes our matching group composed of:

    • \w starts with a 'word' character (a-z, A-Z, 0-9 or _)
    • (?:[\w\. ](?!has joined the conversation))*? a non-capturing group that repeats \w, . or a space, each not immediately followed by "has joined the conversation". Because *? is lazy instead of greedy, the group stops as soon as the rest of the pattern (optional trailing whitespace and a newline) can match.

Output:

['allen.p.jonas', 'karren wenda']
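
Two details of this pattern may be worth checking, sketched below on a small made-up sample. First, re.MULTILINE only changes how ^ and $ behave, and the pattern uses neither, so the flag should make no difference. Second, the character class [\w\. ] does not include characters such as a hyphen, so a name like the hypothetical "anne-marie" would not be captured:

```python
import re

PATTERN = r'---\s*\n(\w(?:[\w\. ](?!has joined the conversation))*?)\s*\n'

# Made-up sample in the same layout as the question's string.
sample = (
    "---  \n"
    "john cheung has joined the conversation  \n"
    "  \n"
    "---  \n"
    "karren wenda  \n"
    "how're you ? \n"
)

with_flag = re.findall(PATTERN, sample, re.MULTILINE)
without_flag = re.findall(PATTERN, sample)
print(with_flag, without_flag)  # ['karren wenda'] ['karren wenda']

# Hypothetical name with a hyphen: '-' is not in [\w\. ], so nothing is captured.
hyphenated = re.findall(PATTERN, "---  \nanne-marie  \nhello\n")
print(hyphenated)  # []
```

If such names can occur in the real data, the character class would need to be widened accordingly.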

Upvotes: 0
