Reputation: 185
I have a string that is a chain of emails, and I need to extract the name of each sender (the text after "From :"). Below is a sample of the emails:
str1 <- 'From : Wendy YEOW (SLA) To : [email protected] Subject : RE: OneService@S
From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : [email protected] Subject : RE: OneService@S
From: Siti Zaharah RAMAN (ARKS) Sent: Friday, 5 June, 2015 5:26 PM To : [email protected] Subject : RE: OneService@S
From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : [email protected] Subject : RE: OneService@S
From: Chin Hwang LAU (TA) Sent: Friday, 5 June, 2015 5:26 PM To : [email protected] Subject : RE: OneService@S'
I have the code below to extract the names:
str_extract_all(string=str1,pattern="\\b(From\\s*[:]+\\s*(\\w*))\\b")[[1]]
[1] "From : Wendy" "From: SLA" "From: Siti" "From: SLA" "From: Chin"
But my desired output is:
[1] "Wendy YEOW (SLA)" "SLA Enquiry (SLA)" "Siti Zaharah RAMAN (ARKS)" "SLA Enquiry (SLA)" "Chin Hwang LAU (TA)"
Upvotes: 2
Views: 526
Reputation: 81753
You can use strsplit. There's no need for gsub here.
strsplit(str1, "From ?: | (To|Sent) ?:.*?(\\nFrom ?: |$)")[[1]][-1]
# [1] "Wendy YEOW (SLA)" "SLA Enquiry (SLA)" "Siti Zaharah RAMAN (ARKS)"
# [4] "SLA Enquiry (SLA)" "Chin Hwang LAU (TA)"
The regex basically consists of two parts:
"From ?: ": This is the beginning of the string. The split returns an empty string and the rest of the original string.
" (To|Sent) ?:.*?(\\nFrom ?: |$)": This regex represents the text after the name. It includes the substring starting with "To" or "Sent" and ending with a line break ("\\n") followed by the next "From" or the end of the string ("$").
Finally, the [-1] is necessary to remove the empty string (preceding the first "From").
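To see why the trailing [-1] is needed, it can help to look at the raw split on a shorter input. This is only a minimal sketch: `demo` is a hypothetical two-message string, trimmed down from str1 for illustration, and `pattern`, `parts` and `senders` are names introduced here.

```r
# Hypothetical two-message sample, shortened from str1 for illustration
demo <- "From : Wendy YEOW (SLA) To : someone Subject : RE: x\nFrom: SLA Enquiry (SLA) Sent: Friday Subject : RE: x"

# Same pattern as in the answer above
pattern <- "From ?: | (To|Sent) ?:.*?(\\nFrom ?: |$)"

# Raw split: the first element is the empty string that precedes the first "From"
parts <- strsplit(demo, pattern)[[1]]

# Dropping that first element leaves only the sender names
senders <- parts[-1]
print(senders)
```

Here `parts` should have one more element than there are messages, which is the empty string that [-1] discards.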
Upvotes: 3
Reputation: 179588
Try this regular expression together with strsplit():
gsub("From *: (.*?) (To|Sent).*", "\\1", strsplit(str1, "\n")[[1]])
[1] "Wendy YEOW (SLA)"
[2] "SLA Enquiry (SLA)"
[3] "Siti Zaharah RAMAN (ARKS)"
[4] "SLA Enquiry (SLA)"
[5] "Chin Hwang LAU (TA)"
This works because I am using a back reference (\\1) to extract the text matched by the wildcard in the first set of parentheses.
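The back reference may be easier to follow on a single line. A minimal sketch, using one line copied from the sample in the question; the variable names `line`, `pattern`, `name` and `unchanged` are introduced here for illustration only.

```r
# One message line copied from the sample in the question
line <- "From: Siti Zaharah RAMAN (ARKS) Sent: Friday, 5 June, 2015 5:26 PM To : [email protected] Subject : RE: OneService@S"

# (.*?) lazily captures the text between "From :" and the first " To" or " Sent";
# the trailing .* consumes the rest, so the whole line is replaced by group 1
pattern <- "From *: (.*?) (To|Sent).*"
name <- gsub(pattern, "\\1", line)
print(name)

# A line that does not match the pattern is returned unchanged
unchanged <- gsub(pattern, "\\1", "no sender here")
print(unchanged)
```

Because gsub() leaves non-matching input untouched, any line in the chain that lacks a "From" header would come back as-is rather than as an empty string, so it may be worth filtering such lines first.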
Upvotes: 3
Reputation: 24520
Not very elegant, but you can try:
gsub(" *(From|To|Sent) *:? *","",regmatches(str1,gregexpr("From *:[^:]+",str1))[[1]])
#[1] "Wendy YEOW (SLA)" "SLA Enquiry (SLA)"
#[3] "Siti Zaharah RAMAN (ARKS)" "SLA Enquiry (SLA)"
#[5] "Chin Hwang LAU (TA)"
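The one-liner above can be read as two steps. The following is a rough sketch on a single, shortened line based on the sample; `line`, `raw`, `name` and `clipped` are names introduced here for illustration.

```r
# One message line, shortened from the sample in the question
line <- "From: Chin Hwang LAU (TA) Sent: Friday, 5 June, 2015 5:26 PM"

# Step 1: match "From", the colon, and everything up to (not including) the next colon
raw <- regmatches(line, gregexpr("From *:[^:]+", line))[[1]]
print(raw)

# Step 2: strip the "From:" prefix and the trailing " Sent" (or " To") keyword
name <- gsub(" *(From|To|Sent) *:? *", "", raw)
print(name)

# Caveat: step 2 removes these keywords anywhere, including inside a name
clipped <- gsub(" *(From|To|Sent) *:? *", "", "From: Tony TAN Sent")
print(clipped)
```

The last call shows a limitation of this approach: a name that happens to contain "To", "From" or "Sent" (such as the made-up "Tony TAN") loses those letters, so this variant is probably best kept for input where that cannot occur.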
Upvotes: 1