Reputation: 51
I would like to extract strings following certain words using data.table.
The words are: Subject:, From:, To:, Date:, and Message:
Expected Input: Subject: Welcome \r\nFrom: (Jane Doe) [email protected]\r\nTo: (Foo Bar) [email protected]\r\nDate: 1/1/2019 7:01:32 AM\r\n\r\n Sent from my iPhone\r\n\r\nBegin forwarded message:\r\n\r\nFrom: Mr. X
I have tried a few functions, but cannot get the code to pull only the first instance of the string and ignore subsequent strings. I am also having issues with capturing only the sections I am looking for.
library(data.table)
x<- as.data.table("Subject: Welcome \r\nFrom: (Jane Doe) [email protected]\r\nTo:
(Foo Bar) [email protected]\r\nDate: 1/1/2019 7:01:32 AM\r\n\r\n Sent from my iPhone\r\n\r\nBegin forwarded message:\r\n\r\nFrom: Mr. X <[email protected]","x1")
x[, Subject := sub('^.*Subject:\\s*|\\s*From:.*$', '', V1) ][]
x[, From := sub('^.*From:\\s*|\\s*To:.*$', '', V1) ][]
x[, To := sub('^.*To:\\s*|\\s*Date:.*$', '', V1) ][]
x[, Message := sub('^.*PM|AM\\s*|\\s*.*$', '', V1) ][]
x
Current Results: V1 Subject: Welcome \r\nFrom: (Jane Doe) [email protected]\r\nTo: \n (Foo Bar) [email protected]\r\nDate: 1/1/2019 7:01:32 AM\r\n\r\n Sent from my iPhone\r\n\r\nBegin forwarded message:\r\n\r\nFrom: Mr. X
From: Mr. X
From: Mr. X
Message: (blank)
Upvotes: 2
Views: 115
Reputation: 388962
We can use tidyr::extract to divide the data into 4 columns after using gsub to remove \r\n.
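library(dplyr)
x %>%
  mutate(V1 = gsub("\r|\n", "", V1)) %>%
  # [AP]M keeps the AM/PM choice local; a bare A|PM would split the whole pattern into two alternatives
  tidyr::extract(V1, into = c("Subject", "From", "To", "Date"),
                 regex = ".*Subject:(.*)From:(.*)To:(.*)Date:(.*)[AP]M.*")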
# Subject From To
#1 Welcome (Jane Doe) [email protected] (Foo Bar) [email protected]
# Date
# 1 1/1/2019 7:01:32
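For comparison, here is a minimal base R sketch of why the sub() calls in the question return the forwarded sender, and how a single capture group avoids that. The sample string and its e-mail addresses are hypothetical stand-ins for the question's input, and the variable names are only for illustration.

```r
# Hypothetical sample with the same layout as the question's input (addresses are made up)
msg <- "Subject: Welcome \r\nFrom: (Jane Doe) jane@example.com\r\nTo: (Foo Bar) foo@example.com\r\nDate: 1/1/2019 7:01:32 AM\r\n\r\nBegin forwarded message:\r\n\r\nFrom: Mr. X"

# Question's approach: the greedy '^.*From:' runs to the LAST "From:",
# and sub() replaces only that first match, so the To: alternative is never applied
greedy <- sub("^.*From:\\s*|\\s*To:.*$", "", msg)

# Capture-group approach: the lazy '.*?' stops at the FIRST "From:" and
# the whole string is replaced by the captured field; (?s) lets '.' cross line breaks
first_from <- sub("(?s)^.*?From:\\s*(.*?)\\s*To:.*$", "\\1", msg, perl = TRUE)

print(greedy)
print(first_from)
```

The same pattern shape should carry over to the other fields by swapping the two keywords around the capture group.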
Upvotes: 1
Reputation: 79208
You can use the base R strcapture function:
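# Empty prototype data frame: one character column per capture group
prot = data.frame(setNames(replicate(4, character()),
                           c("Subject", "From", "To", "Date")), stringsAsFactors = FALSE)
# Lazy (.*?) groups stop at the first occurrence of the next keyword
patt = "Subject:\\s*(.*?)\\s*From:\\s*(.*?)\\s*To:\\s*(.*?)\\s*Date:\\s*(.*(?:A|P)M)"
strcapture(patt, x$V1, prot)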
Subject From To Date
1 Welcome (Jane Doe) [email protected] (Foo Bar) [email protected] 1/1/2019 7:01:32 AM
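Since the question asks for data.table, the strcapture result can probably be assigned back by reference. This is only a sketch under a few assumptions: the sample row and its addresses are hypothetical, and I switched to perl = TRUE with (?s) and a lazy (.*?[AP]M) so the Date group stops at the first AM/PM, which is a variation on the pattern above rather than the same one.

```r
library(data.table)

# Hypothetical sample row mirroring the question's layout (addresses are made up)
x <- data.table(V1 = "Subject: Welcome \r\nFrom: (Jane Doe) jane@example.com\r\nTo: (Foo Bar) foo@example.com\r\nDate: 1/1/2019 7:01:32 AM\r\n\r\n Sent from my iPhone\r\n\r\nBegin forwarded message:\r\n\r\nFrom: Mr. X")

cols <- c("Subject", "From", "To", "Date")

# Empty prototype: one character column per capture group
prot <- data.frame(Subject = character(), From = character(),
                   To = character(), Date = character(),
                   stringsAsFactors = FALSE)

# (?s) lets '.' match line breaks under PCRE; lazy groups stop at the first keyword
patt <- "(?s)Subject:\\s*(.*?)\\s*From:\\s*(.*?)\\s*To:\\s*(.*?)\\s*Date:\\s*(.*?[AP]M)"

# strcapture returns a data.frame, which := accepts as a list of new columns
x[, (cols) := strcapture(patt, V1, prot, perl = TRUE)]

print(x[, ..cols])
```

Rows that do not match the pattern should come back as NA in all four columns, so it may be worth checking for those before relying on the result.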
Upvotes: 3