Jennifer Schleicher

Reputation: 51

Best method to extract the first instance of a string between specified keywords using data.table

I would like to extract the strings that follow each of these keywords, using data.table:

Subject: From: To: Date: Message:

Expected Input: Subject: Welcome \r\nFrom: (Jane Doe) [email protected]\r\nTo: (Foo Bar) [email protected]\r\nDate: 1/1/2019 7:01:32 AM\r\n\r\n Sent from my iPhone\r\n\r\nBegin forwarded message:\r\n\r\nFrom: Mr. X

I have tried a few functions, but I cannot get the code to pull only the first instance of each string and ignore later ones (for example, the second From: in the forwarded message). I am also having trouble capturing only the sections I am looking for.

library(data.table)

x<- as.data.table("Subject: Welcome \r\nFrom: (Jane Doe) [email protected]\r\nTo: 
                  (Foo Bar) [email protected]\r\nDate: 1/1/2019 7:01:32 AM\r\n\r\n Sent from my iPhone\r\n\r\nBegin forwarded message:\r\n\r\nFrom: Mr. X <[email protected]","x1")

x[, Subject := sub('^.*Subject:\\s*|\\s*From:.*$', '', V1) ][]
x[, From := sub('^.*From:\\s*|\\s*To:.*$', '', V1) ][]
x[, To := sub('^.*To:\\s*|\\s*Date:.*$', '', V1) ][]
x[, Message := sub('^.*PM|AM\\s*|\\s*.*$', '', V1) ][]

x

Current Results: V1 Subject: Welcome \r\nFrom: (Jane Doe) [email protected]\r\nTo: \n (Foo Bar) [email protected]\r\nDate: 1/1/2019 7:01:32 AM\r\n\r\n Sent from my iPhone\r\n\r\nBegin forwarded message:\r\n\r\nFrom: Mr. X

From: Mr. X

From: Mr. X

Message: (blank)

Upvotes: 2

Views: 115

Answers (2)

Ronak Shah

Reputation: 388962

We can use tidyr::extract to split the data into four columns, after first using gsub to remove the \r and \n characters.

library(dplyr)

x %>%
   mutate(V1 = gsub("\r|\n", "", V1)) %>%
   tidyr::extract(V1, into = c("Subject", "From", "To", "Date"), 
          # [AP]M instead of A|PM: an ungrouped | would split the whole pattern in two
          regex = ".*Subject:(.*)From:(.*)To:(.*)Date:(.*)[AP]M.*")

#   Subject                                 From                                  To
#1  Welcome   (Jane Doe) [email protected]   (Foo Bar) [email protected]

#                Date
# 1  1/1/2019 7:01:32 

Upvotes: 1

Onyambu

Reputation: 79208

You can use the base R strcapture function:

prot = data.frame(setNames(replicate(4, character()),
               c("Subject", "From", "To", "Date")), stringsAsFactors = FALSE)

patt = "Subject:\\s*(.*?)\\s*From:\\s*(.*?)\\s*To:\\s*(.*?)\\s*Date:\\s*(.*(?:A|P)M)"

strcapture(patt,x$V1,prot)

  Subject                               From                                To                Date
1 Welcome (Jane Doe) [email protected] (Foo Bar) [email protected] 1/1/2019 7:01:32 AM
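
Since the question asks for data.table specifically, here is a rough sketch of how the same strcapture idea might be assigned back by reference with :=. The email addresses are made-up placeholders (the originals are obfuscated in the post), and cols is just a helper name for this example. It also assumes perl = TRUE with the (?s) flag so that . can cross line breaks; the lazy (.*?) groups are what keep each field at the first keyword that follows it, so the second From: in the forwarded message is ignored.

```r
library(data.table)

# Sample row; the addresses are hypothetical placeholders, not the real ones.
x <- data.table(V1 = paste0(
  "Subject: Welcome \r\nFrom: (Jane Doe) jane@example.com\r\n",
  "To: (Foo Bar) foo@example.com\r\nDate: 1/1/2019 7:01:32 AM\r\n\r\n",
  " Sent from my iPhone\r\n\r\nBegin forwarded message:\r\n\r\n",
  "From: Mr. X <x@example.com>"
))

cols <- c("Subject", "From", "To", "Date")

# Zero-row prototype that tells strcapture the output column names and types.
prot <- as.data.frame(setNames(rep(list(character()), length(cols)), cols),
                      stringsAsFactors = FALSE)

# (?s) lets . match newlines under PCRE; lazy groups stop at the first keyword.
patt <- "(?s)Subject:\\s*(.*?)\\s*From:\\s*(.*?)\\s*To:\\s*(.*?)\\s*Date:\\s*(.*?[AP]M)"

# Add the four captured fields as new columns by reference.
x[, (cols) := strcapture(patt, V1, prot, perl = TRUE)]

x[, ..cols]
```

Rows that do not match the pattern should come back as NA in all four columns, so it is probably worth checking for those before relying on the result.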

Upvotes: 3
