Loons22
Loons22

Reputation: 373

Removing all characters before and after text in R, then creating columns from the new text

So I have a string that I'm attempting to parse through and then create 3 columns with the data I extract. From what I've seen, stringr doesn't really cover this case and the gsub I've used so far is excessive and involves me making multiple columns, parsing from those new columns, and then removing them and that seems really inefficient.

The format is this:

"blah, grabbed by ???-??-?????."

I need this:

???-??-?????

I've used placeholders here, but this is how the string typically looks

"blah, grabbed by PHI-80-J.Matthews."

or

"blah, grabbed by NE-5-J.Mills."

and sometimes there is text after the name like this:

"blah, grabbed by KC-10-T.Hill. Blah blah blah."

This is what I would like the end result to be:

Place Number Name
PHI 80 J.Matthews
NE 5 J.Mills
KC 10 T. Hill

Edit for further explanation:

Most strings include other people in the same format so "downed by" needs to be incorporated in someway to make sure it is grabbing the right name.

Ex.

"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"

Desired Output:

Place Number Name
KC 10 T. Hill

Upvotes: 2

Views: 542

Answers (3)

TarJae
TarJae

Reputation: 78917

Here is an alternative way using sub with regex "([A-Za-z]+\\.[A-Za-z]+).*", "\\1" that removes the string after the second point. separate that splits the string by by, and finally again separate to get the desired columns.

library(dplyr)
library(tidyr)

df1 %>% 
    mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>% 
    separate(test1, c('remove', 'keep'), sep = " by ") %>% 
    separate(keep, c("Place", "Number", "Name"), sep = "-") %>% 
    select(Place, Number, Name)

Output:

  Place Number Name      
  <chr> <chr>  <chr>     
1 PHI   80     J.Matthews
2 NE    5      J.Mills   
3 KC    10     T.Hill

Upvotes: 2

akrun
akrun

Reputation: 887028

This solution simply extract the components based on the logic OP mentioned i.e. capture the characters that are needed as three groups - 1) one or more upper case letter ([A-Z]+) followed by a dash (-), 2) then one or more digits (\\d+), and finally 3) non-whitespace characters (\\S+) that follow the dash

library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"), 
      ".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)

-ouputt

# A tibble: 4 x 3
  Place Number Name      
  <chr>  <int> <chr>     
1 PHI       80 J.Matthews
2 NE         5 J.Mills   
3 KC        10 T.Hill    
4 KC        10 T.Hill    

Or do this in base R

read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1", 
    df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep='-')
  Place Number       Name
1   PHI     80 J.Matthews
2    NE      5    J.Mills
3    KC     10     T.Hill

data

df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.", 
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.", 
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))

Upvotes: 3

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21400

This solution actually does what you say in the title, namely first remove the text around the the target substring, then split it into columns:

library(tidyr)
library(stringr)
df1 %>%
  mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
  separate(col1, 
           into = c("Place", "Number", "Name"), 
           sep = "-")
# A tibble: 3 x 3
  Place Number Name      
  <chr> <chr>  <chr>     
1 PHI   80     J.Matthews
2 NE    5      J.Mills   
3 KC    10     T.Hill 

Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).

Upvotes: 2

Related Questions