vbala2014
vbala2014

Reputation: 161

R studio - using grepl() to grab specific characters and populate a new column in the dataframe

I have a data set in R studio (Aud) that looks like the following. ID is of type Character and Function is of type character as well

ID                         Function
F04                        FZ000TTY WB002FR088DR011
F05                        FZ000AGH WZ004ABD
F06                        FZ0005ABD

my goal is to attempt and extract only the "FZ", "TTY", "WB", "FR", "WZ", "ABD" from all the rows in the data set and place them in a new unique column in the data set so that i have something like the following as an example

ID     Function                  SUBFUN1  SUBFUN2  SUBFUN3  SUBFUN4 SUBFUN5
F04    FZ000TTY WB002FR088DR011  FZ       TTY      WB       FR      DR

I want to individualize the functions since they represent a certain behavior and that way i can plot per ID the behavior or functions which occur the most over a course of time

I tried the the following

Aud$Subfun1<-
ifelse(grepl("FZ",Aud$Functions.NO.)==T,"FZ", "Other"))

Aud$Subfun2<-
ifelse(grepl("TTY",Aud$Functions.NO.)==T,"TTY","Other"))

I get the error message below in my attempts for subfun1 & subfun2:

Error in `$<-.data.frame`(`*tmp*`, Subfun1, value = logical(0)) : 
  replacement has 0 rows, data has 343456

 Error in `$<-.data.frame`(`*tmp*`, Subfun2, value = logical(0)) : 
      replacement has 0 rows, data has 343456

I also tried substring() but substring seems to require a start and an end for the character range that needs to be captured in the new column. This is not ideal as the codes FZ, TTY, WB, FR, WZ and ABD all appear at different parts of the function string

Any help would be greatly appreciated with this

Upvotes: 0

Views: 733

Answers (3)

camille
camille

Reputation: 16842

A tidyverse way that makes use of stringr::str_extract_all to get a nested list of all occurrences of the search terms, then spreads into the wide format you have as your desired output. If you were extracting any sets of consecutive capital letters, you could use "[A-Z]+" as your search term, but since you said it was these specific IDs, you need a more specific search term. If putting the regex becomes cumbersome, say if you have a vector of many of these IDs, you could paste it together and collapse by |.

library(tidyverse)
Aud <- data_frame(
  ID = c("F04", "F05", "F06"), 
  Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")
)

search_terms <- "(FZ|TTY|WB|FR|WZ|ABD)"

Aud %>%
  mutate(code = str_extract_all(Function, search_terms)) %>%
  select(-Function) %>%
  unnest(code) %>%
  group_by(ID) %>%
  mutate(subfun = row_number()) %>%
  spread(key = subfun, value = code, sep = "")
#> # A tibble: 3 x 5
#> # Groups:   ID [3]
#>   ID    subfun1 subfun2 subfun3 subfun4
#>   <chr> <chr>   <chr>   <chr>   <chr>  
#> 1 F04   FZ      TTY     WB      FR     
#> 2 F05   FZ      WZ      ABD     <NA>   
#> 3 F06   FZ      ABD     <NA>    <NA>

Created on 2018-07-11 by the reprex package (v0.2.0).

Upvotes: 0

MKR
MKR

Reputation: 20095

One can use tidyr::separate to divide Function column in multiple columns using regex as separator.

library(tidyverse)

df %>% 
  separate(Function, into = paste("V",1:5, sep=""),  
           sep = "([^[:alpha:]]+)", fill="right", extra = "drop")

#    ID V1  V2   V3   V4   V5
# 1 F04 FZ TTY   WB   FR   DR
# 2 F05 FZ AGH   WZ  ABD <NA>
# 3 F06 FZ ABD <NA> <NA> <NA>

([^[:alpha:]]+) : Separate on anything other than alphabates

Data:

df <- read.table(text=
"ID                         Function
F04                        'FZ000TTY WB002FR088DR011'
F05                        'FZ000AGH WZ004ABD'
F06                        FZ0005ABD",
header = TRUE, stringsAsFactors = FALSE)

Upvotes: 0

s_baldur
s_baldur

Reputation: 33488

Using data.table:

library(data.table)
Aud <- data.frame(
  ID = c("F04", "F05", "F06"), 
  Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
  stringsAsFactors = FALSE
)
setDT(Aud)

cbind(Aud, Aud[, tstrsplit(Function, "[0-9]+| ")])
    ID                 Function V1  V2   V3   V4   V5
1: F04 FZ000TTY WB002FR088DR011 FZ TTY   WB   FR   DR
2: F05        FZ000AGH WZ004ABD FZ AGH   WZ  ABD <NA>
3: F06                FZ0005ABD FZ ABD <NA> <NA> <NA>

Staying in base R one could do something like the following:

our_split <- strsplit(Aud$Function, "[0-9]+| ")

cbind(
  Aud,
  do.call(rbind, lapply(our_split, "length<-", max(lengths(our_split))))
)

Upvotes: 3

Related Questions