Reputation: 161
I have a data set in R studio (Aud) that looks like the following. ID is of type Character and Function is of type character as well
ID Function
F04 FZ000TTY WB002FR088DR011
F05 FZ000AGH WZ004ABD
F06 FZ0005ABD
my goal is to attempt and extract only the "FZ", "TTY", "WB", "FR", "WZ", "ABD" from all the rows in the data set and place them in a new unique column in the data set so that i have something like the following as an example
ID Function SUBFUN1 SUBFUN2 SUBFUN3 SUBFUN4 SUBFUN5
F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
I want to individualize the functions since they represent a certain behavior and that way i can plot per ID the behavior or functions which occur the most over a course of time
I tried the the following
Aud$Subfun1<-
ifelse(grepl("FZ",Aud$Functions.NO.)==T,"FZ", "Other"))
Aud$Subfun2<-
ifelse(grepl("TTY",Aud$Functions.NO.)==T,"TTY","Other"))
I get the error message below in my attempts for subfun1 & subfun2:
Error in `$<-.data.frame`(`*tmp*`, Subfun1, value = logical(0)) :
replacement has 0 rows, data has 343456
Error in `$<-.data.frame`(`*tmp*`, Subfun2, value = logical(0)) :
replacement has 0 rows, data has 343456
I also tried substring() but substring seems to require a start and an end for the character range that needs to be captured in the new column. This is not ideal as the codes FZ, TTY, WB, FR, WZ and ABD all appear at different parts of the function string
Any help would be greatly appreciated with this
Upvotes: 0
Views: 733
Reputation: 16842
A tidyverse
way that makes use of stringr::str_extract_all
to get a nested list of all occurrences of the search terms, then spreads into the wide format you have as your desired output. If you were extracting any sets of consecutive capital letters, you could use "[A-Z]+"
as your search term, but since you said it was these specific IDs, you need a more specific search term. If putting the regex becomes cumbersome, say if you have a vector of many of these IDs, you could paste it together and collapse by |
.
library(tidyverse)
Aud <- data_frame(
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")
)
search_terms <- "(FZ|TTY|WB|FR|WZ|ABD)"
Aud %>%
mutate(code = str_extract_all(Function, search_terms)) %>%
select(-Function) %>%
unnest(code) %>%
group_by(ID) %>%
mutate(subfun = row_number()) %>%
spread(key = subfun, value = code, sep = "")
#> # A tibble: 3 x 5
#> # Groups: ID [3]
#> ID subfun1 subfun2 subfun3 subfun4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 F04 FZ TTY WB FR
#> 2 F05 FZ WZ ABD <NA>
#> 3 F06 FZ ABD <NA> <NA>
Created on 2018-07-11 by the reprex package (v0.2.0).
Upvotes: 0
Reputation: 20095
One can use tidyr::separate
to divide Function
column in multiple columns using regex
as separator.
library(tidyverse)
df %>%
separate(Function, into = paste("V",1:5, sep=""),
sep = "([^[:alpha:]]+)", fill="right", extra = "drop")
# ID V1 V2 V3 V4 V5
# 1 F04 FZ TTY WB FR DR
# 2 F05 FZ AGH WZ ABD <NA>
# 3 F06 FZ ABD <NA> <NA> <NA>
([^[:alpha:]]+)
: Separate on anything other than alphabates
Data:
df <- read.table(text=
"ID Function
F04 'FZ000TTY WB002FR088DR011'
F05 'FZ000AGH WZ004ABD'
F06 FZ0005ABD",
header = TRUE, stringsAsFactors = FALSE)
Upvotes: 0
Reputation: 33488
Using data.table:
library(data.table)
Aud <- data.frame(
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
stringsAsFactors = FALSE
)
setDT(Aud)
cbind(Aud, Aud[, tstrsplit(Function, "[0-9]+| ")])
ID Function V1 V2 V3 V4 V5
1: F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
2: F05 FZ000AGH WZ004ABD FZ AGH WZ ABD <NA>
3: F06 FZ0005ABD FZ ABD <NA> <NA> <NA>
Staying in base R one could do something like the following:
our_split <- strsplit(Aud$Function, "[0-9]+| ")
cbind(
Aud,
do.call(rbind, lapply(our_split, "length<-", max(lengths(our_split))))
)
Upvotes: 3