Reputation: 1200
I have a data frame consisting of columns that capture different events. Respondents fill in the ages they experience those events. For any given event which they have experienced multiple times, they separate the age experienced with a semi-colon (e.g. if they experience it at 5, 6, 7 years old, they will input 5; 6; 7 in that particular column). For events that they have not experienced, respondents leave it with a blank.
As there are more than twenty columns, I concatenated all the columns together into 1 single column, resulting in a character column. I want to extract the smallest number in that character string. I am unable to force the columns into a numeric data type as some events will be experienced by respondents multiple times and be interpreted by R as a character string (e.g. "5; 6; 7")
My data looks something like the following:
df <- data.frame(ID = c("001", "002", "003", "004"),
concatenated = c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
"3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA",
"NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA",
"NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9"))
df$concatenated <- as.character(df$concatenated)
The end result I am trying to get is the following:
ID concatenated smallest_number
1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4
2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2
4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4
Thanks! Much appreciated!
Upvotes: 2
Views: 678
Reputation: 40011
With tidyverse
and splitstackshape
you can do:
df %>%
mutate(temp = gsub(";", "_", concatenated),
temp = gsub(" ", "", temp)) %>%
cSplit("temp", sep = "_") %>%
gather(var, val, -c(concatenated, ID)) %>%
group_by(ID) %>%
mutate(res = min(val, na.rm = TRUE)) %>%
spread(var, val) %>%
select(ID, concatenated, res)
ID concatenated res
<fct> <chr> <dbl>
1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4.
2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3.
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2.
4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4.
First, it replaces ;
with _
and splits the "concatenated" column according _
. Second, it transforms the data from wide to long format and groups by "ID" column. Finally, it assess the minimum value and returns the data back to wide format.
Or with just tidyverse
:
df %>%
mutate(temp = gsub(";", "_", concatenated),
temp = gsub(" ", "", temp),
temp = strsplit(temp, "_")) %>%
unnest(temp) %>%
group_by(ID) %>%
mutate_if(is.character, as.numeric) %>%
mutate(res = min(temp, na.rm = TRUE),
rowid = row_number()) %>%
spread(rowid, temp) %>%
select(ID, concatenated , res)
ID concatenated res
<fct> <fct> <dbl>
1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4.
2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3.
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2.
4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4.
Upvotes: 2
Reputation: 16089
Suppose your data is structured like:
DF <- data.frame(ID = 1:4,
age = c("5", "5;6;7", "20;15;12", "2;4"),
stringsAsFactors = FALSE)
You could use strsplit
to split out each age as a number, then take the minimum in the usual way:
DF$min_age <- vapply(strsplit(DF$age, split = "[^0-9]"),
function(x) min(as.numeric(x), na.rm = TRUE),
double(1))
If numbers sometimes don't appear, just exclude those rows
i <- grep("[0-9]", DF$age) # rows with numbers somewhere
DF$min_age <- NA_character_
DF$min_age[i] <- vapply(strsplit(DF$age[i], split = "[^0-9]"),
function(x) min(as.numeric(x), na.rm = TRUE),
double(1))
Upvotes: 1
Reputation: 33508
library(stringr)
df$smallest_number <- sapply(
str_extract_all(df$concatenated, "[0-9]+"),
function(x) min(as.integer(x))
)
df
ID concatenated smallest_number
1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4
2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2
4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4
Upvotes: 1
Reputation: 47320
We can modify the elements with gsub
to have a single underscore separated string for each item, and then use scan
and min
on them.
df$smallest_number <- sapply(df$concatenated, function(x){
min(scan(text=gsub("; ","_",x), what = numeric(), sep="_"),na.rm=TRUE)})
df
# ID concatenated smallest_number
# 1 001 NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA 4
# 2 002 3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA 3
# 3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA 2
# 4 004 NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9 4
Upvotes: 1