DTYK
DTYK

Reputation: 1200

Get smallest number in character string

I have a data frame consisting of columns that capture different events. Respondents fill in the ages they experience those events. For any given event which they have experienced multiple times, they separate the age experienced with a semi-colon (e.g. if they experience it at 5, 6, 7 years old, they will input 5; 6; 7 in that particular column). For events that they have not experienced, respondents leave it with a blank.

As there are more than twenty columns, I concatenated all the columns together into 1 single column, resulting in a character column. I want to extract the smallest number in that character string. I am unable to force the columns into a numeric data type as some events will be experienced by respondents multiple times and be interpreted by R as a character string (e.g. "5; 6; 7")

My data looks something like the following:

df <- data.frame(ID = c("001", "002", "003", "004"),
             concatenated = c("NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA",
                              "3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA",
                              "NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA",
                              "NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9"))

df$concatenated <- as.character(df$concatenated)

The end result I am trying to get is the following:

ID                           concatenated smallest_number
1 001         NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA               4
2 002    3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA               3
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA               2
4 004      NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9               4

Thanks! Much appreciated!

Upvotes: 2

Views: 678

Answers (4)

tmfmnk
tmfmnk

Reputation: 40011

With tidyverse and splitstackshape you can do:

df %>%
 mutate(temp = gsub(";", "_", concatenated),
        temp = gsub(" ", "", temp)) %>%
 cSplit("temp", sep = "_") %>%
 gather(var, val, -c(concatenated, ID)) %>%
 group_by(ID) %>%
 mutate(res = min(val, na.rm = TRUE)) %>%
 spread(var, val) %>%
 select(ID, concatenated, res)

  ID    concatenated                             res
  <fct> <chr>                                  <dbl>
1 001   NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA            4.
2 002   3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA       3.
3 003   NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA    2.
4 004   NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9         4.

First, it replaces ; with _ and splits the "concatenated" column according _. Second, it transforms the data from wide to long format and groups by "ID" column. Finally, it assess the minimum value and returns the data back to wide format.

Or with just tidyverse:

df %>% 
 mutate(temp = gsub(";", "_", concatenated),
        temp = gsub(" ", "", temp),
        temp = strsplit(temp, "_")) %>%
 unnest(temp) %>%
 group_by(ID) %>%
 mutate_if(is.character, as.numeric) %>%
 mutate(res = min(temp, na.rm = TRUE),
        rowid = row_number()) %>%
 spread(rowid, temp) %>%
 select(ID, concatenated , res)

  ID    concatenated                             res
  <fct> <fct>                                  <dbl>
1 001   NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA            4.
2 002   3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA       3.
3 003   NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA    2.
4 004   NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9         4.

Upvotes: 2

Hugh
Hugh

Reputation: 16089

Suppose your data is structured like:

DF <- data.frame(ID = 1:4,
                 age = c("5", "5;6;7", "20;15;12", "2;4"),
                 stringsAsFactors = FALSE)

You could use strsplit to split out each age as a number, then take the minimum in the usual way:

DF$min_age <- vapply(strsplit(DF$age, split = "[^0-9]"),
                     function(x) min(as.numeric(x), na.rm = TRUE),
                     double(1))

If numbers sometimes don't appear, just exclude those rows

i <- grep("[0-9]", DF$age)  # rows with numbers somewhere
DF$min_age <- NA_character_
DF$min_age[i] <- vapply(strsplit(DF$age[i], split = "[^0-9]"),
                        function(x) min(as.numeric(x), na.rm = TRUE),
                        double(1))

Upvotes: 1

s_baldur
s_baldur

Reputation: 33508

library(stringr)
df$smallest_number <- sapply(
  str_extract_all(df$concatenated, "[0-9]+"),
  function(x) min(as.integer(x))
)
df
   ID                           concatenated smallest_number
1 001         NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA               4
2 002    3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA               3
3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA               2
4 004      NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9               4

Upvotes: 1

moodymudskipper
moodymudskipper

Reputation: 47320

We can modify the elements with gsub to have a single underscore separated string for each item, and then use scan and min on them.

df$smallest_number <- sapply(df$concatenated, function(x){
  min(scan(text=gsub("; ","_",x), what = numeric(), sep="_"),na.rm=TRUE)})
df
#    ID                           concatenated smallest_number
# 1 001         NA_NA_NA_NA_5; 6_NA_4_NA_NA_NA               4
# 2 002    3_3_NA_NA_NA_3; 4; 5; 6_NA_NA_NA_NA               3
# 3 003 NA_5_4_2_NA_NA_NA_NA_6; 7; 8; 9; 10_NA               2
# 4 004      NA_NA_11_12_11_NA_4; 5; 6_NA_NA_9               4

Upvotes: 1

Related Questions