How to select max numeric value out of numeric characters?

Question

I have a dataset where I have grouped by a Gene column. Some values grouped into each row are just ., so I remove them, leaving only several numeric characters per row and column.

To do this I am using:

#Group by Gene:
data <- setDT(df2)[, lapply(.SD, paste, collapse = ", "), by = Genes]

#Remove ., from anywhere in the dataframe
dat <- data.frame(lapply(data, function(x) {
  gsub("\\.,|\\.$|,$|(, .$)", "", x)
}))

My data before removing ., and after grouping by Gene looks like:

Gene    col1                     col2                  col3           col4
ACE     0.3, 0.4, 0.5, 0.5       .                      ., ., .        1, 1, 1, 1, 1
NOS2    ., .                     .                      ., ., ., .     0, 0, 0, 0, 0
BRCA1   .                                               ., .           1, 1, 1, 1, 1
HER2    .                        0.1, ., .,  0.2, 0.1   .              1, 1, 1, 1, 1

After removing ., my data looks like:

Gene    col1                 col2               col3     col4
ACE     0.3, 0.4, 0.5, 0.5                               1, 1, 1, 1, 1
NOS2                                                     0, 0, 0, 0, 0
BRCA1                                                    1, 1, 1, 1, 1
HER2                         0.1,      0.2, 0.1          1, 1, 1, 1, 1

I am now trying to select the minimum or maximum value per row and column.

Expected example output:

Gene    col1                 col2            col3    col4
ACE     0.5                                           1
NOS2                                                  0
BRCA1                                                 1
HER2                          0.1                     1

#For col1 I need the max value per row (so for ACE 0.5 is selected)
#For col2 I need the min value per row

Note that my actual data has 100 columns and 20,000 rows; different columns need either the max or the min value selected per gene.

However, with the code I use I only get the expected output for col4; my other columns repeat the selected value twice (I get 0.5, 0.5 and 0.1, 0.1, and I can't figure out why).

The code I am using to select min/max values is:

#Max value per feature and row
max2 = function(x) if(all(is.na(x))) NA else max(x,na.rm = T)
getmax = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)max2(as.numeric(x)) ) %>%
  unlist() 

#Min value per feature and row
min2 = function(x) if(all(is.na(x))) NA else min(x,na.rm = T)
getmin = function(col) str_extract_all(col,"[0-9\\.-]+") %>%
  lapply(.,function(x)min2(as.numeric(x)) ) %>%
  unlist() 

data <- dt %>%
  mutate_at(names(dt)[2],getmax)

data <- dt %>%
  mutate_at(names(dt)[3],getmin)

data <- dt %>%
  mutate_at(names(dt)[4],getmax)

Why aren't these selection functions working for all my columns? All columns are character class. I'm also wondering if I even need to remove ., at all and can just jump straight to selecting the max/min value per row and column?

Example input data:

structure(list(Gene = c("ACE", "NOS2", "BRCA1", "HER2"), col1 = c("0.3, 0.4, 0.5, 0.5", 
"", "", ""), col2 = c("", "", "", "  0.1,      0.2 0.,1"), col3 = c(NA, 
NA, NA, NA), col4 = c("                         1, 1, 1, 1, 1", 
"                                     0, 0, 0, 0, 0", "                                     1, 1, 1, 1, 1", 
"     1, 1, 1, 1, 1")), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

ekoam · Accepted Answer

You can use type.convert and set its argument na.strings to ".". You may also want to use the range function to get both min and max in one shot.
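As a minimal sketch of that idea on a single cell (base R only; the variable names `cell`, `parts` and `vals` are just for illustration), splitting on commas and passing `na.strings = "."` turns the placeholder dots into `NA`, so `range()` can ignore them:

```r
# One grouped cell, as it looks before any "." has been removed
cell <- "0.1, ., .,  0.2, 0.1"

# Split on commas and trim the surrounding whitespace
parts <- trimws(strsplit(cell, ",", fixed = TRUE)[[1]])

# "." is read as NA; the remaining strings are converted to numeric
vals <- type.convert(parts, na.strings = ".", as.is = TRUE)
vals                      # 0.1 NA NA 0.2 0.1

# Minimum and maximum in one call, skipping the NAs
range(vals, na.rm = TRUE) # 0.1 0.2
```

This suggests the separate step that strips `.` with `gsub` is probably not needed, since the dots can be treated as missing values during conversion.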

Assume that your data.table looks like this

> dt
    Gene               col1                 col2       col3          col4
1:   ACE 0.3, 0.4, 0.5, 0.5                    .    ., ., . 1, 1, 1, 1, 1
2:  NOS2               ., .                    . ., ., ., . 0, 0, 0, 0, 0
3: BRCA1                  .                            ., . 1, 1, 1, 1, 1
4:  HER2                  . 0.1, ., .,  0.2, 0.1          . 1, 1, 1, 1, 1

Consider a function like this

library(data.table)
library(stringr)

get_range <- function(x) {
  x <- type.convert(str_split(x, ",\\s+", simplify = TRUE), na.strings = ".")
  x <- t(apply(x, 1L, function(i) {
    i <- i[!is.na(i)]
    if (length(i) < 1L) c(NA_real_, NA_real_) else range(i)
  }))
  dimnames(x)[[2L]] <- c("min", "max")
  x
}

Then you can simply run

dt[, c(Gene = .(Gene), lapply(.SD, get_range)), .SDcols = -"Gene"]

Output

    Gene col1.min col1.max col2.min col2.max col3.min col3.max col4.min col4.max
1:   ACE      0.3      0.5       NA       NA       NA       NA        1        1
2:  NOS2       NA       NA       NA       NA       NA       NA        0        0
3: BRCA1       NA       NA       NA       NA       NA       NA        1        1
4:  HER2       NA       NA      0.1      0.2       NA       NA        1        1

Note that there is no need to do it by Gene as the function get_range is already vectorised.
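If only one value per column is wanted (max for some columns, min for others, as in the question), a base R variant might look like the sketch below. It is not part of the answer above: `pick_value`, `max_cols` and `min_cols` are hypothetical names, and the small `df` only mirrors the example data from the question.

```r
# Hypothetical helper: reduce each comma-separated cell to a single number.
# Cells that hold only ".", empty strings or NA yield NA.
pick_value <- function(col, fun) {
  vapply(strsplit(as.character(col), ",", fixed = TRUE), function(parts) {
    parts <- trimws(parts)
    nums <- as.numeric(parts[!is.na(parts) & !parts %in% c("", ".")])
    if (length(nums) == 0L) NA_real_ else fun(nums)
  }, numeric(1))
}

# Example data modelled on the question (col3 omitted for brevity)
df <- data.frame(
  Gene = c("ACE", "NOS2", "BRCA1", "HER2"),
  col1 = c("0.3, 0.4, 0.5, 0.5", "., .", ".", "."),
  col2 = c(".", ".", "", "0.1, ., .,  0.2, 0.1"),
  col4 = c("1, 1, 1, 1, 1", "0, 0, 0, 0, 0", "1, 1, 1, 1, 1", "1, 1, 1, 1, 1"),
  stringsAsFactors = FALSE
)

max_cols <- c("col1", "col4")  # assumed: columns that need the maximum
min_cols <- "col2"             # assumed: columns that need the minimum

# Overwrite each character column with its per-row max or min
df[max_cols] <- lapply(df[max_cols], pick_value, fun = max)
df[min_cols] <- lapply(df[min_cols], pick_value, fun = min)
df
```

With 100 columns, the two name vectors would list which columns take the max and which take the min; all updates are applied to the same object, so no result is overwritten by a later step.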
