Reputation: 218
I have a dataset that I have grouped by a Gene column. Some of the values collapsed into each row are just ., so I remove them, leaving only a few numeric values per row and column. To do this I am using:
library(data.table)

# Group by Gene, collapsing each column's values into one string per gene
data <- setDT(df2)[, lapply(.SD, paste, collapse = ", "), by = Gene]

# Remove ., from anywhere in the data frame
dat <- data.frame(lapply(data, function(x) {
  gsub("\\.,|\\.$|\\,$|(, .$)", "", x)
}))
My data before removing ., and after grouping by Gene looks like:
Gene col1 col2 col3 col4
ACE 0.3, 0.4, 0.5, 0.5 . ., ., . 1, 1, 1, 1, 1
NOS2 ., . . ., ., ., . 0, 0, 0, 0, 0
BRCA1 . ., . 1, 1, 1, 1, 1
HER2 . 0.1, ., ., 0.2, 0.1 . 1, 1, 1, 1, 1
After removing ., my data looks like:
Gene   col1                col2           col3  col4
ACE    0.3, 0.4, 0.5, 0.5                       1, 1, 1, 1, 1
NOS2                                            0, 0, 0, 0, 0
BRCA1                                           1, 1, 1, 1, 1
HER2                       0.1, 0.2, 0.1        1, 1, 1, 1, 1
I am now trying to select the minimum or maximum value per row and column.
Expected example output:
Gene   col1  col2  col3  col4
ACE    0.5               1
NOS2                     0
BRCA1                    1
HER2         0.1         1
#For col1 I need the max value per row (so for ACE 0.5 is selected)
#For col2 I need the min value per row
Note that my actual data has 100 columns and 20,000 rows, and different columns need either the max or the min value selected per gene.
However, with the code I use I only get the expected output for col4; my other columns repeat the selected value twice (I am getting 0.5, 0.5 and 0.1, 0.1, and I can't figure out why).
The code I am using to select min/max values is:
library(dplyr)
library(stringr)

# Max value per feature and row
max2 <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)
getmax <- function(col) {
  str_extract_all(col, "[0-9\\.-]+") %>%
    lapply(., function(x) max2(as.numeric(x))) %>%
    unlist()
}

# Min value per feature and row
min2 <- function(x) if (all(is.na(x))) NA else min(x, na.rm = TRUE)
getmin <- function(col) {
  str_extract_all(col, "[0-9\\.-]+") %>%
    lapply(., function(x) min2(as.numeric(x))) %>%
    unlist()
}

data <- dt %>%
  mutate_at(names(dt)[2], getmax)
data <- dt %>%
  mutate_at(names(dt)[3], getmin)
data <- dt %>%
  mutate_at(names(dt)[4], getmax)
Why aren't these selection functions working for all my columns? All columns are character class. I'm also wondering whether I even need to remove ., at all, or whether I can jump straight to selecting the max/min value per row and column.
Example input data:
structure(list(Gene = c("ACE", "NOS2", "BRCA1", "HER2"), col1 = c("0.3, 0.4, 0.5, 0.5",
"", "", ""), col2 = c("", "", "", " 0.1, 0.2 0.,1"), col3 = c(NA,
NA, NA, NA), col4 = c(" 1, 1, 1, 1, 1",
" 0, 0, 0, 0, 0", " 1, 1, 1, 1, 1",
" 1, 1, 1, 1, 1")), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))
Upvotes: 1
Views: 453
Reputation: 8844
You can use type.convert and set its argument na.strings to ".". You may also want to use the range function to get both min and max in one shot.
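As a minimal sketch of that idea on a single cell (the value of cell is a made-up example taken from the shape of the question's data, and the variable names are only illustrative), base R is enough to see what type.convert and range do:

```r
# One collapsed cell, as it might look after grouping (illustrative value)
cell <- "0.1, ., ., 0.2, 0.1"

# Split on commas and strip the surrounding whitespace
parts <- trimws(strsplit(cell, ",", fixed = TRUE)[[1]])

# "." is read as NA; the remaining strings are parsed as numbers
vals <- type.convert(parts, na.strings = ".", as.is = TRUE)

# range() returns c(min, max); na.rm = TRUE skips the placeholders
rng <- range(vals, na.rm = TRUE)
print(rng)
```

Here rng[1] is the minimum and rng[2] is the maximum of the numeric values in the cell, so the . placeholders never need to be stripped with gsub first.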
Assume that your data.table looks like this
> dt
Gene col1 col2 col3 col4
1: ACE 0.3, 0.4, 0.5, 0.5 . ., ., . 1, 1, 1, 1, 1
2: NOS2 ., . . ., ., ., . 0, 0, 0, 0, 0
3: BRCA1 . ., . 1, 1, 1, 1, 1
4: HER2 . 0.1, ., ., 0.2, 0.1 . 1, 1, 1, 1, 1
Consider a function like this
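library(data.table)
library(stringr)

get_range <- function(x) {
  # Split each cell into a character matrix; "." (and padding "") become NA.
  # as.is = TRUE avoids the warning newer R versions give when it is unset.
  x <- type.convert(str_split(x, ",\\s+", simplify = TRUE),
                    na.strings = ".", as.is = TRUE)
  # For each row, return c(min, max) of the non-missing values
  x <- t(apply(x, 1L, function(i) {
    i <- i[!is.na(i)]
    if (length(i) < 1L) c(NA_real_, NA_real_) else range(i)
  }))
  dimnames(x)[[2L]] <- c("min", "max")
  x
}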
Then you can just call
dt[, c(Gene = .(Gene), lapply(.SD, get_range)), .SDcols = -"Gene"]
Output
Gene col1.min col1.max col2.min col2.max col3.min col3.max col4.min col4.max
1: ACE 0.3 0.5 NA NA NA NA 1 1
2: NOS2 NA NA NA NA NA NA 0 0
3: BRCA1 NA NA NA NA NA NA 1 1
4: HER2 NA NA 0.1 0.2 NA NA 1 1
Note that there is no need to do it by Gene, as the function get_range is already vectorised.
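If you only want one value per column (max for some columns, min for others, as in the question), something along these lines might work. This is a rough sketch rather than a tested solution for the full 100-column data: pick_extreme is a hypothetical helper (a simpler variant, not get_range), and max_cols / min_cols are assumed vectors of column names that you would fill in yourself.

```r
library(data.table)

# Hypothetical helper: per-cell min or max of a comma-separated character column
pick_extreme <- function(x, fun = max) {
  vapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    # "." and empty strings cannot be parsed and become NA
    vals <- suppressWarnings(as.numeric(trimws(parts)))
    vals <- vals[!is.na(vals)]
    if (length(vals) == 0L) NA_real_ else fun(vals)
  }, numeric(1))
}

# Small illustrative table in the same shape as the question's data
dt <- data.table(
  Gene = c("ACE", "HER2"),
  col1 = c("0.3, 0.4, 0.5, 0.5", "."),
  col2 = c(".", "0.1, ., ., 0.2, 0.1")
)

max_cols <- "col1"  # assumed: columns that need the maximum
min_cols <- "col2"  # assumed: columns that need the minimum

res <- copy(dt)     # keep the original table untouched
res[, (max_cols) := lapply(.SD, pick_extreme, fun = max), .SDcols = max_cols]
res[, (min_cols) := lapply(.SD, pick_extreme, fun = min), .SDcols = min_cols]
print(res)
```

Each selected column is replaced by a numeric column with one value per gene, and cells that contained only . come out as NA.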
Upvotes: 1