MustardRecord
MustardRecord

Reputation: 305

R: replace value of columns for other columns based on condition

I have a dataframe with the following structure:

Timestamp      Value1    Value2    Problem1    Problem2
    00:00          32        40          No          No
    00:05          12        55          No          No
    00:10          14        42         Yes          No
    00:15          50        33         Yes          No
    00:20          78        47          No          No

Where Problem1 defines if there is a problem with Value1, and Problem2 defines if there is a problem with Value2. In case of having a Yes in Problem1, I'd need to replace Value1 by Value2. In case of having problems in both, they should keep unchanged.

My problem here is that I won't know how many Value and Problem columns I'll have. So, in case of having more than 2, I'd need to replace the value with problems by the average of those values without problems.

So, in another example:

Timestamp      Value1    Value2    Value3    Problem1    Problem2    Problem3
    00:00          32        40        45          No          No          No
    00:05          12        55        48          No          No          No
    00:10          14        42        55         Yes          No          No
    00:15          50        33        13         Yes          No         Yes
    00:20          78        47        75          No          No          No

Here I'd need to replace Value1 at 00:10 by the average of Value2 and Value3. Also, I'd need to replace Value1 and Value3 at 00:15 by Value2.

Upvotes: 1

Views: 364

Answers (3)

Martin Gal
Martin Gal

Reputation: 16978

I bet there is a more elegant solution.

library(tidyr)
library(dplyr)

df %>%
  mutate(across(starts_with("Problem"), ~ .x == "Yes")) %>%
  pivot_longer(-Timestamp, names_to = c("name", "id"), names_pattern = "(.*)(\\d+)") %>% 
  pivot_wider() %>%
  group_by(Timestamp) %>%
  mutate(Value = case_when(sum(Problem) == 0 | sum(Problem) == n() | !Problem ~ Value,
                               TRUE~ sum(Value * (1 - Problem))/sum(1-Problem))) %>%
  pivot_longer(cols=c("Value", "Problem")) %>% 
  mutate(name = paste0(name,id), .keep="unused") %>%
  pivot_wider() %>% 
  ungroup() %>% 
  mutate(across(starts_with("Problem"), ~ ifelse(.x == 1, "Yes", "No")))

returns

# A tibble: 5 x 7
  Timestamp Value1 Problem1 Value2 Problem2 Value3 Problem3
  <time>     <dbl> <chr>     <dbl> <chr>     <dbl> <chr>   
1 00'00"      32   No           40 No           45 No      
2 05'00"      12   No           55 No           48 No      
3 10'00"      48.5 Yes          42 No           55 No      
4 15'00"      33   Yes          33 No           33 Yes     
5 20'00"      78   No           47 No           75 No 

What approach did I use?

  • Transform your Problem Variable into a boolean. R is able to use booleans in calculations, technically it is transformed later into a double.
  • Turn your value/problem numbers into a id, so for every timestamp there are several rows for Value and Problem.
  • Calculate the new value based on the number of problems and if the value is problematic.
  • Restore the shape of your data.frame.

Data

df <- structure(list(Timestamp = structure(c(0, 300, 600, 900, 1200
), class = c("hms", "difftime"), units = "secs"), Value1 = c(32, 
12, 14, 50, 78), Value2 = c(40, 55, 42, 33, 47), Value3 = c(45, 
48, 55, 13, 75), Problem1 = c("No", "No", "Yes", "Yes", "No"), 
    Problem2 = c("No", "No", "No", "No", "No"), Problem3 = c("No", 
    "No", "No", "Yes", "No")), problems = structure(list(row = 5L, 
    col = "Problem3", expected = "", actual = "embedded null", 
    file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -5L), spec = structure(list(
    cols = list(Timestamp = structure(list(format = ""), class = c("collector_time", 
    "collector")), Value1 = structure(list(), class = c("collector_double", 
    "collector")), Value2 = structure(list(), class = c("collector_double", 
    "collector")), Value3 = structure(list(), class = c("collector_double", 
    "collector")), Problem1 = structure(list(), class = c("collector_character", 
    "collector")), Problem2 = structure(list(), class = c("collector_character", 
    "collector")), Problem3 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))

Upvotes: 1

Pedro Alencar
Pedro Alencar

Reputation: 1079

I made a sample for the data:

df <- data.frame(value1 = runif(10, min = 0, max = 100),
                 value2 = runif(10, min = 0, max = 100),
                 value3 = runif(10, min = 0, max = 100))

df_problem <- data.frame(problem1 = sample(c('yes','no'), 10, replace = T),
                problem2 = sample(c('yes','no'), 10, replace = T),
                problem3 = sample(c('yes','no'), 10, replace = T))

See that I separated the values from the problems. Then:

df_problem[df_problem == 'yes'] <- 1
df_problem[df_problem == 'no'] <- NA
df_problem <- matrix(as.numeric(unlist(df_problem)), nrow = nrow(df)) #rebuild matrix

Finally:

df <- df * df_problem

for (i in 1:nrow(df)){
  if (T %in% is.na(df[i,])){
    df[i,c(which(is.na(df[i,])))] <- mean(unlist(df[i,]), na.rm = T)
  }
}

df

Upvotes: 0

wiebke
wiebke

Reputation: 111

I use the data.table package. I call your data given in the second example "din". I hope this code helps you:

#I use the library data.table; get data in data.table format
library(data.table)
din <- data.table(din)
din[,Value1:=as.numeric(Value1)]
din[,Value2:=as.numeric(Value2)]
din[,Value3:=as.numeric(Value3)]

#set Values to NA if there is a Problem
din[Problem1=="Yes", Value1:=NA]
din[Problem2=="Yes", Value2:=NA]
din[Problem3=="Yes", Value3:=NA]

#print table with NA replaced if we have a Problem
#print(din)
#   Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1:     00:00     32     40     45       No       No       No
#2:     00:05     12     55     48       No       No       No
#3:     00:10     NA     42     55      Yes       No       No
#4:     00:15     NA     33     NA      Yes       No      Yes
#5:     00:20     78     47     75       No       No       No

#use the mean function to replace if I have an NA in the table (just working if Timestamp is a unique id, otherwise you need to generate one and use this in the by argument)
din[is.na(Value1), Value1:=mean(c(Value2,Value3), na.rm=T), by=Timestamp]
din[is.na(Value2), Value2:=mean(c(Value1,Value2), na.rm=T), by=Timestamp]
din[is.na(Value3), Value3:=mean(c(Value1,Value2), na.rm=T), by=Timestamp]

#print final table
#print(din)
#   Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1:     00:00   32.0     40     45       No       No       No
#2:     00:05   12.0     55     48       No       No       No
#3:     00:10   48.5     42     55      Yes       No       No
#4:     00:15   33.0     33     33      Yes       No      Yes
#5:     00:20   78.0     47     75       No       No       No

``

Upvotes: 0

Related Questions