Reputation: 305
I have a dataframe with the following structure:
Timestamp Value1 Value2 Problem1 Problem2
00:00 32 40 No No
00:05 12 55 No No
00:10 14 42 Yes No
00:15 50 33 Yes No
00:20 78 47 No No
Where Problem1
defines if there is a problem with Value1
, and Problem2
defines if there is a problem with Value2
. In case of having a Yes
in Problem1
, I'd need to replace Value1
by Value2
. In case of having problems in both, they should keep unchanged.
My problem here is that I won't know how many Value
and Problem
columns I'll have. So, in case of having more than 2, I'd need to replace the value with problems by the average of those values without problems.
So, in another example:
Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
00:00 32 40 45 No No No
00:05 12 55 48 No No No
00:10 14 42 55 Yes No No
00:15 50 33 13 Yes No Yes
00:20 78 47 75 No No No
Here I'd need to replace Value1
at 00:10
by the average of Value2
and Value3
. Also, I'd need to replace Value1
and Value3
at 00:15
by Value2
.
Upvotes: 1
Views: 364
Reputation: 16978
I bet there is a more elegant solution.
library(tidyr)
library(dplyr)
df %>%
mutate(across(starts_with("Problem"), ~ .x == "Yes")) %>%
pivot_longer(-Timestamp, names_to = c("name", "id"), names_pattern = "(.*)(\\d+)") %>%
pivot_wider() %>%
group_by(Timestamp) %>%
mutate(Value = case_when(sum(Problem) == 0 | sum(Problem) == n() | !Problem ~ Value,
TRUE~ sum(Value * (1 - Problem))/sum(1-Problem))) %>%
pivot_longer(cols=c("Value", "Problem")) %>%
mutate(name = paste0(name,id), .keep="unused") %>%
pivot_wider() %>%
ungroup() %>%
mutate(across(starts_with("Problem"), ~ ifelse(.x == 1, "Yes", "No")))
returns
# A tibble: 5 x 7
Timestamp Value1 Problem1 Value2 Problem2 Value3 Problem3
<time> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 00'00" 32 No 40 No 45 No
2 05'00" 12 No 55 No 48 No
3 10'00" 48.5 Yes 42 No 55 No
4 15'00" 33 Yes 33 No 33 Yes
5 20'00" 78 No 47 No 75 No
What approach did I use?
df <- structure(list(Timestamp = structure(c(0, 300, 600, 900, 1200
), class = c("hms", "difftime"), units = "secs"), Value1 = c(32,
12, 14, 50, 78), Value2 = c(40, 55, 42, 33, 47), Value3 = c(45,
48, 55, 13, 75), Problem1 = c("No", "No", "Yes", "Yes", "No"),
Problem2 = c("No", "No", "No", "No", "No"), Problem3 = c("No",
"No", "No", "Yes", "No")), problems = structure(list(row = 5L,
col = "Problem3", expected = "", actual = "embedded null",
file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(Timestamp = structure(list(format = ""), class = c("collector_time",
"collector")), Value1 = structure(list(), class = c("collector_double",
"collector")), Value2 = structure(list(), class = c("collector_double",
"collector")), Value3 = structure(list(), class = c("collector_double",
"collector")), Problem1 = structure(list(), class = c("collector_character",
"collector")), Problem2 = structure(list(), class = c("collector_character",
"collector")), Problem3 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Upvotes: 1
Reputation: 1079
I made a sample for the data:
df <- data.frame(value1 = runif(10, min = 0, max = 100),
value2 = runif(10, min = 0, max = 100),
value3 = runif(10, min = 0, max = 100))
df_problem <- data.frame(problem1 = sample(c('yes','no'), 10, replace = T),
problem2 = sample(c('yes','no'), 10, replace = T),
problem3 = sample(c('yes','no'), 10, replace = T))
See that I separated the values from the problems. Then:
df_problem[df_problem == 'yes'] <- 1
df_problem[df_problem == 'no'] <- NA
df_problem <- matrix(as.numeric(unlist(df_problem)), nrow = nrow(df)) #rebuild matrix
Finally:
df <- df * df_problem
for (i in 1:nrow(df)){
if (T %in% is.na(df[i,])){
df[i,c(which(is.na(df[i,])))] <- mean(unlist(df[i,]), na.rm = T)
}
}
df
Upvotes: 0
Reputation: 111
I use the data.table package. I call your data given in the second example "din". I hope this code helps you:
#I use the library data.table; get data in data.table format
library(data.table)
din <- data.table(din)
din[,Value1:=as.numeric(Value1)]
din[,Value2:=as.numeric(Value2)]
din[,Value3:=as.numeric(Value3)]
#set Values to NA if there is a Problem
din[Problem1=="Yes", Value1:=NA]
din[Problem2=="Yes", Value2:=NA]
din[Problem3=="Yes", Value3:=NA]
#print table with NA replaced if we have a Problem
#print(din)
# Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1: 00:00 32 40 45 No No No
#2: 00:05 12 55 48 No No No
#3: 00:10 NA 42 55 Yes No No
#4: 00:15 NA 33 NA Yes No Yes
#5: 00:20 78 47 75 No No No
#use the mean function to replace if I have an NA in the table (just working if Timestamp is a unique id, otherwise you need to generate one and use this in the by argument)
din[is.na(Value1), Value1:=mean(c(Value2,Value3), na.rm=T), by=Timestamp]
din[is.na(Value2), Value2:=mean(c(Value1,Value2), na.rm=T), by=Timestamp]
din[is.na(Value3), Value3:=mean(c(Value1,Value2), na.rm=T), by=Timestamp]
#print final table
#print(din)
# Timestamp Value1 Value2 Value3 Problem1 Problem2 Problem3
#1: 00:00 32.0 40 45 No No No
#2: 00:05 12.0 55 48 No No No
#3: 00:10 48.5 42 55 Yes No No
#4: 00:15 33.0 33 33 Yes No Yes
#5: 00:20 78.0 47 75 No No No
``
Upvotes: 0