Reputation: 49
I have a vector of factor values. On the one hand, I want to correct values that seem to be misclassified. For instance, the "D" at position 7: since its neighbours are all "A", it should be "A" too. I assume there has to be a rule, for example: if the 3 values before and the 3 values after an outlier all differ from it, the outlier is converted (in this case "D" to "A"); otherwise it is removed, like the "C" at position 22.
Var = c("A", "A", "A", "A","A", "A", "D", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "B", "B", "C","C","C","C","C","C","C","C","C","C","D", "D","D","D","D","D","D","D", "A", "A", "A", "A","A", "A", "A", "A", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C","C","C","C","C","C","C","C", "C","C","C","C","C","C","C","C", "D","D","D","D","D")
Var= as.factor(Var)
Var2=c("1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","1",
"1","1","1","1","1")
df <- data.frame(Var, Var2)
Additionally, I want to count the occurrences of each value within each consecutive run. So I do not want the counts over the whole vector, but a list like the one below, ideally based on the corrected values.
# Var Occurrence
#1 A 6
#2 D 1
#3 A 4
#4 B 10
#5 C 1
#6 B 2 ...
So far I only manage to count the values over the whole vector, with
table(Var)
With the following code I get a column that restarts counting each time "Var" changes.
# write the counter to a new column (here called Count) so that the factor Var is not overwritten
df$Count <- sequence(rle(as.character(df$Var))$lengths)
Upvotes: 1
Views: 40
Reputation: 887088
This may be easier with data.table. Group by the rleid (run-length id) of 'Var' and get the count (.N), then remove the outlier observations by creating a logical expression in i (from the boxplot outliers)
library(data.table)
setDT(df)[, .N, .(Var, grp = rleid(Var))][, grp := NULL][
!N %in% boxplot(N, plot = FALSE)$out]
-output
Var N
1: A 6
2: D 1
3: A 4
4: B 10
5: C 1
6: B 2
7: C 10
8: D 8
9: A 12
10: B 12
11: C 16
12: D 5
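The single-element runs (D 1 and C 1) are still present in the output above, so the correction rule from the question may need a separate step. The following is only a minimal base R sketch of one possible reading of that rule: a run of length 1 is converted when all 3 values before it and all 3 values after it agree, and is dropped otherwise. The helper name fix_singletons and its argument k are made up for illustration, and Var is rebuilt here from the run lengths shown in the output rather than copied from the question.

```r
# Rebuild the question's Var from the run lengths shown in the output above
Var <- rep(c("A", "D", "A", "B", "C", "B", "C", "D", "A", "B", "C", "D"),
           times = c(6, 1, 4, 10, 1, 2, 10, 8, 12, 12, 16, 5))

# Hypothetical helper (not from the original post): handles runs of length 1
fix_singletons <- function(x, k = 3) {
  x <- as.character(x)
  n <- length(x)
  r <- rle(x)
  # end positions of all runs; keep only those of runs with length 1
  pos <- cumsum(r$lengths)[r$lengths == 1]
  keep <- rep(TRUE, n)
  out <- x
  for (p in pos) {
    # assumption: values without k neighbours on both sides are left untouched
    if (p - k < 1 || p + k > n) next
    nb <- x[c((p - k):(p - 1), (p + 1):(p + k))]
    if (length(unique(nb)) == 1) {
      out[p] <- nb[1]    # all 2k neighbours agree: convert to their value
    } else {
      keep[p] <- FALSE   # neighbours disagree: drop the value
    }
  }
  out[keep]
}

fixed <- fix_singletons(Var)

# Count consecutive occurrences on the corrected vector
runs <- rle(fixed)
res <- data.frame(Var = runs$values, Occurrence = runs$lengths)
print(res)
```

With this reading, the "D" at position 7 becomes "A" (giving a run of 11 "A"s) and the "C" at position 22 is dropped (giving a run of 12 "B"s). If a different window size or tie-breaking rule is intended, k and the condition inside the loop would need adjusting.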
rleid can take multiple input columns, as the first argument is variadic (...) - from ?rleid
rleid(..., prefix=NULL)
... A sequence of numeric, integer64, character or logical vectors, all of same length. For interactive use.
Therefore, if we have multiple columns, either pass the columns to rleid individually, or use rleidv with the subset of the data.frame/data.table as input
setDT(df)[, .N, .(Var, Var2, grp = rleid(Var, Var2))][,
grp := NULL][ !N %in% boxplot(N, plot = FALSE)$out]
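As a small illustration of the rleidv alternative mentioned above, the sketch below compares rleid with individually named columns against rleidv with a cols argument. The table dt is made-up toy data, not the df from the question, and the variable names id1, id2 and res are chosen only for this example.

```r
library(data.table)

# Toy data for illustration only (not the question's df)
dt <- data.table(Var  = c("A", "A", "A", "B", "B", "A"),
                 Var2 = c("1", "1", "2", "2", "2", "2"))

# rleid(): columns are passed one by one through the variadic argument
id1 <- dt[, rleid(Var, Var2)]

# rleidv(): the table is passed as a whole, columns are selected via cols
id2 <- rleidv(dt, cols = c("Var", "Var2"))

# Both should give the same run ids
print(identical(id1, id2))

# Run counts per (Var, Var2) run, using the rleidv ids as grouping key
res <- dt[, .N, by = .(Var, Var2, grp = id2)][, grp := NULL]
print(res)
```

A new id starts whenever either column changes, so the third row (same Var, different Var2) opens a new run here.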
Upvotes: 1