mcfly

Reputation: 49

How to count occurrences of a variable each time it occurs and remove outliers in R

I have a vector. On the one hand, I want to remove factor values that seem to be classified incorrectly, for instance the "D" at position 7. As its surroundings are "A", this should be "A" too. I assume there must be a rule, for example: if the 3 values before and the 3 values after an outlier all differ from it, it is converted to the surrounding value (in this case "D" to "A"); otherwise it is removed, like the "C" at position 22.

Var = c("A", "A", "A", "A","A", "A", "D", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "B", "B", "C","C","C","C","C","C","C","C","C","C","D", "D","D","D","D","D","D","D", "A", "A", "A", "A","A", "A", "A", "A", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C","C","C","C","C","C","C","C", "C","C","C","C","C","C","C","C", "D","D","D","D","D")

Var= as.factor(Var)



   Var2=c("1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", 
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1",  "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","1",
 "1","1","1","1","1")

df <- data.frame(Var, Var2)

Additionally, I want to count the occurrences of each value every time it occurs. So I do not want to count the occurrences over the whole vector, but get a list like this, ideally with the corrected values.

#   Var Occurrence
#1  A 6
#2  D 1
#3  A 4
#4  B 10
#5  C 1
#6  B 2 ...

So far I can only count the values over the whole vector, with

table(Var)

With the following code I get a column that restarts counting each time "Var" changes.

df$run_pos <- sequence(rle(as.character(df$Var))$lengths)

Upvotes: 1

Views: 40

Answers (1)

akrun

Reputation: 887088

This may be easier with data.table. Group by the rleid (run-length id) of 'Var' to get the count of each run (.N), then drop the outlier runs with a logical expression in i built from the boxplot outliers of N.

library(data.table)
setDT(df)[, .N, .(Var, grp = rleid(Var))][, grp := NULL][
   !N %in% boxplot(N, plot = FALSE)$out]

-output

    Var  N
 1:   A  6
 2:   D  1
 3:   A  4
 4:   B 10
 5:   C  1
 6:   B  2
 7:   C 10
 8:   D  8
 9:   A 12
10:   B 12
11:   C 16
12:   D  5
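
For comparison, the same run-by-run counts can be obtained without data.table. The following is only a base R sketch using rle(), offered as an alternative rather than as part of the data.table approach; the vector is rebuilt with rep() from the run lengths shown in the output, and the column name Occurrence follows the question.

```r
# Rebuild the question's vector compactly from its run values and run lengths
Var <- rep(c("A", "D", "A", "B", "C", "B", "C", "D", "A", "B", "C", "D"),
           times = c(6, 1, 4, 10, 1, 2, 10, 8, 12, 12, 16, 5))

# rle() collapses consecutive repeats into (value, length) pairs
runs <- rle(Var)

# One row per run, in order of appearance
run_counts <- data.frame(Var = runs$values, Occurrence = runs$lengths)
print(run_counts)
```

If Var is a factor, as in the question, wrap it in as.character() before calling rle().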

rleid can take multiple input columns because its first argument is variadic (...). From ?rleid:

rleid(..., prefix=NULL)

... A sequence of numeric, integer64, character or logical vectors, all of same length. For interactive use.
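
To illustrate what a run id over several vectors means, here is a rough base R emulation. The helper run_id is hypothetical and is not how data.table implements rleid; it assumes all inputs have the same length and contain no NA values. A new id starts whenever any of the vectors changes from one position to the next.

```r
# Hypothetical helper (not part of data.table): run id across one or more vectors
run_id <- function(...) {
  cols <- list(...)
  n <- length(cols[[1]])
  if (n == 0) return(integer(0))
  # changed[i] is TRUE when any vector differs between position i and i + 1
  changed <- rep(FALSE, n - 1)
  for (col in cols) {
    changed <- changed | (col[-1] != col[-n])
  }
  # The first element always opens run 1; every change opens the next run
  cumsum(c(TRUE, changed))
}

Var  <- c("A", "A", "A", "B", "B")
Var2 <- c("1", "1", "2", "2", "2")

ids_one <- run_id(Var)        # 1 1 1 2 2
ids_two <- run_id(Var, Var2)  # 1 1 2 3 3
```

With both vectors, the third element starts a new run because Var2 changes there even though Var does not.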

With multiple columns, either pass the columns to rleid individually, or use rleidv with a subset of the data.frame/data.table as input.

setDT(df)[, .N, .(Var,  Var2, grp = rleid(Var, Var2))][,
    grp := NULL][ !N %in% boxplot(N, plot = FALSE)$out]
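
For this data the boxplot filter keeps all twelve runs, as the output above shows, so the single "D" and "C" remain. The rule described in the question could be sketched in base R as follows. This is only one reading of that rule, with these assumptions: an outlier is a run of length 1, it is relabelled when the k = 3 values on each side all share one label, and it is dropped otherwise (including when fewer than k neighbours exist on a side). The function name smooth_singletons is hypothetical.

```r
# Hypothetical helper: relabel or drop isolated values (runs of length 1)
smooth_singletons <- function(x, k = 3) {
  x <- as.character(x)
  n <- length(x)
  runs <- rle(x)
  # Start position of every run in the original vector
  starts <- cumsum(c(1, head(runs$lengths, -1)))
  keep <- rep(TRUE, n)
  out <- x
  for (i in which(runs$lengths == 1)) {
    pos <- starts[i]
    before <- if (pos > k) x[(pos - k):(pos - 1)] else character(0)
    after  <- if (pos + k <= n) x[(pos + 1):(pos + k)] else character(0)
    neighbours <- c(before, after)
    if (length(neighbours) == 2 * k && length(unique(neighbours)) == 1) {
      out[pos] <- neighbours[1]  # all 2 * k neighbours agree: relabel
    } else {
      keep[pos] <- FALSE         # neighbours disagree or are missing: drop
    }
  }
  out[keep]
}

# The question's vector, rebuilt from its run values and run lengths
Var <- rep(c("A", "D", "A", "B", "C", "B", "C", "D", "A", "B", "C", "D"),
           times = c(6, 1, 4, 10, 1, 2, 10, 8, 12, 12, 16, 5))

cleaned <- smooth_singletons(Var)
cleaned_runs <- rle(cleaned)
data.frame(Var = cleaned_runs$values, Occurrence = cleaned_runs$lengths)
```

Under these assumptions the "D" at position 7 becomes "A" and the "C" at position 22 is dropped, because the three values after it are "B", "B", "C". The first two runs of the cleaned vector are therefore 11 "A" and 12 "B".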

Upvotes: 1
