Tom Maier

Reputation: 33

Procedure of the OneR algorithm in R

I have used the oneR algorithm from the FSelector package to find the attribute with the lowest error rate. My class attribute takes the values yes and no, and the values of the other attributes are also yes and no.

The result of the OneR algorithm is:

Rank 1, attribute OR1:

            Attribute = 0   Attribute = 1
Class = 0           25243               0
Class = 1            1459              18

Error rate: 1459 (0 + 1459)

Rank 2, attribute OR2:

            Attribute = 0   Attribute = 1
Class = 0           25243               0
Class = 1            1460              17

Error rate: 1460 (0 + 1460)

However, if I use the correlation function on the same data frame, the best attributes have a lower error rate than the attributes I get from the oneR function.

Attribute CO4:

            Attribute = 0   Attribute = 1
Class = 0           25204              39
Class = 1            1348             129

Error rate: 1387 (39 + 1348)

Can anybody tell me why the oneR algorithm does not rank CO4 as the best attribute (based on the error rate)?

Which criteria does the oneR algorithm use?

--- Addition to better understand my question ---

The complete data set is too big to show. I have constructed a new data pool that shows the same effect:

DELAYED   OR1   CO4   ..
      1     1     1
      0     0     0
      0     0     1
      1     0     1
      0     0     0
      1     0     1
      0     0     0
      1     0     1

The code to show the error rate for a single attribute:

print(table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1))
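
The manual error rate can also be computed from that table instead of being read off by hand. This is a minimal sketch in base R, assuming the data pool contains only the three columns shown above. The helper name error_count is hypothetical and not part of FSelector. It counts, for each attribute value, the instances that do not belong to that value's most frequent class.

```r
# The eight-row example from above, rebuilt with only the three columns shown.
datapool_stackoverflow <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Hypothetical helper: number of rows misclassified when every attribute
# value predicts its most frequent class.
error_count <- function(class, attribute) {
  tab <- table(class, attribute)          # rows: class, columns: attribute value
  sum(colSums(tab) - apply(tab, 2, max))  # per column: total minus majority count
}

errors <- sapply(
  datapool_stackoverflow[c("OR1", "CO4")],
  error_count,
  class = datapool_stackoverflow$DELAYED
)
print(errors)
```

For this data the helper gives 3 for OR1 and 1 for CO4, the same values as the manual calculation.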

The code for the oneR function:

library(FSelector)

oneR_stackoverflow <- oneR(DELAYED~., datapool_stackoverflow)

subset_stackoverflow <- cutoff.k(oneR_stackoverflow, 2)

print(subset_stackoverflow)

The code for the correlation:

cor(as.numeric(datapool_stackoverflow$DELAYED), as.numeric(datapool_stackoverflow$OR1))

In this case the results are:

Attribute OR1:

            Attribute = 0   Attribute = 1
Class = 0               4               0
Class = 1               3               1

Manually calculated error rate: 3 (0 + 3)

Attribute CO4:

            Attribute = 0   Attribute = 1
Class = 0               3               1
Class = 1               0               4

Manually calculated error rate: 1 (1 + 0)

Correlation: attribute OR1: 0.377, attribute CO4: 0.77

OneR: "OR1", "CO4"

Why does the oneR function return OR1 as the best attribute for classification?

Upvotes: 0

Views: 644

Answers (3)

vonjd

Reputation: 4380

No, CO4 should be chosen; choosing the other attribute is wrong. See what the OneR package (available on CRAN) gives:

> library(OneR)
> DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
> OR1 <- c(1, rep(0, 7))
> CO4 <- c(1, 0, 1, 1, 0, 1, 0, 1)
> 
> data <- data.frame(DELAYED, OR1, CO4)
> 
> model <- OneR(formula = DELAYED ~., data = data, verbose = T)

    Attribute Accuracy
1 * CO4       87.5%   
2   OR1       62.5%   
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'

> summary(model)

Rules:
If CO4 = 0 then DELAYED = 0
If CO4 = 1 then DELAYED = 1

Accuracy:
7 of 8 instances classified correctly (87.5%)

Contingency table:
       CO4
DELAYED   0   1 Sum
    0   * 3   1   4
    1     0 * 4   4
    Sum   3   5   8
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 2.1333, df = 1, p-value = 0.1441

> 
> model_2 <- OneR(formula = DELAYED ~ OR1, data = data)
> summary(model_2)

Rules:
If OR1 = 0 then DELAYED = 0
If OR1 = 1 then DELAYED = 1

Accuracy:
5 of 8 instances classified correctly (62.5%)

Contingency table:
       OR1
DELAYED   0   1 Sum
    0   * 4   0   4
    1     3 * 1   4
    Sum   7   1   8
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 0, df = 1, p-value = 1

You can find more information about the OneR package here: https://github.com/vonjd/OneR

(full disclosure: I am the author of this package)

Upvotes: 0

Tom Maier

Reputation: 33

OK, I have the solution. For each attribute, the algorithm sums the error rates of the attribute's individual values. The error rate of a value is the number of instances outside that value's majority class, divided by the number of instances with that value.

In this example:

Attribute OR1: 3/7 + 0/1 = 3/7

Attribute CO4: 0/3 + 1/5 = 0.2
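
The two sums can be reproduced in base R. This is only a sketch of the calculation described here, not FSelector's source code, and the helper name summed_error_rate is hypothetical.

```r
DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
OR1     <- c(1, 0, 0, 0, 0, 0, 0, 0)
CO4     <- c(1, 0, 1, 1, 0, 1, 0, 1)

# Hypothetical helper (an assumption, not taken from FSelector): for each
# attribute value, divide the instances outside the majority class by the
# number of instances with that value, then add the rates up.
summed_error_rate <- function(class, attribute) {
  tab <- table(class, attribute)
  per_value <- (colSums(tab) - apply(tab, 2, max)) / colSums(tab)
  sum(per_value)
}

or1_score <- summed_error_rate(DELAYED, OR1)  # 3/7 + 0/1
co4_score <- summed_error_rate(DELAYED, CO4)  # 0/3 + 1/5
print(c(OR1 = or1_score, CO4 = co4_score))
```

The two results, 3/7 (about 0.4286) and 0.2, are the same two numbers that appear as attr_importance in the FSelector output in the other answer, with the attributes swapped. That fits a score in which a lower summed error rate means a higher importance, but this is an inference from the numbers and not confirmed from the package source.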

Upvotes: 0

Lars Kotthoff

Reputation: 109242

You haven't given the types of your data, but I'm assuming that you have numerical values. FSelector discretizes these values before using them in oneR and it seems that bad things happen there (which may be a bug in RWeka's Discretize function). However, you probably want factor variables anyway and not numeric data as you have only 0-1 values. Then everything works fine for me:

> df = data.frame(delayed=factor(c(1,0,0,1,0,1,0,1)), or1 = factor(c(1,0,0,0,0,0,0,0)), co4 = factor(c(1,0,1,1,0,1,0,1)))
> library(FSelector)
> oneR(delayed~., df)
    attr_importance
or1       0.2000000
co4       0.4285714

As you can see, co4 now has a much higher importance than or1, as it should have.
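
If the data frame already exists with numeric 0/1 columns, the columns can be converted to factors in place before calling oneR. This is a minimal sketch in base R, assuming every column of the data frame is meant to be categorical. The names df_numeric and df_factor are made up for the illustration.

```r
# Numeric 0/1 columns, as they might look after reading the data from a file.
df_numeric <- data.frame(
  delayed = c(1, 0, 0, 1, 0, 1, 0, 1),
  or1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  co4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Convert every column to a factor; assigning into df_factor[] keeps the
# data.frame structure and the column names.
df_factor <- df_numeric
df_factor[] <- lapply(df_numeric, factor)

str(df_factor)

# With FSelector installed, the call from above would then be:
# library(FSelector)
# oneR(delayed ~ ., df_factor)
```

If only some columns are categorical, the same lapply call can be restricted to those columns.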

Upvotes: 0
