Reputation: 33
I used the oneR function from the FSelector package to find the attribute with the lowest error rate. My class attribute is binary (yes/no), and all other attributes are binary (yes/no) as well.
The result of the OneR algorithm is:
Rank 1
Attribute: OR1
Contingency table (rows: class, columns: attribute value):
             Attribute = 0   Attribute = 1
Class = 0            25243               0
Class = 1             1459              18
Error rate: 1459 (0 + 1459)
Rank 2
Attribute: OR2
Contingency table (rows: class, columns: attribute value):
             Attribute = 0   Attribute = 1
Class = 0            25243               0
Class = 1             1460              17
Error rate: 1460 (0 + 1460)
However, if I rank the attributes of the same data frame by correlation instead, the best attributes have a lower error rate than the ones returned by the oneR function.
Attribute: CO4
Contingency table (rows: class, columns: attribute value):
             Attribute = 0   Attribute = 1
Class = 0            25204              39
Class = 1             1348             129
Error rate: 1387 (39 + 1348)
Can anybody tell me why the OneR algorithm does not rank CO4 as the best attribute, given its lower error rate?
Which criteria does the OneR algorithm use?
--- Addition to better understand my question ---
The complete data set is too big to show here, so I have constructed a small data set that shows the same effect:
DELAYED   OR1   CO4
      1     1     1
      0     0     0
      0     0     1
      1     0     1
      0     0     0
      1     0     1
      0     0     0
      1     0     1
The code to show the error rate for a single attribute:
print(table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1))
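The manual error count read off such a table can also be computed directly. The following is a minimal sketch in base R, assuming the rule predicts the majority class for each attribute value; `count_errors` is a hypothetical helper name, not part of FSelector, and the data frame is rebuilt here from the toy data above so the snippet runs on its own.

```r
# Toy data copied from the question.
datapool <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Hypothetical helper: number of rows misclassified by a one-attribute rule
# that predicts the majority class for each attribute value.
count_errors <- function(class, attribute) {
  tab <- table(class, attribute)          # rows: class, columns: attribute value
  sum(colSums(tab) - apply(tab, 2, max))  # non-majority rows in each column
}

errors_or1 <- count_errors(datapool$DELAYED, datapool$OR1)
errors_co4 <- count_errors(datapool$DELAYED, datapool$CO4)
print(c(OR1 = errors_or1, CO4 = errors_co4))
```

On the toy data this counts 3 errors for OR1 and 1 error for CO4.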
The code for the oneR function:
library(FSelector)
oneR_stackoverflow <- oneR(DELAYED~., datapool_stackoverflow)
subset_stackoverflow <- cutoff.k(oneR_stackoverflow, 2)
print(subset_stackoverflow)
The code for the correlation:
cor(as.numeric(datapool_stackoverflow$DELAYED), as.numeric(datapool_stackoverflow$OR1))
In this case the results are:
OR1, contingency table (rows: class, columns: attribute value):
             Attribute = 0   Attribute = 1
Class = 0                4               0
Class = 1                3               1
Manually calculated error rate: 3 (0 + 3)
CO4, contingency table (rows: class, columns: attribute value):
             Attribute = 0   Attribute = 1
Class = 0                3               1
Class = 1                0               4
Error rate: 1 (1 + 0)
Correlation: OR1: 0.377, CO4: 0.77
OneR: "OR1", "CO4"
Why does the oneR function rank OR1 as the best attribute for classification?
Upvotes: 0
Reputation: 4380
No, CO4 should be chosen; choosing the other attribute is wrong. See what the OneR package (available on CRAN) gives:
> library(OneR)
> DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
> OR1 <- c(1, rep(0, 7))
> CO4 <- c(1, 0, 1, 1, 0, 1, 0, 1)
>
> data <- data.frame(DELAYED, OR1, CO4)
>
> model <- OneR(formula = DELAYED ~., data = data, verbose = T)
    Attribute Accuracy
1 * CO4       87.5%
2   OR1       62.5%
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'
> summary(model)
Rules:
If CO4 = 0 then DELAYED = 0
If CO4 = 1 then DELAYED = 1
Accuracy:
7 of 8 instances classified correctly (87.5%)
Contingency table:
       CO4
DELAYED   0   1 Sum
    0   * 3   1   4
    1     0 * 4   4
    Sum   3   5   8
---
Maximum in each column: '*'
Pearson's Chi-squared test:
X-squared = 2.1333, df = 1, p-value = 0.1441
>
> model_2 <- OneR(formula = DELAYED ~ OR1, data = data)
> summary(model_2)
Rules:
If OR1 = 0 then DELAYED = 0
If OR1 = 1 then DELAYED = 1
Accuracy:
5 of 8 instances classified correctly (62.5%)
Contingency table:
       OR1
DELAYED   0   1 Sum
    0   * 4   0   4
    1     3 * 1   4
    Sum   7   1   8
---
Maximum in each column: '*'
Pearson's Chi-squared test:
X-squared = 0, df = 1, p-value = 1
You can find more information about the OneR package here: https://github.com/vonjd/OneR
(full disclosure: I am the author of this package)
Upvotes: 0
Reputation: 33
OK, I have the solution. For each attribute, the algorithm sums the error rates of its individual values, where the error rate of a value is the share of rows with that value that fall outside the majority class.
In this example:
Attribute OR1: 3/7 + 0/1 = 3/7
Attribute CO4: 0/3 + 1/5 = 1/5 = 0.2
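The arithmetic above can be checked on the toy data with a few lines of base R. This is only a sketch of the formula as described in this post, not FSelector's actual source code; `summed_error_rate` is a hypothetical helper name.

```r
# Toy data from the question.
DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
OR1     <- c(1, 0, 0, 0, 0, 0, 0, 0)
CO4     <- c(1, 0, 1, 1, 0, 1, 0, 1)

# Hypothetical helper: for every attribute value, take the share of rows
# outside the majority class, then sum these shares over all values.
summed_error_rate <- function(class, attribute) {
  tab <- table(class, attribute)                # rows: class, columns: value
  minority <- colSums(tab) - apply(tab, 2, max) # non-majority rows per value
  sum(minority / colSums(tab))
}

score_or1 <- summed_error_rate(DELAYED, OR1)  # 3/7 + 0/1
score_co4 <- summed_error_rate(DELAYED, CO4)  # 0/3 + 1/5
print(c(OR1 = score_or1, CO4 = score_co4))
```

Because each value's error is divided by that value's own row count, a rare value weighs as much as a common one, which is why this sum can disagree with the plain error count.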
Upvotes: 0
Reputation: 109242
You haven't given the types of your data, but I'm assuming that you have numerical values. FSelector discretizes these values before using them in oneR, and it seems that bad things happen there (which may be a bug in RWeka's Discretize function). However, you probably want factor variables anyway and not numeric data, as you have only 0-1 values. Then everything works fine for me:
> df = data.frame(delayed=factor(c(1,0,0,1,0,1,0,1)), or1 = factor(c(1,0,0,0,0,0,0,0)), co4 = factor(c(1,0,1,1,0,1,0,1)))
> library(FSelector)
> oneR(delayed~., df)
    attr_importance
or1       0.2000000
co4       0.4285714
As you can see, co4 now has a much higher importance than or1, as it should have.
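If the data frame already exists with numeric 0/1 columns, the columns can be converted to factors in place rather than rebuilt by hand. A minimal base R sketch, assuming every column (including the class) should become a factor; the data frame name `datapool` is only illustrative.

```r
# Numeric 0/1 toy data, as it might look before conversion.
datapool <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Convert every column to a factor; assigning into `[]` keeps the
# data.frame structure and the column names.
datapool_factor <- datapool
datapool_factor[] <- lapply(datapool_factor, factor)

str(datapool_factor)
```

The converted data frame can then be passed to oneR in the same way as `df` above.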
Upvotes: 0