Reputation: 155
This may be a stupid question, but I'm working with Weka to predict the effect of different genes on cancer, with data something like this:
cancer gene1 gene2 gene3 ....
yes 0.85 1.23 3.52 ....
no 7.58 6.25 8.91 ....
no 6.52 5.25 9.85 ....
yes 1.23 0.59 0.74 ....
.....
but with 25 cancer = yes cases, 158 cancer = no cases, and 75 genes. My issue is that when I run, for example, InfoGain or GainRatio, I get my selected or ranked attributes (genes), but how can I say whether those genes predict cancer = yes or cancer = no?
Many thanks!
Upvotes: 0
Views: 1166
Reputation: 6284
In your question and your comment on another answer you mention GainRatio, InfoGain and Cfs. These are attribute selection methods. You can use them to reduce the number of attributes in your dataset by selecting the ones that appear to provide the most information about the property you are trying to predict.
It sounds as if what you want to know is whether each attribute (in your case, gene) is positively or negatively correlated with the outcome of interest - in other words, does a high level of this gene correlate with a high probability of cancer or a low one? This is not what the attribute selection methods are for.
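To illustrate why a ranking score cannot tell you the direction, here is a minimal sketch in plain Python (not Weka code). It computes the information gain of gene1 on the four example rows from the question, then again with the class labels swapped. The fixed threshold of 3.0 and the helper names are assumptions made for the illustration; Weka's InfoGain evaluator discretizes numeric attributes in its own way.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(values, labels, threshold):
    """Information gain of splitting a numeric attribute at `threshold`.

    The threshold is a hypothetical, hand-picked cut point.
    """
    low = [c for v, c in zip(values, labels) if v <= threshold]
    high = [c for v, c in zip(values, labels) if v > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder

# gene1 values and classes from the four example rows in the question
gene1 = [0.85, 7.58, 6.52, 1.23]
cancer = ["yes", "no", "no", "yes"]
# Same data with every label swapped: the direction of the effect is reversed
flipped = ["no" if c == "yes" else "yes" for c in cancer]

gain = info_gain(gene1, cancer, threshold=3.0)
gain_flipped = info_gain(gene1, flipped, threshold=3.0)
print(gain, gain_flipped)
```

Both calls give the same score, so a high InfoGain only says the gene is informative, not whether high values point to yes or to no.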
What you want to do, as knb's answer suggests, is build a classification model that predicts the class (cancer = yes or cancer = no) from the other attributes. A wide variety of modelling algorithms is available and they differ in their interpretability, but you might start by looking at Weka's functions.Logistic, which will give you a positive or negative regression coefficient for each attribute (the sign is relative to the class value named in the output), or trees.J48, which will build a decision tree showing which attributes are being used to make the prediction and what the outcome is for each combination of high or low values of the attributes.
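As a rough illustration of reading the direction off a coefficient's sign, here is a sketch in plain Python. It is not Weka's Logistic implementation: it fits a one-gene logistic model with basic batch gradient descent, and the function name, learning rate and step count are assumptions chosen for the example. The data are the gene1 values from the four rows in the question, with cancer = yes coded as 1.

```python
import math

def fit_logistic(xs, ys, lr=0.05, steps=2000):
    """Fit p(y=1) = sigmoid(b + w*x) by plain batch gradient descent.

    lr and steps are arbitrary illustrative settings, not tuned values.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b + w * x)))
            grad_w += (p - y) * x   # gradient of the log-loss w.r.t. w
            grad_b += (p - y)       # gradient of the log-loss w.r.t. b
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# gene1 values from the question; 1 = cancer yes, 0 = cancer no
gene1 = [0.85, 7.58, 6.52, 1.23]
cancer = [1, 0, 0, 1]

w, b = fit_logistic(gene1, cancer)
if w > 0:
    direction = "higher gene1 -> cancer = yes more likely"
else:
    direction = "higher gene1 -> cancer = no more likely"
print(direction)
```

On these four rows the high gene1 values belong to the cancer = no cases, so the fitted coefficient comes out negative. With four perfectly separated points the size of the coefficient is not meaningful; only the sign is of interest here.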
If you have a large number of attributes and you believe that only a smaller subset of them are informative then you may wish to use attribute selection before classification - either manually by inspecting the output from the attribute selection method and removing low-scoring attributes before you classify, or automatically in Weka using e.g. meta.AttributeSelectedClassifier.
If you need more help on choosing and using a suitable classification technique I suggest looking at the Weka documentation and online courses.
Upvotes: 1
Reputation: 9295
I don't know much about genetics, but how do you know that a single gene causes cancer? It may well be many interacting genes, and how you account for those interactions is something you will have to work out.
Focusing on the formal/technical side: in Weka, your class attribute "cancer" needs to be the last (rightmost) column, or you have to set it manually with the select box "(Nom) cancer" each time before you click the "Start" button.
You might have a look at the diabetes.arff file that comes with Weka; it has a structure similar to your data file.
If you want an interpretable model, you could also run the decision tree algorithm "J48" (in the "Classify" tab) and, in the properties window, set minNumObj to a higher value (find an appropriate value by trial and error). This creates shallow trees with few levels/decisions/if-statements. Then right-click on the run (in the lower left panel of the "Classify" tab) and choose "Visualize Tree".
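To give a feel for what such a shallow tree tells you, here is a sketch of a one-level tree (a "decision stump") in plain Python. It is a simplification and not J48: it picks the split with the fewest misclassifications rather than using gain ratio, and the min_leaf parameter is a hypothetical stand-in that only loosely mirrors what minNumObj does (a minimum number of instances per leaf).

```python
from collections import Counter

def best_stump(values, labels, min_leaf=1):
    """Return (threshold, low_label, high_label, errors) for the best
    single split, or None if no split leaves min_leaf rows on each side."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        low = [c for _, c in pairs[:i]]
        high = [c for _, c in pairs[i:]]
        # Each leaf predicts its majority class
        low_label, low_hits = Counter(low).most_common(1)[0]
        high_label, high_hits = Counter(high).most_common(1)[0]
        errors = len(pairs) - low_hits - high_hits
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or errors < best[3]:
            best = (threshold, low_label, high_label, errors)
    return best

# gene1 values and classes from the four example rows in the question
gene1 = [0.85, 7.58, 6.52, 1.23]
cancer = ["yes", "no", "no", "yes"]

stump = best_stump(gene1, cancer, min_leaf=2)
threshold, low_label, high_label, errors = stump
print(f"gene1 <= {threshold:.3f}: cancer = {low_label}")
print(f"gene1 >  {threshold:.3f}: cancer = {high_label}")
```

The two printed rules are the kind of statement you read off the visualized tree: which side of a threshold goes with which class. Raising min_leaf too far leaves no valid split at all, which is the same trade-off you explore by trial and error with minNumObj.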
Upvotes: 2
Reputation: 21
You can train on your data in Weka and save the resulting model in XML or another format. Then load that model in Weka, Python, or any other language you are comfortable with. After loading the model you can test it on a dataset. In Weka this is very easy. For a clearer picture, follow this link: https://machinelearningmastery.com/save-machine-learning-model-make-predictions-weka/
Upvotes: 1