kailash

Reputation: 21

Multilabel classification with SVM using RapidMiner

I want to classify text data with an SVM classifier in RapidMiner. The classification would be of the multilabel type. Since my data is text, how can an SVM be used for this classification? As far as I know, an SVM works with numeric data only.

Upvotes: 0

Views: 5827

Answers (2)

Bernd Winter

Reputation: 41

This question may be rather old, but perhaps there are more people like me out there, just experimenting with RapidMiner and hoping to solve exactly the same problem.

I guess the first part, dealing with text in general using RapidMiner's "Text Mining Extension" plugin, has already been explained properly by maerch. But taking kailash's comments into account, the main issue seems to be the incompatibility between the binominal SVM model and the polynominal input/label set.

The SVM is actually fed by adding the meta operator "Polynominal by Binominal Classification" as a wrapper around it. The wrapper regroups the label classes several times (in a way you can choose with the "classification strategies" parameter) so that there are always exactly two groups, trains the SVM on each of these two-group problems, and combines the results. The resulting model is then capable of dealing with multiple classes.

The process snippet below illustrates an SVM (default parameters) with its Poly2Bi wrapper:

<process expanded="true">
    <operator activated="true" class="polynomial_by_binomial_classification" compatibility="5.3.015" expanded="true" height="76" name="Polynominal by Binominal Classification" width="90" x="112" y="120">
        <parameter key="classification_strategies" value="1 against all"/>
        <parameter key="random_code_multiplicator" value="2.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <process expanded="true">
            <operator activated="true" class="support_vector_machine_linear" compatibility="5.3.015" expanded="true" height="76" name="SVM (Linear)" width="90" x="179" y="210">
                <parameter key="kernel_cache" value="200"/>
                <parameter key="C" value="0.0"/>
                <parameter key="convergence_epsilon" value="0.001"/>
                <parameter key="max_iterations" value="100000"/>
                <parameter key="scale" value="true"/>
                <parameter key="L_pos" value="1.0"/>
                <parameter key="L_neg" value="1.0"/>
                <parameter key="epsilon" value="0.0"/>
                <parameter key="epsilon_plus" value="0.0"/>
                <parameter key="epsilon_minus" value="0.0"/>
                <parameter key="balance_cost" value="false"/>
                <parameter key="quadratic_loss_pos" value="false"/>
                <parameter key="quadratic_loss_neg" value="false"/>
            </operator>
            <connect from_port="training set" to_op="SVM (Linear)" to_port="training set"/>
            <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
            <portSpacing port="source_training set" spacing="0"/>
            <portSpacing port="sink_model" spacing="0"/>
        </process>
    </operator>
    <connect from_port="training" to_op="Polynominal by Binominal Classification" to_port="training set"/>
    <connect from_op="Polynominal by Binominal Classification" from_port="model" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
</process>
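
To make the "1 against all" strategy a bit more concrete, here is a minimal Python sketch of the idea outside RapidMiner. It is not RapidMiner's implementation: the function names, the toy data, and the centroid-based binary learner (a simple stand-in for the SVM) are all assumptions made for illustration. Each class gets its own binary problem ("this class" vs. "everything else"), and the prediction is the class whose binary model scores highest.

```python
def train_centroid_scorer(rows, is_positive):
    """Toy binary learner, a stand-in for an SVM (assumption, not RapidMiner code).

    Returns a linear scoring function whose decision boundary lies halfway
    between the centroid of the positive rows and that of the negative rows.
    """
    dim = len(rows[0])

    def centroid(group):
        return [sum(r[i] for r in group) / len(group) for i in range(dim)]

    pos = centroid([r for r, p in zip(rows, is_positive) if p])
    neg = centroid([r for r, p in zip(rows, is_positive) if not p])
    weights = [a - b for a, b in zip(pos, neg)]
    bias = -sum(w * (a + b) / 2 for w, a, b in zip(weights, pos, neg))
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + bias


def train_one_against_all(rows, labels, train_binary=train_centroid_scorer):
    """Train one binary model per class: that class vs. all other classes."""
    scorers = {}
    for cls in sorted(set(labels)):
        binary_labels = [label == cls for label in labels]
        scorers[cls] = train_binary(rows, binary_labels)
    return scorers


def predict(scorers, x):
    """Pick the class whose binary model gives the highest score."""
    return max(scorers, key=lambda cls: scorers[cls](x))


# Hypothetical word-vector rows with a three-valued (polynominal) label.
rows = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0], [0, 0, 1], [0, 0.1, 0.9]]
labels = ["sports", "sports", "politics", "politics", "tech", "tech"]
scorers = train_one_against_all(rows, labels)
```

With three classes this trains three binary models, so the binary learner never sees more than two groups at a time, which is the point of the wrapper.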

Note that (at least) version 5.3.015 of RapidMiner complains when the Poly2Bi operator is used in this fashion inside the training region of a Validation operator and there's a Performance operator in the testing area. The Performance operator then reports this error:

Label and prediction must be of the same type but are polynominal and nominal, respectively.

In the RapidMiner forums, however, people point out that this seems to be a useless warning which you can ignore. In my case, the process worked fine despite the message.

Upvotes: 1

maerch

Reputation: 2063

The missing piece you are looking for is called "word vector". Basically you have to create a new example set where a single attribute will represent a single word. For a given example (i.e. a document) the (numerical) value for this attribute will show the "importance" of this word for this document.

A naive approach would be to use the count of the word within the document, but typically you should use TF-IDF (term frequency–inverse document frequency), which takes the whole document corpus into account as well.
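
As a rough illustration of the difference between raw counts and TF-IDF, here is a minimal Python sketch. It uses one common textbook variant (relative term frequency times the natural log of the inverse document frequency); the exact formula and normalization RapidMiner applies may differ, and the function name and toy documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns (vocabulary, one weight vector per doc)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical, already tokenized documents.
docs = [
    ["svm", "needs", "numeric", "data"],
    ["text", "data", "becomes", "word", "vector"],
]
vocab, vectors = tf_idf(docs)
```

In this toy corpus "data" occurs in both documents, so its weight is zero everywhere, while a word that occurs in only one document keeps a positive weight there. That is the "takes the whole corpus into account" part.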

To do this in RapidMiner you have to install the text mining extension and use operators like "Process Documents from Data" or "Process Documents from Files". Keep in mind that for text mining you will need further preprocessing steps, such as creating tokens, removing stop words (common words which you can find in nearly all documents and which are therefore not very helpful) and stemming the words (so that "word" and "words" are treated equally).

Here is a small example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.009">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="75">
        <parameter key="text" value="I want to classify text data using classifier model SVM with Rapidminer tool. Classification would be of multilable type. Since my data is of text type, how SVM can be used for this classification. I know that SVM works with numeric data only."/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="165">
        <parameter key="text" value="The missing piece you are looking for is called &quot;word vector&quot;. Basically you have to create a new example set for which the attributes will represent the words. For a given example (i.e. a document) the (numerical) value for this attribute will show the &quot;importance&quot; of this word for this document. &#10;&#10;A naive approach would be to use the count of the word within the document, but typically you should use TD-IDF (term frequency–inverse document frequency) which will take the whole document corpus into account as well.&#10;&#10;To do this in RapidMiner you have to install the text mining extension and use operators like &quot;Process Documents from Data&quot; or &quot;Process Documents from Files&quot;. Keep in mind that for text mining you will need to conduct more preprocessing steps like creating tokens, removing stop words (common words which you can find in nearly all documents and which are therefore not very helpful) and use the stem of the words (so &quot;word&quot; and &quot;words&quot; will be treated equally).&#10;&#10;Here is a small example:"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="112" name="Process Documents" width="90" x="179" y="75">
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.3.000" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
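
The Tokenize, Filter Stopwords and Stem chain inside "Process Documents" above can also be sketched in a few lines of Python, just to show what each step does to the text. This is only an approximation: the stop-word list is a tiny made-up one, and the stemmer merely strips a plural "s" rather than implementing the Porter algorithm the process uses.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "with", "for", "in"}


def tokenize(text):
    """Split on anything that is not a letter and lowercase the tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a trailing plural "s" only. NOT the Porter stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Tokenize, filter stop words, then stem, in that order."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


tokens = preprocess("SVM works with numeric data")
```

The resulting token lists are what a word-vector step would then count and weight.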

BTW: There are also a few quite good RapidMiner text mining tutorials on YouTube.

Upvotes: 1
