Andy Hui

Reputation: 420

Weka CSVLoader - error (wrong number of values. Read)

I'm trying to convert a CSV file to ARFF using Weka, but it pops up this error message:

weka.core.converters.CSVLoader failed to load <my file>
Reason:
wrong number of values. Read 7, expected 9, read Token[EOL], line 26

I have tried replacing special characters such as " ' , % but the error remains the same.

Any idea?

Here is a link to the file: https://drive.google.com/open?id=1__u9SGOxd-ShU9Eei3tDjZ9s1MxzKEKZ

Upvotes: 0

Views: 837

Answers (1)

Sentry

Reputation: 4113

Short answer:

The line breaks inside values are the problem. Replace them with something else, e.g., spaces.
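One way to do this as a preprocessing step is a small helper that walks the file content and replaces line breaks that fall inside double-quoted fields with spaces. This is a minimal sketch, assuming `"` is the only enclosure character and quotes are balanced; the class and method names are made up for illustration:

```java
public class FixCsv {
    // Replace line breaks inside double-quoted fields with spaces, so that
    // each logical CSV record ends up on a single physical line.
    // Note: a CRLF inside quotes becomes two spaces, which is harmless here.
    static String flattenQuotedLineBreaks(String csv) {
        StringBuilder out = new StringBuilder();
        boolean inQuotes = false;
        for (char c : csv.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;   // toggle on every enclosure character
                out.append(c);
            } else if (inQuotes && (c == '\n' || c == '\r')) {
                out.append(' ');        // line break inside a value -> space
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

After running your file through something like this, the CSVLoader should read the expected number of fields per row.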

Long answer:

The problem is that your values contain line breaks (\n or such) and the CSVLoader from Weka can't handle them. Line 26 is the first line in your file that contains such a line break, so the CSVLoader thinks the record ends there after only 7 fields have been read.

Have a look at the source code:

private void initTokenizer(StreamTokenizer tokenizer) {
  tokenizer.resetSyntax();
  tokenizer.whitespaceChars(0, (' ' - 1));
  tokenizer.wordChars(' ', '\u00FF');
  tokenizer.whitespaceChars(m_FieldSeparator.charAt(0),
    m_FieldSeparator.charAt(0));
  // tokenizer.commentChar('%');

  String[] parts = m_Enclosures.split(",");
  for (String e : parts) {
    if (e.length() > 1 || e.length() == 0) {
      throw new IllegalArgumentException(
        "Enclosures can only be single characters");
    }
    tokenizer.quoteChar(e.charAt(0));
  }

  tokenizer.eolIsSignificant(true);    // <--- This line is important
}

The last line there basically says that the tokenizer should treat an end of line (EOL) as a special character (see the API doc):

If the flag is false, end-of-line characters are treated as white space and serve only to separate tokens.
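You can see the effect by reproducing the CSVLoader's tokenizer setup on a tiny, hypothetical 3-field row whose middle value contains a line break. Because `StreamTokenizer` also terminates a quoted string at a line terminator, the record is split in two:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenDemo {
    // Same syntax setup as Weka's initTokenizer, with ',' as field separator.
    static List<String> tokenize(String csv) throws IOException {
        StreamTokenizer t = new StreamTokenizer(new StringReader(csv));
        t.resetSyntax();
        t.whitespaceChars(0, ' ' - 1);
        t.wordChars(' ', '\u00FF');
        t.whitespaceChars(',', ',');   // ',' separates fields
        t.quoteChar('"');
        t.eolIsSignificant(true);      // <--- the important flag
        List<String> tokens = new ArrayList<>();
        while (t.nextToken() != StreamTokenizer.TT_EOF) {
            if (t.ttype == StreamTokenizer.TT_EOL) {
                tokens.add("EOL");
            } else {
                tokens.add(t.sval);    // word or quoted-string value
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // One logical record, but the quoted middle value contains '\n':
        System.out.println(tokenize("a,\"b\nc\",d\n"));
        // -> [a, b, EOL, c, ,d, EOL]
    }
}
```

Note the `EOL` token right after `b`: the quoted value is cut off at the line break, so the loader sees a record with too few fields, and the remainder of the value starts a new (equally broken) record.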

The getInstance method of the CSVLoader contains this logic (summarized):

private String getInstance(StreamTokenizer tokenizer) throws IOException {  

    // [...]

    boolean first = true;
    boolean wasSep;
    m_current.clear();

    int i = 0;
    while (tokenizer.ttype != StreamTokenizer.TT_EOL
      && tokenizer.ttype != StreamTokenizer.TT_EOF) {

      // Get next token
      if (!first) {
        StreamTokenizerUtils.getToken(tokenizer);
      }

      if (tokenizer.ttype == m_FieldSeparator.charAt(0)
        || tokenizer.ttype == StreamTokenizer.TT_EOL) {
        m_current.add("?");
        wasSep = true;
      } else {
        // Parsing values
        // [...]
        wasSep = false;
      }

      if (!wasSep) {
        StreamTokenizerUtils.getToken(tokenizer);
      }
      first = false;
      i++;
    }

    // check number of values read
    if (m_current.size() != m_structure.numAttributes()) {
      for (Object o : m_current) {
        System.out.print(o.toString() + "|||");
      }
      System.out.println();
      StreamTokenizerUtils.errms(tokenizer, "wrong number of values. Read "
        + m_current.size() + ", expected " + m_structure.numAttributes());

    }
  // [...]
}

So, no matter whether the line break is inside quotes, the tokenizer will always treat it as StreamTokenizer.TT_EOL, which ends the current record, and thus you end up with fewer fields than expected.

Upvotes: 1
