TomP

Reputation: 133

Weka and CSV files

I'm trying to import some data into Weka. The data is in a CSV file and consists of a numerical ID followed by some string data (tweets). I'm getting an error that reads "Wrong number of values, Read 1, expected 2 Token[EOL], line 17". I'm using double quotes as the enclosure characters for the string data. I understand that something (presumably an EOL character?) is causing Weka to incorrectly split some of the string data into multiple entries on the same line, but I'm not sure how to fix the EOL token problem.

My data set can be viewed here. The current data set is on Sheet 2:

https://docs.google.com/spreadsheets/d/1Yclu0t4ITFWn6itYBsVtkGalmP9BPaWFFP6U6jAeLMU/edit?usp=sharing

The text file itself may be found here:

https://drive.google.com/file/d/0B433FqC3TscQQkRxZklQclA3Z3M/view?usp=sharing

EDIT: The same error now occurs on line 3. The only newline character there is the one at the end of the line that marks a new entry, so I'm not sure why it's having issues.

Upvotes: 2

Views: 2724

Answers (1)

Rushdi Shams

Reputation: 2423

In its datasets, Weka treats a newline character as the end of an instance. Your line 17 is actually a multi-line tweet, which confuses Weka. You can either

  1. use a RegEx to remove the newline characters from every tweet, or
  2. clean the tweets while downloading them, so that they contain no newline characters.

Unfortunately, Weka does not have a mechanism to get rid of this problem by itself (as far as I know).
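
Option 1 could look something like the minimal Python sketch below. It assumes the tweets are enclosed in double quotes, so that Python's `csv` module can read a multi-line tweet as a single field. The function names `flatten_newlines` and `flatten_rows` are made up for illustration, and replacing each line break with a single space is just one possible choice.

```python
import csv
import io
import re


def flatten_newlines(text):
    """Replace every run of line breaks (plus surrounding whitespace) with one space."""
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


def flatten_rows(source):
    """Yield CSV rows from a file-like object with line breaks removed from each field.

    csv.reader keeps a quoted field together even when it spans several
    physical lines, so a multi-line tweet arrives here as one field.
    """
    for row in csv.reader(source):
        yield [flatten_newlines(field) for field in row]


# Hypothetical sample: the first tweet spans two physical lines.
sample = io.StringIO('1,"line one\nline two"\n2,"single"\n')
for row in flatten_rows(sample):
    print(row)
```

For a real file, open it with `newline=""` (as the `csv` documentation recommends) and pass the file object to `flatten_rows` in place of the `StringIO` sample.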


EDIT

Okay, here are some other things that need to be fixed (according to your EDITS in the question):

  1. Replace each single quote (') with \'
  2. Replace each grave accent (`) with \`
  3. Many tweets contain quotes inside quotes. The inner double quotes (") should be replaced by \"
  4. If you put your tweets inside double quotes, then your header should be id, "text"
  5. Some tweets contain two consecutive double quotes; remove them or replace them with \".
  6. I lost track of exactly where, but I think some tweets still contain newlines (or at least one tweet does).

These are just a few things that I noticed. There might be more. Time will tell.
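
As a rough sketch, the escaping fixes in items 1, 2, 3 and 5 could be applied to each tweet in Python before it is written back out. The names `escape_tweet` and `to_csv_line` are hypothetical, and collapsing a run of double quotes into a single escaped quote is only one reading of item 5. That the backslash escapes are what Weka needs is taken from the list above, not verified here.

```python
import re


def escape_tweet(text):
    """Escape quote characters in a tweet as suggested in the list above."""
    # Item 5: collapse runs of consecutive double quotes into a single one.
    text = re.sub(r'"{2,}', '"', text)
    # Items 1-3: put a backslash before each ', " and ` character.
    return re.sub(r"(['\"`])", r"\\\1", text)


def to_csv_line(tweet_id, text):
    """Build one output line: the numeric ID, then the escaped tweet in double quotes."""
    return f'{tweet_id},"{escape_tweet(text)}"'


# Item 4: header to use when the tweets are enclosed in double quotes.
HEADER = 'id, "text"'

print(HEADER)
print(to_csv_line(3, 'He said ""hi"", it\'s `ok`'))
```

This sketch does not look for characters that are already escaped, so it should be run on the raw tweets only once.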

Upvotes: 2
