kidman01

Reputation: 945

Problems with Pandas

Sorry for the vague title, but I don't really know what the problem is. I want to load a CSV file, split it into two arrays, and run a function on each of those arrays. It works for the first array, but the second one fails even though everything else is the same. I'm really stuck. The code is as follows:

from wordutility import wordutility
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np

data = pd.read_csv('sts_gold_tweet.csv', header=None, delimiter=';',
               quotechar='"')

# test = pd.read_csv('output.csv', header=None,
#                   delimiter=';', quotechar='"')

split_ratio = 0.9
train = data[:round(len(data)*split_ratio)]
test = data[round(len(data)*split_ratio):]

y = data[1]

print("Cleaning and parsing tweets data...\n")

traindata = []

for i in range(0, len(train[0])):
    traindata.append(" ".join(wordutility.tweet_to_wordlist(train[0][i], False)))

testdata = []

for i in range(0, len(test[0])):
    testdata.append(" ".join(wordutility.tweet_to_wordlist(test[0][i], False)))

The program works up until the very last line. The error is:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.4/site-packages/pandas/core/series.py", line 509, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/lib/python3.4/site-packages/pandas/core/index.py", line 1417, in get_value
    return self._engine.get_value(s, k)
  File "pandas/index.pyx", line 100, in pandas.index.IndexEngine.get_value (pandas/index.c:3097)
  File "pandas/index.pyx", line 108, in pandas.index.IndexEngine.get_value (pandas/index.c:2826)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3692)
  File "pandas/hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7201)
  File "pandas/hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7139)
KeyError: 0

(It says line 2 in the error message because I was trying the code in the Python shell, so line 2 refers to the last line of the code above.)

Hopefully someone can help me :). Thanks

EDIT

OK, it seems the split is not working as I thought it would. I did get two arrays as I wanted, but the row numbers are still those of the single original file: train runs from 0 to 1830 and test runs from 1831 to 2034, so my range was wrong. How would I go about splitting the CSV file "correctly"?

EDIT 2

>>> print(train[0:5])
                                               0         1
0  the angel is going to miss the athlete this we...  negative
1  It looks as though Shaq is getting traded to C...  negative
2     @clarianne APRIL 9TH ISN'T COMING SOON ENOUGH   negative
3  drinking a McDonalds coffee and not understand...  negative
4  So dissapointed Taylor Swift doesnt have a Twi...  negative

>>> print(test[0:5])
                                                  0         1
1831  Why is my PSP always dead when I want to use it?   negative
1832  @hillaryrachel oh i know how you feel. i took ...  negative
1833  @daveknox awesome-  corporate housing took awa...  negative
1834  @lakersnation Is this a joke?  I can't find them   negative
1835                              XBox Live still down   negative

So as you can see, the array "test" starts at row number 1831. I would have thought it would start at 0. I fixed my problem by changing the range in the for loop:

for i in range(len(train[0]), len(data)):

So my original problem is fixed; I'm just curious and eager to learn to write better code. Is this an OK thing to do, or should I split the CSV file in a different way?

Upvotes: 4

Views: 7567

Answers (1)

TheBlackCat

Reputation: 10298

When you do test[0][i], you are not getting the i-th element of test by position. test[0] selects the column with the "name" 0, and the [i] that follows looks up the row whose index label is i. When you split the pandas DataFrame in two, the original row labels were preserved. This means that the test DataFrame has no row labelled 0, since that row is in the first DataFrame.

Let me give you an example. Say you have the following DataFrame:

    0   1
0   0   1
1  10  11
2  20  21
3  30  31

When you split it, you end up with these DataFrames:

    0   1
0   0   1
1  10  11

and:

    0   1
2  20  21
3  30  31

Notice that the row labels of the second DataFrame start with 2, not 0, because those were the labels before the split. So when you try to get the row labelled 0, it isn't there. That is the source of your KeyError: 0, and it matches your output, where test starts at 1831.

The simplest solution is to use the positional index rather than the label. So instead of something like test[0][i], use test[0].iloc[i]. That will give the value based on position. Alternatively, renumber the rows after the split with test = test.reset_index(drop=True), and your original loop will work unchanged.
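Here is a minimal sketch of the label-versus-position distinction. It uses a small made-up DataFrame in place of sts_gold_tweet.csv, so the data, the 0.6 split ratio, and the names lookup_failed, first_by_position, test_reset and first_after_reset are illustrative assumptions, and text.upper() merely stands in for wordutility.tweet_to_wordlist:

```python
import pandas as pd

# Hypothetical stand-in for the tweets file:
# column 0 holds the text, column 1 holds the sentiment label.
data = pd.DataFrame({0: ["a b", "c d", "e f", "g h", "i j"],
                     1: ["negative"] * 5})

split = round(len(data) * 0.6)
train = data[:split]
test = data[split:]

# Slicing keeps the original row labels, so test does not start at 0.
print(list(test.index))

# A label-based lookup of 0 therefore fails on the second half.
try:
    test[0][0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Option 1: access by position with .iloc.
first_by_position = test[0].iloc[0]

# Option 2: renumber the rows so labels start at 0 again.
test_reset = test.reset_index(drop=True)
first_after_reset = test_reset[0][0]

# Option 3: skip index arithmetic and iterate over the column values.
# (text.upper() is a placeholder for the real cleaning function.)
testdata = [text.upper() for text in test[0]]
```

Iterating over the column directly (option 3) is often the least error-prone, since it does not depend on how the rows happen to be labelled.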

Upvotes: 2
