Reputation: 945
Sorry for the vague title, but I don't really know what the problem is. I want to load a CSV file, split it into two arrays, and run a function on each of those arrays. It works for the first array, but the second one fails even though everything else is the same. I'm really stuck. The code is as follows:
from wordutility import wordutility
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np
data = pd.read_csv('sts_gold_tweet.csv', header=None, delimiter=';',
                   quotechar='"')
# test = pd.read_csv('output.csv', header=None,
# delimiter=';', quotechar='"')
split_ratio = 0.9
train = data[:round(len(data)*split_ratio)]
test = data[round(len(data)*split_ratio):]
y = data[1]
print("Cleaning and parsing tweets data...\n")
traindata = []
for i in range(0, len(train[0])):
    traindata.append(" ".join(wordutility.tweet_to_wordlist(
        train[0][i], False)))
testdata = []
for i in range(0, len(test[0])):
    testdata.append(" ".join(wordutility.tweet_to_wordlist(test[0][i], False)))
The program works up until the very last line. The error is:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python3.4/site-packages/pandas/core/series.py", line 509, in __getitem__
result = self.index.get_value(self, key)
File "/usr/lib/python3.4/site-packages/pandas/core/index.py", line 1417, in get_value
return self._engine.get_value(s, k)
File "pandas/index.pyx", line 100, in pandas.index.IndexEngine.get_value (pandas/index.c:3097)
File "pandas/index.pyx", line 108, in pandas.index.IndexEngine.get_value (pandas/index.c:2826)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3692)
File "pandas/hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7201)
File "pandas/hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7139)
KeyError: 0
(It says line 2 in the traceback because I was trying the code in the Python shell, so line 2 refers to the last line of the code above.)
Hopefully someone can help me :). Thanks
EDIT
Ok, it seems the splitting is not working the way I thought it would. I did get two arrays as I wanted, but the row numbers are still as if it were one file: the array train runs from 0 to 1830 and the array test from 1831 to 2034, so the range in my loop was wrong. How would I go about splitting up the CSV file "correctly"?
EDIT 2
>>> print(train[0:5])
                                                   0         1
0  the angel is going to miss the athlete this we...  negative
1  It looks as though Shaq is getting traded to C...  negative
2      @clarianne APRIL 9TH ISN'T COMING SOON ENOUGH  negative
3  drinking a McDonalds coffee and not understand...  negative
4  So dissapointed Taylor Swift doesnt have a Twi...  negative
>>> print(test[0:5])
                                                      0         1
1831   Why is my PSP always dead when I want to use it?  negative
1832  @hillaryrachel oh i know how you feel. i took ...  negative
1833   @daveknox awesome- corporate housing took awa...  negative
1834    @lakersnation Is this a joke? I can't find them  negative
1835                               XBox Live still down  negative
So as you can see the array "test" starts at the line number 1831. I would've thought it would start at 0... I fixed my problem now by editing the range in the for loop
for i in range(len(train[0]), len(data)):
So my original problem is fixed, I'm just curious and eager to learn to write better code. Is this an ok thing to do or should I split the csv file in a different way?
Upvotes: 4
Views: 7567
Reputation: 10298
When you do test[0], you are not getting the first row of test; you are getting the column of test with the "name" 0. That part works, because both halves of the split still have columns 0 and 1. The failure comes one step later, in test[0][i]: indexing that column with an integer looks up the row label i, not the i-th position. When you split the pandas DataFrame in two by slicing its rows, the original row labels were preserved. This means that the test DataFrame has no row labelled 0, since that row is in the first DataFrame.
Let me give you an example. Say you have the following DataFrame:
    0   1
0   0   1
1  10  11
2  20  21
3  30  31
When you split it after the second row, you end up with these DataFrames:
    0   1
0   0   1
1  10  11
and:
    0   1
2  20  21
3  30  31
Notice that the row labels of the second DataFrame start with 2, not 0, because those were the row labels before the split. So when you try to get row 0 of one of its columns, it isn't there. That is the source of your KeyError: 0.
The simplest solution would just be to use the position, rather than the row label. So instead of something like test[0][i], use test[0].iloc[i]. That will give the value based on positional index. Alternatively, call test = test.reset_index(drop=True) right after the split so that the labels start at 0 again.
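To make the label-versus-position point concrete, here is a minimal sketch. It assumes a tiny hand-made DataFrame in place of sts_gold_tweet.csv, and it uses str.split as a hypothetical stand-in for wordutility.tweet_to_wordlist, which is not shown in the question. It slices the rows the same way the question does, shows that the second half keeps its original row labels, and then tries three ways around that: positional access with .iloc, renumbering with reset_index, and iterating over the column directly.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: column 0 holds text, column 1 a label.
data = pd.DataFrame({0: ["a b", "c d", "e f", "g h", "i j"],
                     1: ["negative"] * 5})

split = round(len(data) * 0.6)   # 3 rows for train, 2 for test
train = data[:split]
test = data[split:]

# The row labels of `test` continue where `train` stopped.
test_labels = list(test.index)

# Looking up the label 0 fails on the second half ...
try:
    test[0][0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# ... while positional lookup works regardless of the labels.
first_by_position = test[0].iloc[0]

# Alternative: renumber the rows so labels and positions agree again.
test_reset = test.reset_index(drop=True)
first_after_reset = test_reset[0][0]

# Iterating over the column avoids indexing altogether.
# str.split is only a placeholder for the real tokenizer.
testdata = [" ".join(text.split()) for text in test[0]]
```

Iterating over the column is probably the least fragile of the three, since it does not depend on how the rows happen to be labelled after the split.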
Upvotes: 2