Aserian

Reputation: 1127

Pandas iloc wrong index causing problems with subtraction

I should start by saying that I am quite new to pandas and numpy (and machine learning in general).

I am trying to learn some basic machine learning algorithms and am doing linear regression. I have completed this problem using matlab, but wanted to try implementing it in python - as that is a more practically used language. I am having a very difficult time doing basic matrix operations with these libraries and I think it's down to a lack of understanding of how pandas is indexing the dataframe...

I have found several posts talking about the differences between iloc and ix and that ix is being deprecated so use iloc, but iloc is causing me loads of issues. I am simply trying to pull the first n-1 columns out of a dataframe into a new dataframe, then the final column into another dataframe to get my label values. Then I want to perform the cost function one time to see what my current cost is with theta = 0. Currently, my dataset has only one label - but I'd like to code as if I had more. Here is my code and my output:

import os
import numpy as np
import pandas as pd

path = os.getcwd() + '\\ex1data1.txt'
data = pd.read_csv(path, header=None)

numRows = data.shape[0]
numCols = data.shape[1]

X = data.iloc[:,0:numCols-1].copy()
theta = pd.DataFrame(np.zeros((X.shape[1], 1)))
y = data.iloc[:,-1:].copy()

# start computing cost: sum((X*theta - y).^2)
predictions = X.dot(theta)
print("predictions shape: {0}".format(predictions.shape))
print(predictions.head())
print("y shape: {0}".format(y.shape))
print(y.head())

errors = predictions.subtract(y)

print("errors shape: {0}".format(errors.shape))
print(errors.head())

output:

predictions shape: (97, 1)
 0
0  0.0
1  0.0
2  0.0
3  0.0
4  0.0
y shape: (97, 1)
     1
0  17.5920
1   9.1302
2  13.6620
3  11.8540
4   6.8233
errors shape: (97, 2)
0   1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

I can see that predictions and y have the same shape, but when I display them it seems that y keeps its column label of 1 (its original position in the first dataframe) while predictions has the column label 0. As a result, pandas aligns the subtraction on column labels and fills any unmatched entries with NaN. Since y has no column 0, those values are all NaN, and since predictions has no column 1, those are all NaN too, resulting in a 97x2 NaN matrix.
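This alignment behaviour is easy to reproduce in isolation (a minimal sketch with made-up data, not my actual dataset):

```python
# Minimal sketch with made-up data: subtracting two DataFrames aligns
# columns by label, not by position.
import pandas as pd

a = pd.DataFrame({0: [1.0, 2.0]})   # single column, label 0
b = pd.DataFrame({1: [3.0, 4.0]})   # single column, label 1

result = a.subtract(b)              # column labels {0} and {1} don't overlap
print(result.shape)                 # (2, 2): union of both column sets
print(result.isna().all().all())    # True: every cell is NaN
```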

If I use y = data.ix[:,-1:0] - the above code does the correct calculations. Output:

errors shape: (97, 1)
        0
0 -6.1101
1 -5.5277
2 -8.5186
3 -7.0032
4 -5.8598

But I am trying to stay away from ix since it is deprecated.

How do I tell pandas that the new dataframe's columns start at 0, and why is this not the default behavior?

Upvotes: 1

Views: 1274

Answers (1)

Sven Harris

Reputation: 2939

Looks like the calculation you actually want to do is on the series (individual columns). So you should be able to do:

predictions[0].subtract(y[1])

To get the value you want. This looks confusing because your DataFrame columns are labelled with numbers: you are selecting the columns you want (0 and 1) and performing the subtraction between them.

Or, using iloc as you originally suggested, which gives you more matrix-style indexing, you could do this:

predictions.iloc[:, 0].subtract(y.iloc[:, 0])

Because in each DataFrame you want all the rows and the first column.
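Put together, the fix looks like this (a minimal runnable sketch with made-up toy data mirroring the question's layout, not the actual ex1data1.txt file):

```python
# Toy data: features in column 0, labels in column 1 (made-up numbers).
import numpy as np
import pandas as pd

data = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [10.0, 20.0, 30.0]})

X = data.iloc[:, 0:1].copy()        # features, keeps column label 0
theta = pd.DataFrame(np.zeros((X.shape[1], 1)))
y = data.iloc[:, -1:].copy()        # labels, keeps column label 1

predictions = X.dot(theta)          # (3, 1), column label 0

# Positional selection returns Series that align on the row index only,
# so the subtraction ignores the mismatched column labels.
errors = predictions.iloc[:, 0].subtract(y.iloc[:, 0])
print(errors.shape)                 # (3,)
```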

Upvotes: 2
