Reputation: 373
I get the following error after updating sklearn to a newer version, and I don't know why.
Traceback (most recent call last):
  File "/Users/X/Courses/Project/SupportVectorMachine/main.py", line 95, in <module>
    y, x = dmatrices(formula, data=finalDataFrame, return_type='matrix')
  File "/Library/Python/2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
    NA_action, return_type)
  File "/Library/Python/2.7/site-packages/patsy/highlevel.py", line 156, in _do_highlevel_design
    return_type=return_type)
  File "/Library/Python/2.7/site-packages/patsy/build.py", line 947, in build_design_matrices
    value, is_NA = evaluator.eval(data, NA_action)
  File "/Library/Python/2.7/site-packages/patsy/build.py", line 85, in eval
    return result, NA_action.is_numerical_NA(result)
  File "/Library/Python/2.7/site-packages/patsy/missing.py", line 135, in is_numerical_NA
    mask |= np.isnan(arr)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule 'safe'
Here is the corresponding code. I have reinstalled everything, from NumPy and SciPy to patsy, but nothing works.
# Merging the two dataframes - user and the tweets
finalDataFrame = pandas.merge(twitterDataFrame.reset_index(),twitterUserDataFrame.reset_index(),on=['UserID'],how='inner')
finalDataFrame = finalDataFrame.drop_duplicates()
finalDataFrame['FrequencyOfTweets'] = numpy.all(numpy.isfinite(finalDataFrame['FrequencyOfTweets']))
# Model formula: ~ separates the response from the predictors, and C() marks a column as categorical.
formula = 'Classifier ~ InReplyToStatusID + InReplyToUserID + RetweetCount + FavouriteCount + Hashtags + UserMentionID + URL + MediaURL + C(MediaType) + UserMentionID + C(PossiblySensitive) + C(Language) + TweetLength + Location + Description + UserAccountURL + Protected + FollowersCount + FriendsCount + ListedCount + UserAccountCreatedAt + FavouritesCount + GeoEnabled + StatusesCount + ProfileBackgroundImageURL + ProfileUseBackgroundImage + DefaultProfile + FrequencyOfTweets'
### Create regression-friendly design matrices: y holds the class labels, x holds the features, with categorical columns expanded into separate indicator columns.
y, x = dmatrices(formula, data=finalDataFrame, return_type='matrix')
## select which features we would like to analyze
X = numpy.asarray(x)
Upvotes: 0
Views: 100
Reputation: 373
After a lot of digging through the code, I found the problem: the formula I was passing told patsy to use all the features below, and the 'UserAccountCreatedAt' column is of type datetime64[ns]. I have taken that column out of the formula for now and get no errors. However, I would like to know how best to convert it to numeric data so that I can actually pass it through. Categorical data is handled by the C() in front of some of the columns, as seen below, but patsy treats a datetime column as numeric.
formula = 'Classifier ~ UserAccountCreatedAt + InReplyToStatusID + InReplyToUserID + RetweetCount + FavouriteCount + Hashtags + UserMentionID + URL + MediaURL + C(MediaType) + UserMentionID + C(PossiblySensitive) + C(Language) + TweetLength + Location + Description + UserAccountURL + Protected + FollowersCount + FriendsCount + ListedCount + FavouritesCount + GeoEnabled + StatusesCount + ProfileBackgroundImageURL + ProfileUseBackgroundImage + DefaultProfile + FrequencyOfTweets'
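One possible way to turn the datetime column into a plain float, as a rough sketch rather than a definitive fix: express each timestamp as the number of days since some reference date. The sample frame, the column name AccountAgeDays, and the choice of the earliest date as reference are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for finalDataFrame with a datetime64[ns] column.
df = pd.DataFrame({
    "UserAccountCreatedAt": pd.to_datetime(
        ["2010-01-01", "2012-06-15", "2014-03-30"]
    ),
})

# Arbitrary reference point (an assumption): the earliest account in the data.
reference = df["UserAccountCreatedAt"].min()

# Subtracting two datetimes gives a timedelta; convert it to float days.
elapsed = df["UserAccountCreatedAt"] - reference
df["AccountAgeDays"] = elapsed.dt.total_seconds() / 86400.0

print(df["AccountAgeDays"].dtype)
```

The formula would then use AccountAgeDays in place of UserAccountCreatedAt, so that patsy only ever sees a float column.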
Upvotes: 0
Reputation: 2463
I've seen that error crop up when calling np.isnan on an array that contains strings or other non-float values. Try casting your arrays with arr.astype(float) before passing them to dmatrices.
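To find which columns need casting, something along these lines may help. It is only a sketch: the small frame below is a made-up stand-in for finalDataFrame, and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for finalDataFrame; the values are made up.
df = pd.DataFrame({
    "RetweetCount": ["3", "0", "12"],   # numbers stored as strings
    "FollowersCount": [10, 250, 7],     # already numeric
    "UserAccountCreatedAt": pd.to_datetime(
        ["2010-01-01", "2012-06-15", "2014-03-30"]
    ),
})

# Columns without a numeric dtype are the likely culprits for np.isnan.
non_numeric = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]
print(non_numeric)

# Cast a column of numeric strings to float before calling dmatrices.
df["RetweetCount"] = df["RetweetCount"].astype(float)
```

Note that astype(float) only works when every value in the column can be parsed as a number; free-text columns would need to be dropped or wrapped in C() instead.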
Also, your FrequencyOfTweets column is being set to all False or all True, since np.all returns a single scalar rather than one value per row.
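A small illustration of the difference, with invented values. I'm assuming the intent was to keep only rows with a finite FrequencyOfTweets, which may not be what you wanted.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing an infinity and a NaN.
df = pd.DataFrame({"FrequencyOfTweets": [0.5, np.inf, np.nan, 2.0]})

# np.all collapses the whole column into one boolean, so assigning it
# back would overwrite every row with the same True/False value.
collapsed = np.all(np.isfinite(df["FrequencyOfTweets"]))

# np.isfinite on its own gives one boolean per row, usable as a filter.
mask = np.isfinite(df["FrequencyOfTweets"])
cleaned = df[mask]

print(collapsed)
print(cleaned)
```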
Upvotes: 1