Reputation: 3
# -*- coding: utf-8 -*-
"""
Created on Sun Jun 3 01:36:10 2018
@author: Sharad
"""
import numpy as np
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
dbfile=open("D:/df_train_api.pk", 'rb')
df=pickle.load(dbfile)
y=df[['label']]
features=['groups']
X=df[features].copy()
X.columns
y.columns
#for spiliting into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
#for vectorizing
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)
The problem lies in the vectorisationg as it gives me X_train_counts of size [1,1]. I don't know why. And that's why MultinomialNB can't perform the action as y_train is of size [1, 3185]. I'm new to machine learning. Any help would be much appreciated.
traceback:
Traceback (most recent call last):
File "<ipython-input-52-5b5949203f76>", line 1, in <module>
runfile('C:/Users/Sharad/.spyder-py3/hypothizer.py', wdir='C:/Users/Sharad/.spyder-py3')
File "C:\Users\Sharad\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Sharad/.spyder-py3/hypothizer.py", line 37, in <module>
clf = MultinomialNB().fit(X_train_tfidf, y_train)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 579, in fit
X, y = check_X_y(X, y, 'csr')
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in check_X_y
check_consistent_length(X, y)
File "C:\Users\Sharad\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 204, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 3185]
Upvotes: 0
Views: 1015
Reputation: 36599
CountVectorizer (and by inheritence, TfidfTransformer and TfidfVectorizer) expects an iterable of raw documents in fit()
and fit_transform()
:
raw_documents : iterable An iterable which yields either str, unicode or file objects.
So internally it will do this:
for doc in raw_documents:
do_processing(doc)
When you pass a pandas DataFrame object in it, only the column names will be yielded by the for ... in X
. And hence only a single document is processed (instead of data inside that column).
You need to do this:
X = df[features].values().ravel()
Or else do this:
X=df['groups'].copy()
There is a difference in the code above and the code you are doing. You are doing this:
X=df[features].copy()
Here features
is already a list of columns. So essentially this becomes:
X=df[['groups']].copy()
The difference is in the double brackets here (which return a dataframe) and single bracket in my code (which returns a Series).
for value in X
works as expected when X is a series, but only returns column names when X is a dataframe.
Hope this is clear.
Upvotes: 1