Reputation: 528
The goal of my project is to predict the accuracy level of some textual descriptions.
I built the vectors with fastText.
TSV output:
0 1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126 ...
1 1:0.003118149 2:-0.015105667 3:0.040879637 4:0.000539902 ...
Resources are labeled as Good (1) or Bad (0).
To check the accuracy I used scikit-learn and an SVM classifier.
Following this tutorial, I wrote this script:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
r_filenameTSV = 'TSV/A19784.tsv'
tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(' ', 1).tolist(),
                  columns=['label', 'vector'])
print ("Features:" , df.vector)
print ("Labels:" , df.label)
X_train, X_test, y_train, y_test = train_test_split(df.vector, df.label, test_size=0.2,random_state=0)
#Create a svm Classifier
clf = svm.SVC(kernel='linear')
#Train the model using the training sets
clf.fit (str((X_train, y_train)))
#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
The first time I tried to run the script, I got this error on line 28:
ValueError: could not convert string to float:
So I changed from
clf.fit (X_train, y_train)
to
clf.fit (str((X_train, y_train)))
Then, on the same line, I got this error:
TypeError: fit() missing 1 required positional argument: 'y'
Suggestions on how to solve this issue?
Kind regards and thanks for your time.
Upvotes: 5
Views: 31379
Reputation: 1571
As mentioned in the comments below your question, your features and your labels are presumably strings. However, sklearn requires them to be numeric (sklearn is normally used with numpy arrays). If that is the case, you have to convert the elements of your dataframe from strings to numeric values.
Looking at your code, I assume that each element of your feature column is a list of strings and each element of your label column is a single string. Here is an example of how such a dataframe can be converted so that it contains numeric values:
import numpy as np
import pandas as pd

# Toy dataframe in the assumed format: features are lists of strings,
# labels are single strings.
df = pd.DataFrame({'features': [['5', '4.2'], ['3', '7.9'], ['2', '9']],
                   'label': ['1', '0', '0']})

print(type(df.features[0][0]))  # <class 'str'>
print(type(df.label[0]))        # <class 'str'>

def convert_to_float(collection):
    floats = [float(el) for el in collection]
    return np.array(floats)

# Convert each list of strings to a numpy array of floats and the
# label column to a numeric dtype.
df_numeric = pd.concat([df["features"].apply(convert_to_float),
                        pd.to_numeric(df["label"])],
                       axis=1)

print(type(df_numeric.features[0][0]))  # now a numpy float
print(type(df_numeric.label[0]))        # now a numpy integer
However, the described dataframe format is not the format sklearn models expect a pandas dataframe to have. As far as I know, sklearn models expect each feature to be stored in a separate column, as is the case here:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One column per feature, plus a separate label column.
feature_df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["feature_1", "feature_2"])
label_df = pd.DataFrame(np.array([[1], [0], [0]]), columns=["label"])
df = pd.concat([feature_df, label_df], axis=1)

X_train, X_test, y_train, y_test = train_test_split(df.drop(["label"], axis=1), df["label"], test_size=1 / 3)

clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
clf.predict(X_test)
That is, after converting your dataframe so that it only contains numeric values, you would have to create a separate column for each element in the lists of your feature column. You could do that like this:
# Stack the per-row feature arrays into a 2-D array (one row per sample).
arr = np.concatenate(df_numeric.features.to_numpy()).reshape(len(df_numeric), -1)
df_sklearn_compatible = pd.concat([pd.DataFrame(arr, columns=["feature_1", "feature_2"]),
                                   df_numeric["label"]],
                                  axis=1)
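For your own data, a minimal sketch of the full pipeline could look like the following. It assumes the line layout shown in your question (label first, then space-separated index:value pairs produced by fastText) and simply reuses the path 'TSV/A19784.tsv' and the read/split logic from your script; adjust both if your file differs.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics

# Read each line into a single string column, then split off the label.
# (Assumes the file layout shown in the question.)
raw = pd.read_csv('TSV/A19784.tsv', sep='\t', names=["vector"])
parts = raw.vector.str.split(' ', n=1, expand=True)
labels = pd.to_numeric(parts[0])

# Turn "1:0.0033 2:-0.0218 ..." into a row of floats, dropping the index prefixes.
def parse_vector(s):
    return np.array([float(chunk.split(':')[1]) for chunk in s.split()])

features = np.vstack(parts[1].apply(parse_vector).tolist())
feature_df = pd.DataFrame(features)  # one column per fastText dimension

X_train, X_test, y_train, y_test = train_test_split(feature_df, labels,
                                                    test_size=0.2, random_state=0)

clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Since lines like 0 1:0.0033524514 2:-0.021896651 ... look like the svmlight/libsvm format, sklearn.datasets.load_svmlight_file might also be able to read the file directly into a (sparse) feature matrix and a label array, which would replace the manual parsing above.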
Upvotes: 3