Python TfidfVectorizer raises TypeError: expected string or bytes-like object on CSV file

I am analyzing a very large CSV file and trying to extract tf-idf information from it using scikit-learn. Unfortunately, processing never finishes because it raises this TypeError. Is there a way to programmatically alter the CSV data to eliminate the error? Here is my code:

    df = pd.read_csv("C:/Users/aidan/Downloads/papers/papers.csv", sep=None)
    df = df[pd.notnull(df)]

    n_features = 1000
    n_topics = 8
    n_top_words = 10
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english', lowercase=False)

    tfidf = tfidf_vectorizer.fit_transform(df['paper_text'])

The error is raised from the last line. Thank you in advance!

    Traceback (most recent call last):
      File "C:\Users\aidan\NIPS Analysis 2.0.py", line 35, in <module>
        tfidf = tfidf_vectorizer.fit_transform(df['paper_text'])
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 1352, in fit_transform
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
        self.fixed_vocabulary_)
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab
        for feature in analyze(doc):
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 241, in <lambda>
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 216, in <lambda>
        return lambda doc: token_pattern.findall(doc)
    TypeError: expected string or bytes-like object

Upvotes: 2

Answers (3)

Hamed Baziyad

Reputation: 2019

Read the file like this:

    df = pd.read_csv("C:/Users/aidan/Downloads/papers/papers.csv", dtype=str)

The vectorizer expects every element it receives to be a string.
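As a rough sketch of how this plays out, the snippet below reads a small made-up CSV (a stand-in for papers.csv, with invented rows) using dtype=str. One caveat worth hedging: as far as I know, dtype=str does not turn empty cells into strings, so they still come back as NaN and need to be replaced before vectorizing. The default TfidfVectorizer settings are used here only because the sample is tiny.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in for papers.csv; the second row has an empty paper_text.
csv_data = io.StringIO(
    "id,year,paper_text\n"
    "1,1987,Self-organization of neurons\n"
    "2,1988,\n"
    "3,1989,Kernel methods for neurons\n"
)

# dtype=str keeps numeric-looking cells such as year as strings...
df = pd.read_csv(csv_data, dtype=str)
first_year = df["year"].iloc[0]

# ...but an empty cell is still read as NaN, which is not a string.
missing_before = int(df["paper_text"].isna().sum())

# Replace the remaining NaN values with empty strings before vectorizing.
texts = df["paper_text"].fillna("")

tfidf = TfidfVectorizer().fit_transform(texts)
print(missing_before, tfidf.shape)
```

Whether to replace missing texts with an empty string or drop those rows with dropna depends on whether the row count has to stay aligned with the rest of the data.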

Upvotes: 0

Vadim

Reputation: 4529

In my case the problem was NaNs in the DataFrame. Replacing them fixed it. Note that fillna returns a new DataFrame, so the result has to be assigned back:

    df = df.fillna('0')
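To see why a NaN produces this exact message: the last traceback frame calls token_pattern.findall(doc), and a regex cannot search a float. The sketch below reproduces that with the standard library only. The pattern is an assumption written to match scikit-learn's documented default token_pattern, and the sample documents are invented.

```python
import math
import re

# Assumed to match scikit-learn's documented default token_pattern.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Invented documents; the NaN mimics a missing CSV cell.
docs = ["neural networks learn features", float("nan"), "kernel methods"]


def tokenize_all(documents):
    """Apply the token pattern to every document, as the vectorizer does."""
    return [token_pattern.findall(doc) for doc in documents]


# The NaN is a float, so findall raises the TypeError from the question.
try:
    tokenize_all(docs)
    error = None
except TypeError as exc:
    error = str(exc)

# Replacing NaN with a string lets every document through.
cleaned = ["" if isinstance(d, float) and math.isnan(d) else d for d in docs]
tokens = tokenize_all(cleaned)
print(error)
print(tokens)
```

Replacing with '0' as above also works. Note that the assumed default pattern only keeps tokens of two or more characters, so a lone '0' contributes no token either.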

Upvotes: 1

neox

Reputation: 81

Have you checked df.dtypes? What's the output?

You could try to add dtype=str as an argument to the .read_csv() call.
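One thing to keep in mind when checking df.dtypes: a text column that also holds NaN or stray numbers typically still shows up as object, so the column dtype alone may not reveal the problem. A minimal sketch of a per-element check, using an invented DataFrame in place of the real data:

```python
import numpy as np
import pandas as pd

# Invented data: two real texts, one missing value, one stray number.
df = pd.DataFrame({"paper_text": ["first paper", np.nan, "third paper", 42]})

# The column-level dtype does not say which rows are the problem.
print(df.dtypes)

# Check each element individually instead.
is_text = df["paper_text"].map(lambda value: isinstance(value, str))

bad_rows = df[~is_text]   # rows that would break the vectorizer
clean = df[is_text]       # rows that are safe to pass to fit_transform

print(bad_rows)
```

Looking at bad_rows first makes it easier to decide whether to drop those rows, fill them, or convert them with astype(str).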

Upvotes: 2
