minjunkim7767

Reputation: 213

Pandas: values become strings after saving a DataFrame to CSV and reading it back

Hi, my problem is as follows:

  1. Compute some vectors.
  2. Put them in a column of a pandas DataFrame (column name is "vector").
  3. Save the DataFrame as a CSV file (test.csv).
  4. Read the saved file back: pd.read_csv("test.csv")
  5. Realize that the vectors are no longer numpy arrays but strings like the one below.

  '[[0.   0.   0.   0.123333.   0.\n    0.]\n
    [0.   0.   0.\n   0.123333.   0.    0.]\n
    [0.   0.222222.   0.   0.333333.   0.    0.]]'

I tried something like this to solve the problem:

    test = pd.read_csv("test.csv")
    np.array(literal_eval(test["vector"][0]))

I get this error:

     File "<unknown>", line 1
        [[0.         0.         0.         0.         0.         0.
                      ^
    SyntaxError: invalid syntax

Here is a download link for the file I use: https://drive.google.com/file/d/1MnJjPb-Gj_44dRXUHbNO64b-Z-wSrHSc/view?usp=sharing

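Code to create the vectors and put them in the DataFrame:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # fit the vectorizer on the example corpus
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectorizer.fit_transform(["example text", "this is the list of words", "like this"]).toarray()

    datadd = [["example text"], ["this is the list of words"], ["like this"]]
    vector = []
    for example in datadd:
        # each entry is a dense numpy array of shape (1, n_features)
        vector.append(tfidf_vectorizer.transform(example).toarray())

    # keep the DataFrame in a variable; to_csv is a DataFrame method, not a pandas function
    df = pd.DataFrame({"vector": vector})
    df.to_csv("test.csv")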

Upvotes: 0

Views: 1262

Answers (2)

Serge Ballesta

Reputation: 149075

A CSV file is a plain text file. Just open it with a text editor like Notepad++, vi, or even Notepad if you are using Windows. That means that what is saved in the CSV file for each cell is just its text representation.

Pandas read_csv is smart enough to recognize floating point and integer values, but not lists, sets or numpy arrays. For date values, the parse_dates parameter can help, but AFAIK nothing comparable exists for numpy arrays. BTW, storing numpy arrays (or lists or other complex objects) in a pandas column is not a very clever idea, because pandas will never be able to use its vectorized methods on them. Long story short, and IMHO, storing complex objects in pandas is misusing the tool.
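As an illustration of keeping only plain numbers in the DataFrame, here is a minimal sketch, assuming all vectors have the same length. Each vector component gets its own numeric column, so read_csv restores floats without any custom parsing. The matrix values and the f0, f1, ... column names are made up for the example, and an in-memory buffer stands in for test.csv.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for a dense tf-idf matrix (one row per document)
matrix = np.array([[0.0, 0.5, 0.25],
                   [1.0, 0.0, 0.125]])

# One numeric column per vector component; the names f0, f1, ... are arbitrary
columns = [f"f{i}" for i in range(matrix.shape[1])]
df = pd.DataFrame(matrix, columns=columns)

# Round-trip through CSV text (a StringIO buffer stands in for test.csv)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Every column is parsed as a float column, so the data comes back as a 2-D array
vectors = restored.to_numpy()
print(restored.dtypes)
print(vectors.shape)
```

With this layout a single vector is just one row of the array, for example vectors[0].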

Unfortunately, I know of no simple way to convert a string representation (as built by str(arr)) back to a numpy array. So if you want to go that way, you will have to write a parser in Python for it and then apply it to the pandas column.
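For what it is worth, such a parser could look roughly like the sketch below. It is not a general solution: it assumes the string comes from str() of a 1-D or 2-D array of floats, it always returns a 2-D array, and the function name parse_array_string is made up for the example. It also cannot recover anything that str(arr) dropped: numpy abbreviates large arrays with "..." and prints only 8 digits of precision by default.

```python
import re

import numpy as np


def parse_array_string(text):
    """Rebuild a 2-D float array from the text produced by str(arr)."""
    if "..." in text:
        # numpy abbreviates large arrays, so the full data is not in the string
        raise ValueError("array text is truncated and cannot be recovered")
    # Each innermost [...] pair holds one row of whitespace-separated numbers;
    # split() also swallows the line breaks numpy inserts inside long rows
    rows = re.findall(r"\[([^\[\]]*)\]", text)
    return np.array([[float(token) for token in row.split()] for row in rows])


original = np.array([[0.0, 0.5, 0.25],
                     [1.0, 0.0, 0.125]])
restored = parse_array_string(str(original))
print(restored)
```

Applied to the column from the question, this would be something like test["vector"].apply(parse_array_string).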

Upvotes: 0

Trenton McKinney

Reputation: 62463

  • vector is a <class 'scipy.sparse.csr.csr_matrix'>

    • convert it to a list before loading it into the dataframe
  • Apply literal_eval to the entire column when reading the file in.

    import pandas as pd
    import numpy as np
    from ast import literal_eval

    # before writing vector to a dataframe
    vector = np.array(vector).tolist()
    df = pd.DataFrame({"vector": vector})
    df.to_csv("test.csv", index=False)

    # after reading the csv file in
    test = pd.read_csv('test.csv', converters={'vector': literal_eval})
    print(type(test.iloc[0, 0]))
    # <class 'list'>
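The cells come back as plain Python lists. If numpy arrays are needed again, as in the question, the lists can be converted after reading. The sketch below is self-contained, so it makes a few assumptions: the two small arrays are made-up stand-ins for the tf-idf vectors, and an in-memory buffer replaces test.csv.

```python
import io
from ast import literal_eval

import numpy as np
import pandas as pd

# Made-up stand-ins for the (1, n_features) arrays built in the question
vector = [np.array([[0.0, 0.5]]), np.array([[1.0, 0.25]])]

# Store nested lists instead of arrays, then write the CSV text
df = pd.DataFrame({"vector": np.array(vector).tolist()})
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# literal_eval turns each cell's text back into a nested list
test = pd.read_csv(buffer, converters={"vector": literal_eval})

# Convert every cell back to a numpy array, then stack them into one matrix
arrays = [np.array(cell) for cell in test["vector"]]
stacked = np.vstack(arrays)
print(type(arrays[0]), arrays[0].shape)
print(stacked.shape)
```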

Upvotes: 1
