Reputation: 65
I know very little about Python's pandas module. I need to create a DataFrame and store it in a .csv file for my project. I am using the to_csv and read_csv functions. However, when I compared the two frames (the one before exporting and the imported one), I got different results. This is a minimal reproducible example:
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
documents = []
documents.append("i love python")
documents.append("foo bar")
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
X = X.T.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(X, index=vectorizer.get_feature_names_out())
df.to_csv(path_or_buf = "db.csv")
df1 = pd.read_csv("db.csv")
print(df.axes)
print()
print(df1.axes)
And this is what is printed:
[Index(['bar', 'foo', 'love', 'python'], dtype='object'), RangeIndex(start=0, stop=2, step=1)]
[RangeIndex(start=0, stop=4, step=1), Index(['Unnamed: 0', '0', '1'], dtype='object')]
How can I make the DataFrame imported from a .csv file identical to the original one?
Upvotes: 0
Views: 565
Reputation: 339
UPDATE: Give the index of the DataFrame you are exporting a name, and pass that name as index_col when reading the exported CSV. Here I am using vectors as the index name:
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
documents = []
documents.append("i love python")
documents.append("foo bar")
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
X = X.T.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(X, index=vectorizer.get_feature_names_out())
df.index.name = 'vectors'
df.to_csv(path_or_buf="db.csv")
df1 = pd.read_csv("db.csv",index_col='vectors')
print(df)
print()
print(df1)
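Even with the index restored, the column labels can still differ: the original frame has integer columns 0 and 1, while read_csv returns them as the strings '0' and '1'. The following is a minimal sketch of a full round trip, not part of the original answer. The helper name round_trip and the sample values are made up for illustration, and it assumes the index is written as the first CSV column and that the column labels can be cast back to the original dtype.

```python
import io

import pandas as pd


def round_trip(df):
    """Write df to CSV text and read it back (hypothetical helper)."""
    buf = io.StringIO()
    df.to_csv(buf)  # the index is written as the first column
    buf.seek(0)
    # index_col=0 turns the first column back into the index;
    # float_precision="round_trip" asks the parser to reproduce floats exactly.
    restored = pd.read_csv(buf, index_col=0, float_precision="round_trip")
    # Column labels come back as strings; cast them to the original dtype.
    restored.columns = restored.columns.astype(df.columns.dtype)
    return restored


# Small stand-in for the TF-IDF matrix (values are illustrative only).
df = pd.DataFrame([[0.0, 0.5], [1.0, 0.25]], index=["bar", "foo"])
df1 = round_trip(df)

# Raises AssertionError if the frames differ in values, index or columns.
pd.testing.assert_frame_equal(df, df1)
```

Using index_col=0 avoids having to name the index first; passing index_col='vectors' as above works equally well once the index has that name.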
Old answer: Try exporting the CSV without the index by setting index to False:
df.to_csv(path_or_buf="db.csv", index=False)
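Note that index=False leaves the row labels out of the file entirely, so the imported frame gets a default RangeIndex and the feature names cannot be recovered from the CSV. A small sketch of that effect, using made-up sample values in place of the TF-IDF matrix:

```python
import io

import pandas as pd

# Stand-in frame with string row labels (values are illustrative only).
df = pd.DataFrame([[0.0, 0.5], [1.0, 0.25]], index=["bar", "foo"])

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)  # only the column header and the values, no row labels

restored = pd.read_csv(io.StringIO(csv_text))
print(restored.index)  # a default RangeIndex, not ['bar', 'foo']
```

This is why the updated answer keeps the index in the file and restores it with index_col.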
Upvotes: 1