Reputation: 646
As I was experimenting with pandas, I noticed some odd behavior of pandas.read_csv and was wondering if someone with more experience could explain what might be causing it.
To start, here is my basic class definition for creating a new pandas.DataFrame from a .csv file:
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath                           # File path to the target .csv file.
        self.csvfile = open(filepath)                  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)
Now, this works pretty well, and calling the class in my __main__.py successfully creates a pandas DataFrame:
from dataMatrix import dataMatrix
testObject = dataMatrix('/path/to/csv/file')
But I noticed that this process automatically sets the first row of the .csv as the DataFrame's column index (pandas.DataFrame.columns). Instead, I wanted to number the columns. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a DataFrame, counting the columns, and then reloading the DataFrame with the proper column names using range().
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)
        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))
Keeping my processing in __main__.py the same, I got back a DataFrame with the correct number of columns (500 in this case) and proper names (0...499), but it was otherwise empty (no row data).
Scratching my head, I decided to close self.csvfile and reload it like so:
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)
        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Close the .csv file.                    #<---- +++++++
        self.csvfile.close()                      #<---- Added
        # Re-open file.                           #<---- Block
        self.csvfile = open(filepath)             #<---- +++++++
        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))
Closing the file and re-opening it correctly returned a DataFrame with columns numbered 0...499 and all 255 subsequent rows of data.
My question is: why does closing the file and re-opening it make a difference?
Upvotes: 8
Views: 9001
Reputation: 879133
When you open a file with
open(filepath)
a file handle is returned, and a file handle is an iterator. An iterator is good for only one pass through its contents. So
self.csvdataframe = pd.read_csv(self.csvfile)
reads the contents and exhausts the iterator. Subsequent calls to pd.read_csv then find the iterator empty.
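You can see the exhaustion directly. Here is a minimal, self-contained sketch that uses io.StringIO as a stand-in for the open .csv file (a StringIO object behaves like an open text-file handle):
import io
import pandas as pd

# Stand-in for open(filepath): a small in-memory "csv file".
f = io.StringIO("a,b\n1,2\n3,4\n")
df = pd.read_csv(f)        # first read consumes the stream to its end
print(repr(f.read()))      # '' -- the stream is exhausted, like the real file handle
Depending on the pandas version, a second pd.read_csv on the exhausted handle either returns an empty DataFrame (the symptom seen in the question) or raises pandas.errors.EmptyDataError.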
Note that you could avoid this problem by just passing the file path to pd.read_csv:
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(filepath,
                                        names=range(self.numcolumns))
pd.read_csv will then open (and close) the file for you.
PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is still easier.
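For completeness, here is a sketch of what that seek(0) variant would look like in the original class (an alternative, not the recommended fix):
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)
        self.csvdataframe = pd.read_csv(self.csvfile)
        self.numcolumns = len(self.csvdataframe.columns)
        # Rewind the exhausted handle to the start of the file
        # instead of closing and re-opening it.
        self.csvfile.seek(0)
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))
        self.csvfile.close()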
Even better, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns like this:
import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        # Load the .csv file once; the first row is consumed as the header.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns, then rename them 0..N-1 in place.
        self.numcolumns = len(self.csvdataframe.columns)
        self.csvdataframe.columns = range(self.numcolumns)
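Usage from __main__.py then stays exactly as in the question (the path below is the question's placeholder):
from dataMatrix import dataMatrix

testObject = dataMatrix('/path/to/csv/file')  # placeholder path from the question
print(testObject.csvdataframe.columns)        # numbered columns: 0..N-1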
Upvotes: 8