Reputation: 257
I am practicing Pandas and have the following task:
Create a list whose elements are the number of columns of each .csv file.
The .csv files are stored in the dictionary directory, keyed by year.
I use a dictionary comprehension, dataframes (again keyed by year), to store the .csv files as pandas dataframes.
directory = {2009: 'path_to_file/data_2009.csv', ... , 2018: 'path_to_file/data_2018.csv'}
dataframes = {year: pandas.read_csv(file) for year, file in directory.items()}
# My Approach 1
columns = [df.shape[1] for year, df in dataframes.items()]
# My Approach 2
columns = [dataframes[year].shape[1] for year in dataframes]
Which way is more "Pythonic"? Or is there a better way to approach this?
Upvotes: 4
Views: 450
Reputation: 2771
import os
import pandas
# Use this to find the files under a directory; filter the list if it contains other files.
target_files = os.listdir('path_to_file/')
columns = list()
for filename in target_files:
    # In your scenario @piRSquared's answer would be more efficient.
    df = pandas.read_csv(os.path.join('path_to_file/', filename))
    columns.append(df.shape[1])
If you want the columns keyed by the year in the filename, you can strip the non-digit characters from the filename and update a dictionary like this:
year = re.sub(r'[^0-9]', '', filename)  # needs "import re"; str.replace does not accept a regex
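A minimal sketch of that idea, assuming the only digits in each filename are the year (as in data_2009.csv) and that every .csv name contains at least one digit. The helper names year_from_filename and build_directory, and the sample listing, are hypothetical and only there for illustration:

```python
import os
import re


def year_from_filename(filename):
    """Return the digits in the filename as an int (assumed to be the year)."""
    digits = re.sub(r'[^0-9]', '', filename)
    return int(digits)


def build_directory(folder, filenames):
    """Map each year to the path of its .csv file, skipping other files."""
    return {
        year_from_filename(name): os.path.join(folder, name)
        for name in filenames
        if name.endswith('.csv')
    }


# Hypothetical listing, standing in for os.listdir('path_to_file/').
sample = ['data_2009.csv', 'data_2010.csv', 'notes.txt']
directory = build_directory('path_to_file', sample)
```

The resulting dictionary has the same shape as the directory dictionary in the question, so the comprehensions there can be reused unchanged.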
Upvotes: 2
Reputation: 192
Your Approach 2:
columns = [dataframes[year].shape[1] for year in dataframes]
is more Pythonic and concise, and it fits later use of dataframes in merging, plotting, manipulating, etc., since the keys are implied in the comprehension and shape[1] gives the number of columns.
Upvotes: 4
Reputation: 149
You could use:
columns = [len(dataframe.columns) for dataframe in dataframes.values()]
As @piRSquared mentioned, if your only objective is to get the number of columns, you shouldn't read the entire csv file; instead, use the nrows keyword argument of the read_csv function.
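A rough sketch of the nrows suggestion, assuming each file has a header row. Passing nrows=0 asks read_csv to parse the header only, so no data rows are loaded. The function name count_columns_pandas and the temporary demo file are made up for illustration:

```python
import os
import tempfile

import pandas


def count_columns_pandas(path):
    """Read only the header (nrows=0) and return the number of columns."""
    return len(pandas.read_csv(path, nrows=0).columns)


# Small self-contained demonstration with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'data_2009.csv')
    with open(demo_path, 'w') as handle:
        handle.write('a,b,c\n1,2,3\n4,5,6\n')
    demo_count = count_columns_pandas(demo_path)
```

With the question's dictionary this would become columns = [count_columns_pandas(f) for f in directory.values()].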
Upvotes: 3
Reputation: 294228
Your method will get it done... but I don't like reading in the entire file and creating a dataframe just to count the columns. You could do the same thing by just reading the first line of each file and counting the number of commas. Notice that I add 1 because there will always be one less comma than there are columns.
columns = [open(f).readline().count(',') + 1 for _, f in directory.items()]
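Counting commas works when no header field contains a quoted comma. As a hedged variant of the same first-line idea, the sketch below parses the header with the standard csv module and closes each file explicitly. It assumes the files are non-empty; the name count_columns_csv and the demo file are hypothetical:

```python
import csv
import os
import tempfile


def count_columns_csv(path):
    """Count columns by parsing only the header row of a .csv file."""
    with open(path, newline='') as handle:
        header = next(csv.reader(handle))
    return len(header)


# Small self-contained demonstration with a temporary file.
# The second header field contains a comma inside quotes.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'data_2009.csv')
    with open(demo_path, 'w', newline='') as handle:
        handle.write('id,"last, first",score\n1,"Doe, Jane",9\n')
    demo_count = count_columns_csv(demo_path)
```

Here the header has three fields, whereas counting commas and adding 1 would report four. Applied to the question's dictionary: columns = [count_columns_csv(f) for f in directory.values()].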
Upvotes: 4