Reputation: 13
I am assigning dtypes after importing data from CSV files because of the nature of the data's initial formatting, and I currently have an array of dtype dictionaries that I apply as I go through each CSV file.
Example:
dtypes = [{"ID": str, "Place": str, "Speed":float, "Major Street": str},
{"ID": str, "Place": str, "Speed":float, "Roy's Boulevard": str},
{"ID": str, "Place": str, "Speed":float, "Cassandra Park": str}]
I'd prefer to keep this array in a different file so that I don't clutter up my code with a huge array later on. Should I put it in a separate Python file that I import, or is there a way to split/strip it from a CSV?
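For example, something like this is what I had in mind (the module name dtypes_config.py is just a placeholder):

# dtypes_config.py -- a separate module holding only the dtype dicts
dtypes = [{"ID": str, "Place": str, "Speed": float, "Major Street": str},
          {"ID": str, "Place": str, "Speed": float, "Roy's Boulevard": str},
          {"ID": str, "Place": str, "Speed": float, "Cassandra Park": str}]

# main script
from dtypes_config import dtypes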
Additional code to show what I'm trying to accomplish with these dictionaries:
import pandas as pd
from tqdm import tqdm as tq  # assuming tq is an alias for tqdm

for n in tq(range(4)):
    df = pd.read_csv('S:\\Energy\\' + meterfile[n] + '.csv',
                     parse_dates=[['Date', 'Time']], skiprows=[1],
                     usecols=dictionary[n])
    # Truncate at the first row whose combined Date_Time failed to parse
    first_nan = df[df.Date_Time == 'nan nan'].first_valid_index()
    df = df[:first_nan]
    df = df.fillna(0)  # fillna returns a new frame; assign it back
    df = df.astype(dtypes[n])
    df.to_csv('S:\\Energy\\Database Progress\\CleanedWorksheets\\' +
              meterfile[n] + '.csv', index=False)
Upvotes: 0
Views: 1736
Reputation: 33970
Options:
1) Don't maintain a per-file dtype array at all. Columns default to string (object), and the only non-string column in your example is 'Speed': float, so you can specify just the shared columns once and fix up anything misread after the read:
import pandas as pd

df = pd.read_csv(..., dtype={'ID': 'object', 'Place': 'object', 'Speed': float}, ...)
# Then fix up any dtype after the read ('misread_col' is a placeholder name):
df['misread_col'] = df['misread_col'].astype(...)
See the pandas documentation for the dtype parameter of read_csv.
OR:
2) Specify the dtype for each column at read-time: pd.read_csv(..., dtype={...}, ...), where dtype is a dict of 'name': dtype. This requires you to know the column names in advance (or maybe use regexes to assign dtypes based on name).
And if you want to programmatically read in the header row of each CSV file and map it to dtypes:
import glob
import pandas as pd

for f in glob.glob('*.csv'):
    # First, read only the header row, to get the column names
    df = pd.read_csv(f, nrows=0)  # nrows=0 reads just the header
    # print(df.columns)
    # Here you can process df.columns, add it into a dtype_dict, whatever
    # Second, reread the entire file with the dtypes you built
    df = pd.read_csv(f, dtype=...)
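For instance, the "process df.columns" step could be a simple name-based lookup. This is only a sketch; known_dtypes and the 'object' fallback are assumptions based on the example columns in the question:

import glob
import pandas as pd

# Columns shared by every file; anything unrecognized falls back to string
known_dtypes = {'ID': 'object', 'Place': 'object', 'Speed': float}

for f in glob.glob('*.csv'):
    cols = pd.read_csv(f, nrows=0).columns  # header row only
    dtype_dict = {c: known_dtypes.get(c, 'object') for c in cols}
    df = pd.read_csv(f, dtype=dtype_dict)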
Honestly, doing 1) could be less grief than 2); it depends. Just read everything in, tweak your dtype specifiers, and run it again. If you're debugging this, just read the first chunk of each file with, say, nrows=100.
Upvotes: 1