CoderNumber10002

Reputation: 13

How do I import a list of dtypes from a CSV file?

Because of how the data is initially formatted, I assign dtypes after importing each CSV file. Currently I keep a list of dtype dictionaries and apply the matching one as I go through each CSV file.

Example:

dtypes = [{"ID": str, "Place": str, "Speed":float, "Major Street": str},
          {"ID": str, "Place": str, "Speed":float, "Roy's Boulevard": str},
          {"ID": str, "Place": str, "Speed":float, "Cassandra Park": str}]

I'd prefer to keep this list in a different file so that I don't clutter up my code with a huge literal later on. Should I put it in a separate Python file that I import (something like the sketch below), or is there a way to split/strip it from a CSV?
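For example, the separate-file option I have in mind would look something like this (dtype_specs.py is just a placeholder name):

# dtype_specs.py -- nothing in it but the big list
dtypes = [{"ID": str, "Place": str, "Speed": float, "Major Street": str},
          # ... the other dicts, unchanged ...
         ]

and then the main script would only need:

from dtype_specs import dtypes  # import the list instead of defining it inline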

An example of how each individual CSV looks is here:

https://pastebin.com/vW98iC3K

Additional code to show what I'm trying to accomplish with this list of dictionaries:

import pandas as pd
from tqdm import tqdm as tq  # assuming tq is the usual tqdm alias

for n in tq(range(4)):
    # A raw string can't end in a single backslash, so the paths are built
    # with escaped backslashes instead
    df = pd.read_csv('S:\\Energy\\' + meterfile[n] + '.csv',
                     parse_dates=[['Date', 'Time']], skiprows=[1],
                     usecols=dictionary[n])
    # Drop everything from the first completely empty Date_Time row onwards
    first_nan = df[df.Date_Time == 'nan nan'].first_valid_index()
    df = df[:first_nan]
    df = df.fillna(0)  # fillna() returns a new frame, so assign it back
    df = df.astype(dtypes[n])

    df.to_csv('S:\\Energy\\Database Progress\\CleanedWorksheets\\' +
              meterfile[n] + '.csv', index=False)

Upvotes: 0

Views: 1736

Answers (1)

smci

Reputation: 33970

Options:

1. Read the CSV in with the defaults, fix the dtypes afterwards

i.e. don't specify any dtypes at read time and let pandas use its defaults; text columns come in as object (string). Then fix up the non-string columns afterwards (in your example that's only 'Speed': float).

import pandas as pd
df = pd.read_csv(..., dtype={'ID': 'object', 'Place': 'object', 'Speed': float}, ...)

# Then fix up any dtype after the read:
df['misread_col'] = df['misread_col'].astype(...)

See the doc on dtype
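For instance, a minimal sketch of that fix-up step applied to your 'Speed' column (assuming it was read in as object and may contain stray non-numeric values; 'meter.csv' is a placeholder filename):

import pandas as pd

df = pd.read_csv('meter.csv', skiprows=[1])
# Coerce the one numeric column after the read; unparseable values become NaN
df['Speed'] = pd.to_numeric(df['Speed'], errors='coerce')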

OR:

2. Programmatically figure out in advance what dtype each column should be

Specify the dtype for each column at read time: pd.read_csv(..., dtype={...}, ...), as a dict of 'name': dtype. This requires you to know the column names in advance (or maybe use regexes to assign dtypes based on name).

And if you want to programmatically read in the header row of each CSV file and map it to dtypes:

import glob
import pandas as pd

for f in glob.glob('*.csv'):
    # First, read just the header row (nrows=0 gives the columns, no data)
    df = pd.read_csv(f, nrows=0)
    # print(df.columns)
    # Here you can process df.columns, build a dtype_dict, whatever

    # Second, reread the entire file with those dtypes
    df = pd.read_csv(f, dtype=...)
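As a rough sketch of that "process df.columns" step, assuming (as in your example) every column should be a string except 'Speed', and keeping your skiprows=[1]:

import glob
import pandas as pd

for f in glob.glob('*.csv'):
    cols = pd.read_csv(f, nrows=0).columns  # header only, no data rows
    # Everything is a string except the one known numeric column
    dtype_dict = {c: (float if c == 'Speed' else str) for c in cols}
    df = pd.read_csv(f, skiprows=[1], dtype=dtype_dict)

Note that if any file has a row where 'Speed' can't be parsed as a number, the read with dtype=float will raise, which is one more argument for option 1.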

Honestly, doing 1) could be less grief than 2); it depends. Just read everything in, tweak your dtype specifiers, and run it again. If you're debugging this, just read with, say, nrows=100.

Upvotes: 1
