TjS

Reputation: 327

How to read csv without changing original datatypes in pandas

I am reading a CSV file and I do not want the dtypes of the columns to come out as object; they should be int, float, str, etc.

data = pd.read_csv(file_path+files, delimiter='\t', error_bad_lines=False)

data.dtypes:
  Time       object
  Code        int64
  Address     object
  dtype: object

Is there any way to read the datatypes as they originally are in the CSV while reading it?

Expected:

data.dtypes:
  Time        int
  Code        int64
  Address     str

I have a dataframe that looks like:

df:
    A     B    C
    abc   10   20
    def   30   50  
    cfg   90   60
    pqr   str  50
    xyz   75   56

I want to get rid of the rows where column 'B' is not an int. Since the dtype of B is inferred as 'object', I am unable to do so.

Upvotes: 8

Views: 18944

Answers (4)

bitbang

Reputation: 2182

# ex.csv
# -0.11566111265093704,0.7655813,0
# 0.8792716084627679,0.82952684,1
# 0.5744048344633055,0.8762405,2
# -0.6245665678004078,0.24478662,3
# -0.33955465349370706,-0.042879142,4

import numpy as np
import pandas as pd

# Map each column index to the dtype it should be read as.
curfile = pd.read_csv("ex.csv", dtype={0: np.float64, 1: np.float32, 2: int}, header=None)

print(type(curfile.iloc[0, 0]), type(curfile.iloc[0, 1]), type(curfile.iloc[0, 2]))

# <class 'numpy.float64'> <class 'numpy.float32'> <class 'numpy.int32'>
# Note: plain int maps to the platform's default integer, so the third column
# may print as numpy.int32 (Windows) or numpy.int64 (most Linux/macOS builds).

Upvotes: 0

Joel Bondurant

Reputation: 883

To bypass Pandas' bad type inference, use a csv reader to feed strings to the DataFrame constructor.

import csv
import io

import pandas as pd

# csv.DictReader yields each row as a dict of strings, so the DataFrame
# constructor receives raw strings and pandas never gets a chance to coerce them.
with open('/tmp/test.csv', 'r') as fin:
    csv_data = io.StringIO(fin.read())
df = pd.DataFrame([*csv.DictReader(csv_data)])
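A quick check (not part of the original answer) confirms that nothing is coerced: every value arrives as a Python str, so the DataFrame reports object dtype across the board.

print(df.dtypes)            # every column: object
print(type(df.iloc[0, 0]))  # <class 'str'>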

Upvotes: 2

CJR

Reputation: 3985

You can convert columns pretty easily for numeric types:

data['Time'] = data['Time'].astype(int)

The dtype for your string field stays object, though, because each value is a Python string object. I believe it would be possible to use a dtype that's explicitly string, but I don't know of any advantage to doing that.
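For what it's worth, newer pandas versions do ship a dedicated string dtype; a minimal sketch, assuming pandas >= 1.0 and the 'Address' column from the question:

# Opt into pandas' dedicated StringDtype (pandas >= 1.0) instead of object.
data['Address'] = data['Address'].astype('string')
print(data.dtypes)  # Address now shows as 'string' rather than 'object'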

For your edited problem, what you want to do is define a converter (because your file does NOT have a consistent data type for that column).

import numpy as np
import pandas as pd

def col_fixer(x):
    # Keep values that parse as integers; replace everything else with NaN.
    try:
        return int(x)
    except ValueError:
        return np.nan

data = pd.read_csv(file_path + files, delimiter='\t', converters=dict(B=col_fixer))

You can then discard rows with NAs however you'd like.
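For example, a sketch assuming the DataFrame from the question:

# Drop rows where the converter produced NaN, then cast back to int
# (the NaN values prevent the column from holding an integer dtype).
data = data.dropna(subset=['B'])
data['B'] = data['B'].astype(int)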

Upvotes: 1

Alex

Reputation: 19104

You can supply the dtype kwarg to read_csv(). From the docs:

dtype : Type name or dict of column -> type, default None

Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

e.g.

data = pd.read_csv(..., dtype={'Time': np.int64})

Edit: As @ALollz points out, this will break if the data in the specified column(s) cannot be converted. It is typically used if you want to read in data using different numbers of bits (e.g. np.int32 instead of np.int64).

You can use df['Time'].astype(int) on the DataFrame with object columns to diagnose which data are causing the conversion issue.
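An alternative sketch for pinpointing the bad rows without raising an error: pd.to_numeric with errors='coerce' turns unparseable values into NaN, which you can then filter on (the 'Time' column name is taken from the question):

import pandas as pd

# Values that can't be parsed become NaN; select those rows to inspect them.
bad_rows = data[pd.to_numeric(data['Time'], errors='coerce').isna()]
print(bad_rows)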

Upvotes: 7
