Reputation: 327
I am reading a CSV and I do not want the dtypes of the columns to be object; they should be int, float, str, etc.
data = pd.read_csv(file_path+files, delimiter='\t', error_bad_lines=False)
data.dtypes:
Time object
Code int64
Address object
dtype: object
Is there any way to read the datatypes as they originally are in the CSV at read time?
Expected:
data.dtypes:
Time int
Code int64
Address str
I have a dataframe that looks like:
df:
A B C
abc 10 20
def 30 50
cfg 90 60
pqr str 50
xyz 75 56
I want to get rid of the rows where column 'B' is not an int. Because the dtype of B is set to 'object', I am unable to do so.
Upvotes: 8
Views: 18944
Reputation: 2182
#ex.csv
# -0.11566111265093704,0.7655813,0
# 0.8792716084627679,0.82952684,1
# 0.5744048344633055,0.8762405,2
# -0.6245665678004078,0.24478662,3
# -0.33955465349370706,-0.042879142,4
import numpy as np
import pandas as pd

curfile = pd.read_csv("ex.csv", dtype={0: np.float64, 1: np.float32, 2: int}, header=None)
print(type(curfile.iloc[0, 0]), type(curfile.iloc[0, 1]), type(curfile.iloc[0, 2]))
# <class 'numpy.float64'> <class 'numpy.float32'> <class 'numpy.int32'>
# (the width of the plain-int column is platform-dependent; it is int64 on most Linux/macOS builds)
Upvotes: 0
Reputation: 883
To bypass Pandas' bad type inference, use a csv reader to feed strings to the DataFrame constructor.
import csv
import io

import pandas as pd

with open('/tmp/test.csv', 'r') as fin:
    csv_data = io.StringIO(fin.read())

df = pd.DataFrame([*csv.DictReader(csv_data)])
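Since every value then comes back as a plain string, you can convert the columns you care about afterwards. A minimal sketch, using an in-memory CSV with a hypothetical numeric column 'B' in place of the file on disk:

```python
import csv
import io

import pandas as pd

# Hypothetical CSV content standing in for the file on disk
csv_data = io.StringIO("A,B\nabc,10\npqr,str\n")

# Every value is read as a plain Python string, so nothing is mis-inferred
df = pd.DataFrame([*csv.DictReader(csv_data)])

# Convert column 'B' explicitly; non-numeric entries become NaN
df['B'] = pd.to_numeric(df['B'], errors='coerce')
```

This keeps the string columns untouched while still letting you get real numeric dtypes where you want them.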
Upvotes: 2
Reputation: 3985
You can convert columns pretty easily for numeric types:
data['Time'] = data['Time'].astype(int)
The dtype for your string field is stuck as an object though, because it's a string object. It would be possible I believe to create a new dtype that's explicitly string, but I don't know of any advantages to doing that.
For your edited problem, what you want to do is define a converter (because your file does NOT have a defined data type for the column)
import numpy as np
import pandas as pd

def col_fixer(x):
    try:
        return int(x)
    except ValueError:
        return np.nan

data = pd.read_csv(file_path + files, delimiter='\t', converters=dict(B=col_fixer))
You can then discard rows with NAs however you'd like.
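For instance, dropna on the converted column removes the offending rows. A sketch using an in-memory stand-in for the question's tab-separated data:

```python
import io

import numpy as np
import pandas as pd

def col_fixer(x):
    """Return x as an int, or NaN if it cannot be parsed."""
    try:
        return int(x)
    except ValueError:
        return np.nan

# Tab-separated stand-in for the question's file
sample = io.StringIO("A\tB\tC\nabc\t10\t20\npqr\tstr\t50\n")
data = pd.read_csv(sample, delimiter='\t', converters=dict(B=col_fixer))

# Drop the rows where the converter produced NaN
data = data.dropna(subset=['B'])
```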
Upvotes: 1
Reputation: 19104
You can supply the dtype kwarg to read_csv(). From the docs:
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
e.g.
data = pd.read_csv(..., dtype={'Time': np.int64})
Edit: As @ALollz points out, this will break if the data in the specified column(s) cannot be converted. It is typically used when you want to read in data using a different number of bits (e.g. np.int32 instead of np.int64).
You can use df['Time'].astype(int) on the DataFrame with objects to diagnose which data are causing the conversion issue.
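One quick way to surface those values is pd.to_numeric with errors='coerce', which turns anything unconvertible into NaN. A sketch with a hypothetical mixed-string column:

```python
import pandas as pd

# Hypothetical column mixing numeric strings and a stray non-numeric value
df = pd.DataFrame({'Time': ['10', '30', 'str', '75']})

# Coerce unconvertible entries to NaN, then show the original offenders
converted = pd.to_numeric(df['Time'], errors='coerce')
bad = df['Time'][converted.isna()]
print(bad)  # these values are what would make .astype(int) raise
```

Once the offenders are known, you can clean or drop them and then apply .astype(int) safely.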
Upvotes: 7