Reputation: 111
I am trying to load a semicolon-separated txt file, and there are a few instances where HTML escape sequences appear in the data. These are typically &lt ; (space added so it isn't rendered as <), which adds a semicolon. This obviously messes up my data and, since dtypes are important, causes read_csv problems. Is there a way to tell pandas to ignore these when the file is read?
I tried deleting the char from the file and it works now, but given that I want an automated process on millions of rows this is not sustainable.
df = pd.read_csv('file_loc.csv',
                 header=None,
                 names=column_names,
                 usecols=counters,
                 dtype=dtypes,
                 delimiter=';',
                 low_memory=False)
ValueError: could not convert string to float:
My first column is a string and the second is a float, but when the first is split at the semicolon in &lt; the remainder spills into the second column too.
Is there a way to tell pandas to ignore these, or to remove them efficiently before loading?
Upvotes: 0
Views: 940
Reputation: 30579
Given the following example csv file so57732330.csv:
col1;col2
1&lt;2;a
3;
we read it using StringIO after unescaping named and numeric HTML5 character references:
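import html
import io

import pandas as pd

# Read the raw text, then unescape HTML character references (e.g. &lt; becomes <)
with open('so57732330.csv') as f:
    s = f.read()

# Wrap the cleaned text in a file-like object so read_csv can parse it
f = io.StringIO(html.unescape(s))
df = pd.read_csv(f, sep=';')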
Result:
  col1 col2
0  1<2    a
1    3  NaN
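The same idea should carry over to the headerless, typed layout in the question. The following is only a sketch: the sample text and the column names `label` and `value` are made up for illustration, and it assumes that none of the unescaped characters is itself a semicolon or a quote character, which would still confuse the parser.

```python
import html
import io

import pandas as pd

# Hypothetical sample mirroring the question: no header row,
# a string column followed by a float column.
raw = "a&lt;b;1.5\nc;2.0\n"

# Unescape first, so the ';' that terminates '&lt;' is gone
# before pandas splits each line on the delimiter.
buf = io.StringIO(html.unescape(raw))

df = pd.read_csv(
    buf,
    header=None,
    names=["label", "value"],  # assumed column names
    dtype={"label": str, "value": float},
    sep=";",
)
print(df)
```

For a real file, `raw` would come from reading the file as in the snippet above. This holds the whole text in memory once, which is usually acceptable for a few million short rows.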
Upvotes: 1