Reputation: 1196
I am manipulating data using pandas and Python3.4. I am having a problem with a specific csv file. I don't know why, even with nan
values, pandas usally reads columns as float
. Here it reads them as string
. Here is what my csv file looks like:
Date RR TN TX
08/10/2015 0 10.5 19.5
09/10/2015 0 5.5 20
10/10/2015 0 5 24
11/10/2015 0.5 7 24.5
12/10/2015 3 12 23
...
27/04/2017
28/04/2017
29/04/2017
30/04/2017
01/05/2017
02/05/2017
03/05/2017
04/05/2017
The problem is I can't convert it to float
because of the nan
values at the end. I need them as float
because I am trying to do TN
+ TX
.
This is what I tried so far:
When reading the file:
dfs[code] = pd.read_csv(path, sep = ';', index_col = 0, parse_dates = True, encoding = 'ISO-8859-1', dtype = float)
I also tried:
dtype = {
'TN': np.float,
'TX': np.float
}
dfs[code] = pd.read_csv(path, sep = ';', index_col = 0, parse_dates = True, encoding = 'ISO-8859-1', dtype = dtype)
Otherwise, at the moment to perform the addition I also tried:
tn = dfs[code]['TN'].astype(float)
tx = dfs[code]['TX'].astype(float)
formatted_dfs[code] = tn + tx
But I always get the same error:
ValueError: could not convert string to float.
I know I could do it row by row, testing if the value is nan
, but I am pretty sure there is an easier way. Do you know how to do that? Or will I have to do it row by row? Thanks.
Upvotes: 1
Views: 6931
Reputation: 893
Add the convert parameter in read method - converters={'TN':float,'TX':float}
dfs[code] = pd.read_csv(path, sep = ';',converters={'TN':float,'TX':float}, index_col = 0, parse_dates = True, encoding = 'ISO-8859-1', dtype = float)
Upvotes: 0
Reputation: 4471
You can see that if you allow pandas to detect dtypes itself, you avoid the ValueError and uncover the underlying problem.
In [4]: df = pd.read_csv(path, sep=';', index_col=0, parse_dates=True, low_memory=False)
In [5]: df
Out[5]:
Empty DataFrame
Columns: []
Index: [08/10/2015 0 10.5 19.5, 09/10/2015 0 5.5 20, 10/10/2015 0 5 24, 11/10/2015 0.5 7 24.5, 12/10/2015 3 12 23, 27/04/2017 , 28/04/2017 , 29/04/2017 , 30/04/2017 , 01/05/2017 , 02/05/2017 , 03/05/2017 , 04/05/2017 ]
It seems you specify the separator as ';'
by accident, as your file is whitespace delimited. Since there aren't any semi-colons, entire rows are read into the index.
First, try reading in your file using the proper separator
df = pd.read_csv(path, delim_whitespace=True, index_col=0, parse_dates=True, low_memory=False)
Now, some of the rows have incomplete data. A simple solution conceptually is to try to convert values to np.float
, and replace them with np.nan
otherwise.
def f(x):
try:
return np.float(x)
except:
return np.nan
df["TN"] = df["TN"].apply(f)
df["TX"] = df["TX"].apply(f)
print(df.dtypes)
This returns, as desired
RR object
TN float64
TX float64
dtype: object
Upvotes: 5