Silveris

Reputation: 1196

Convert a column from a pandas DataFrame to float with nan values

I am manipulating data using pandas and Python 3.4. I am having a problem with a specific CSV file. I don't know why: even with NaN values, pandas usually reads columns as float, but here it reads them as strings. Here is what my CSV file looks like:

Date        RR  TN  TX
08/10/2015  0   10.5    19.5
09/10/2015  0   5.5 20
10/10/2015  0   5   24
11/10/2015  0.5 7   24.5
12/10/2015  3   12  23
...
27/04/2017           
28/04/2017           
29/04/2017           
30/04/2017           
01/05/2017           
02/05/2017           
03/05/2017           
04/05/2017           

The problem is that I can't convert the columns to float because of the NaN values at the end. I need them as float because I am trying to compute TN + TX. This is what I have tried so far:

When reading the file:

dfs[code] = pd.read_csv(path, sep = ';', index_col = 0, parse_dates = True, encoding = 'ISO-8859-1', dtype = float)

I also tried:

dtype = {
    'TN': np.float,
    'TX': np.float
}
dfs[code] = pd.read_csv(path, sep = ';', index_col = 0, parse_dates = True, encoding = 'ISO-8859-1', dtype = dtype)

Otherwise, when performing the addition, I also tried:

tn = dfs[code]['TN'].astype(float)
tx = dfs[code]['TX'].astype(float)
formatted_dfs[code] = tn + tx

But I always get the same error:

ValueError: could not convert string to float.

I know I could do it row by row, testing whether each value is NaN, but I am pretty sure there is an easier way. Do you know how to do this, or will I have to do it row by row? Thanks.

Upvotes: 1

Views: 6931

Answers (2)

Sudheer Muhammed

Reputation: 893

Add the converters parameter to the read method: converters={'TN': float, 'TX': float}

dfs[code] = pd.read_csv(path, sep=';', converters={'TN': float, 'TX': float}, index_col=0, parse_dates=True, encoding='ISO-8859-1')
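Note that a plain float converter will still raise on blank fields, so in practice you may need a converter that maps empty strings to NaN. A minimal sketch, keeping the rest of the original call's arguments and using an illustrative to_float helper (not part of the original answer):

import numpy as np
import pandas as pd

# Illustrative helper: blank fields become NaN instead of raising ValueError.
def to_float(value):
    value = value.strip()
    return float(value) if value else np.nan

dfs[code] = pd.read_csv(path, sep=';', converters={'TN': to_float, 'TX': to_float},
                        index_col=0, parse_dates=True, encoding='ISO-8859-1')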

Upvotes: 0

Micah Smith

Reputation: 4471

You can see that if you allow pandas to detect dtypes itself, you avoid the ValueError and uncover the underlying problem.

In [4]: df = pd.read_csv(path, sep=';', index_col=0, parse_dates=True, low_memory=False)
In [5]: df
Out[5]:
Empty DataFrame
Columns: []
Index: [08/10/2015  0   10.5    19.5, 09/10/2015  0   5.5 20, 10/10/2015  0   5   24, 11/10/2015  0.5 7   24.5, 12/10/2015  3   12  23, 27/04/2017           , 28/04/2017           , 29/04/2017           , 30/04/2017           , 01/05/2017           , 02/05/2017           , 03/05/2017           , 04/05/2017   ]

It seems you specified the separator as ';' by accident, as your file is whitespace-delimited. Since there are no semicolons, entire rows are read into the index.

First, try reading your file with the proper separator:

df = pd.read_csv(path, delim_whitespace=True, index_col=0, parse_dates=True, low_memory=False)

Now, some of the rows have incomplete data. A simple solution, conceptually, is to try to convert each value to float and replace it with np.nan when the conversion fails.

def f(x):
    # Return the value as a float, or NaN if it cannot be parsed.
    try:
        return float(x)
    except (ValueError, TypeError):
        return np.nan

df["TN"] = df["TN"].apply(f)
df["TX"] = df["TX"].apply(f)

print(df.dtypes)

This returns, as desired:

RR     object
TN    float64
TX    float64
dtype: object
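With both columns now float64, the addition from the question works; rows with missing TN or TX simply produce NaN. A minimal usage sketch, assuming the df from above:

# NaN propagates through the addition for the incomplete rows.
formatted = df["TN"] + df["TX"]
print(formatted.tail())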

Upvotes: 5
