Reputation: 529
I am generating a pivot table report using the pandas Python module. The source data includes a lot of readings measured in milliseconds. If the number of milliseconds exceeds 999, the value in the CSV file includes a comma as a thousands separator (e.g. 1,234 = 1.234 seconds).
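For illustration, a couple of rows of ds-dump.csv might look like this (the exact layout is an assumption on my part; the column names come from the pivot_table call below, and the comma-grouped value has to be quoted so the field itself isn't split):

Average Pool Size,Average Usage Time (ms)
12,"1,234"
8,567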
Here's how I'm trying to run the report:
import pandas as pd
import numpy as np
pool_usage = pd.read_csv("c:/foo/ds-dump.csv")
# Add a column to the end that shows you where the data came from
pool_usage["Source File"] = "ds-dump.csv"
report = pool_usage.pivot_table(values=['Average Pool Size', 'Average Usage Time (ms)'], index=['Source File'], aggfunc=np.max)
print(report)
The problem is that the dtype of the Average Usage Time (ms) column is object, so the np.max aggregation treats the comma-containing values as NaN. I therefore never see any values greater than 999.
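You can confirm this by inspecting the dtypes after loading (a quick check, not part of the report itself):

print(pool_usage.dtypes)  # 'Average Usage Time (ms)' comes back as object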
I tried fixing the issue like this:
import pandas as pd
import numpy as np
pool_usage = pd.read_csv("c:/foo/ds-dump.csv")
# Add a column to the end that shows you where the data came from
pool_usage["Source File"] = "ds-dump.csv"
# Convert strings to numbers if possible
pool_usage = pool_usage.convert_objects(convert_numeric=True)
report = pool_usage.pivot_table(values=['Average Pool Size', 'Average Usage Time (ms)'], index=['Source File'], aggfunc=np.max)
print(report)
This did change the dtype of the Average Usage Time (ms) column to a float, but all of the values greater than 999 are still treated as NaNs.
How can I convert the Average Usage Time column to a float even though some of the values may include commas?
Upvotes: 4
Views: 5097
Reputation: 176780
The read_csv function takes an optional thousands argument. Its default is None, so you can change it to "," to have it recognise 1,234 as 1234 when it reads the file:
pd.read_csv("c:/foo/ds-dump.csv", thousands=",")
The column holding the millisecond values should then have the int64 datatype once the file has been read into memory.
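For example, here is a minimal end-to-end sketch of the report from the question with only the thousands argument added (the file path and column names are taken from the question):

import pandas as pd
import numpy as np

# thousands="," makes read_csv parse comma-grouped numbers such as 1,234 as 1234
pool_usage = pd.read_csv("c:/foo/ds-dump.csv", thousands=",")

# Add a column to the end that shows where the data came from
pool_usage["Source File"] = "ds-dump.csv"

report = pool_usage.pivot_table(
    values=["Average Pool Size", "Average Usage Time (ms)"],
    index=["Source File"],
    aggfunc=np.max,
)
print(report)

Since the numeric conversion happens at parse time, no convert_objects step is needed afterwards.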
Upvotes: 10