Reputation: 651
thanks for your help.
I have a large CSV pseudo-patient dataset that generally looks like this:
Average Covered Charges , Total Discharges , Average Medicare Payments
32963.07 , 91 , 4763.73
15131.85 , 14 , 4976.71
37560.37 , ,
The problem I'm facing is that I'm trying to get the total number of discharges greater than 100, BUT because one of the columns has one more value at the end than all the others, the last cell of each other column just contains empty space, which is being counted as an item in the length of the dataset. I've loaded this CSV file into a pandas DataFrame and removed the NAs, and my dataset still looks like this.
So NAs, None, etc. are not the problem. It's an empty cell that is giving the last two columns the dtype "object", and because of that blank cell/empty space, it is impossible to cast the "object" dtype to "int".
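A minimal sketch of the situation (assuming the blank cell comes through as an empty string, with made-up sample values from the question):

```python
import pandas as pd

# An empty string in a column forces dtype "object", and dropna()
# leaves it alone because '' is not NaN.
df = pd.DataFrame({
    "Average Covered Charges": [32963.07, 15131.85, 37560.37],
    "Total Discharges": ["91", "14", ""],
})
print(df["Total Discharges"].dtype)  # object
print(len(df.dropna()))              # 3 -- the row with the empty cell survives
```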
If I could remove every row that has an empty cell in any of the columns, all my problems would go away. So I would like my dataset to look like this:
Average Covered Charges , Total Discharges , Average Medicare Payments
32963.07 , 91 , 4763.73
15131.85 , 14 , 4976.71
I'm surprised that I'm having such a hard time doing this. From what I've read, the pandas documentation assumes those cells are NAs so you can just drop them, but I can't get pandas to treat them that way at all.
import pandas as pd
inpatients_df = pd.read_csv("inpatient_payment_system.csv", low_memory = False)
inpatients_df = inpatients_df.dropna(axis=0, how='any')  # Remove rows with NA values.
I've already loaded in the dataset and removed all NAs, not like it matters.
Note: there are many more than 3 columns in this dataset. It only takes one column having more entries than the others for this to happen, and the issue is actually in the last row of the CSV file.
Upvotes: 2
Views: 1777
Reputation: 12417
Try this:
inpatients_df.dropna(subset=['Total Discharges'], how='all', inplace = True)
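If the blank cells are empty or whitespace-only strings rather than real NaN, dropna alone won't catch them; a common preliminary step is to convert them to NaN first. A sketch with assumed sample data (the column names follow the question):

```python
import numpy as np
import pandas as pd

# Assumed sample frame: one cell is a lone space, one is an empty string.
inpatients_df = pd.DataFrame({
    "Total Discharges": ["91", "14", " "],
    "Average Medicare Payments": ["4763.73", "4976.71", ""],
})

# Turn empty/whitespace-only cells into real NaN, then drop those rows.
cleaned = inpatients_df.replace(r"^\s*$", np.nan, regex=True).dropna()
print(len(cleaned))  # 2
```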
Upvotes: 2
Reputation: 3130
For this particular case, you can use the .isnumeric() string method to get the appropriate slice:
df[df['Total Discharges'].str.isnumeric()]
But to answer your question directly: those entries you're calling empty are probably empty strings, so something like
df[df['Total Discharges'] != '']
should do the trick, although you might want to explicitly check the value in that cell to be sure.
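One caveat with the isnumeric approach: if the column also contains real NaN values, the mask itself will contain missing values, so it helps to fill those with False before indexing. A sketch with an assumed sample column:

```python
import pandas as pd

# Assumed sample column: one empty string, one real missing value.
s = pd.Series(["91", "14", "", None], name="Total Discharges")

# isnumeric() yields NaN for missing entries, so fill those with False
# before using the result as a boolean mask.
mask = s.str.isnumeric().fillna(False)
print(s[mask].tolist())  # ['91', '14']
```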
Upvotes: 1
Reputation: 2981
Just get one of the columns as a Series and work with that:
(pd.to_numeric(df['Total Discharges'], errors='coerce').dropna() > 100).sum()
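For example, on an assumed sample column mirroring the question:

```python
import pandas as pd

# Assumed sample data; the blank cell stays a string.
df = pd.DataFrame({"Total Discharges": ["91", "140", ""]})

# to_numeric with errors='coerce' turns non-numeric cells into NaN,
# dropna() discards them, and the comparison counts values above 100.
count = (pd.to_numeric(df["Total Discharges"], errors="coerce").dropna() > 100).sum()
print(count)  # 1
```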
Upvotes: 1