xyz123

Reputation: 651

Removing the rows for which cells are empty in Pandas?

Thanks for your help.

I have a large CSV pseudo-patient dataset that generally looks like this:

 Average Covered Charges ,  Total Discharges  , Average Medicare Payments
        32963.07         ,          91        ,           4763.73 
        15131.85         ,          14        ,           4976.71 
        37560.37         ,                    ,          

The problem I am facing: I am trying to get the total number of rows with discharges greater than 100, but because one of the columns has one more value at the end than all the others, the last cell of each other column contains only empty space, and that empty space is being counted as an item in the length of the dataset. I've loaded this CSV file into a pandas DataFrame and removed the NAs, and my dataset still looks like the sample above.

So NAs, None, etc. are not the problem. It's the empty cells that give the last two columns the dtype "object", and because each of those columns contains a blank cell/empty space, it is impossible to cast the "object" dtype to "int".
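For illustration, here is a minimal, hypothetical reproduction of that symptom (the data is inlined and made up; exact parsing can vary by pandas version):

    import io
    import pandas as pd

    # Hypothetical miniature of the CSV: the last row has whitespace-only cells.
    csv = io.StringIO(
        "Average Covered Charges,Total Discharges,Average Medicare Payments\n"
        "32963.07,91,4763.73\n"
        "15131.85,14,4976.71\n"
        "37560.37, , \n"
    )
    df = pd.read_csv(csv)
    print(df.dtypes)        # 'Total Discharges' comes back as object, not int64
    print(df.isna().sum())  # the blank cells are " ", not NaN, so dropna() ignores them
    # df['Total Discharges'].astype(int)  # raises ValueError because of the blank cell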

If I could remove every row that has an empty cell in any of the columns, all my problems would go away. So I would like my dataset to look like this:

 Average Covered Charges  , Total Discharges  , Average Medicare Payments
        32963.07          ,          91       ,           4763.73 
        15131.85          ,          14       ,           4976.71 

I'm surprised that I'm having such a hard time doing this. From what I've read, the Pandas documentation assumes those cells are NAs that can simply be dropped, but I can't get Pandas to treat them that way at all.

    import pandas as pd

    inpatients_df = pd.read_csv("inpatient_payment_system.csv", low_memory=False)

    # Remove NA values. Note: dropna returns a new DataFrame, so assign it back.
    inpatients_df = inpatients_df.dropna(axis=0, how='any')

I've already loaded the dataset and removed all NAs, as shown above, not that it matters.

Note: there are many more than 3 columns in this dataset. It only takes one column having more entries than the others for this to happen, and the issue is actually in the last row of the CSV file.

Upvotes: 2

Views: 1777

Answers (3)

Joe

Reputation: 12417

Try this:

    inpatients_df.dropna(subset=['Total Discharges'], how='all', inplace=True)
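Note that dropna only catches cells that are actually NaN. If the blanks are whitespace strings, as the question suggests, one common approach (a sketch, assuming that is the case) is to coerce them to NaN first:

    import numpy as np

    # Assumption: the blanks are whitespace-only strings rather than NaN.
    # Turn them into real NaN so dropna can see them.
    inpatients_df = inpatients_df.replace(r'^\s*$', np.nan, regex=True)
    inpatients_df.dropna(subset=['Total Discharges'], inplace=True)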

Upvotes: 2

Ken Wei

Reputation: 3130

For this particular case, you can use the .str.isnumeric() accessor to get the appropriate slice:

    df[df['Total Discharges'].str.isnumeric()]

But to answer your question: chances are, the entries you're calling empty are actually empty strings, so something like

    df[df['Total Discharges'] != '']

should do the trick, although you might want to explicitly check the value in that cell to be sure.
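One caveat: in the sample data the cells are padded with spaces (e.g. "  91  "), so both the isnumeric() check and the != '' comparison can misfire. A hedged sketch that strips the padding first:

    # Assumption: cells are strings padded with spaces, e.g. "  91  " or "   ".
    s = df['Total Discharges'].str.strip()   # blank cells become ""
    mask = s.str.isnumeric()                 # False for "" and non-digit strings
    df = df[mask].copy()
    df['Total Discharges'] = s[mask].astype(int)
    print((df['Total Discharges'] > 100).sum())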

Upvotes: 1

Sebastian Mendez

Reputation: 2981

Just get one of the columns as a Series, and work with that:

    (pd.to_numeric(df['Total Discharges'], errors='coerce').dropna() > 100).sum()
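The same idea, unpacked step by step (the column name is taken from the question):

    # to_numeric with errors='coerce' turns anything unparsable (the blanks) into NaN.
    discharges = pd.to_numeric(df['Total Discharges'], errors='coerce')
    discharges = discharges.dropna()     # drop the rows that were blank
    print((discharges > 100).sum())      # count of discharges over 100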

Upvotes: 1
