I do not understand the behavior of pandas.drop, since I get different results from dropna (too many rows are dropped)

Question

I have a DataFrame with some NA and I want to drop the rows where a particular column has NA values.

My first trial has been: - Identifying the rows where the specific column values were NA - Pass them to pandas.drop()

In my specific case, I have a DataFrame of 39164 rows by 40 columns. If I look to NA in the specific columns I found 17715 concerned labels that I saved to a dedicated variable. I then sent them to pandas.drop() expecting about 22000 rows remaining, but I only got 2001. If I use pandas.dropna() I get 21449 rows remaining, which is what I was expecting.

Here follows my code. The first code portion downloads data from gouv.fr (sorry for not using fake data...but it will only take less than 10 seconds to execute). WARNING: only the last 5 years are stored on the online database. So my example should be adapted later...

import pandas as pd

villes = {'Versailles' : '78646',
          'Aix-en-Provence' : '13001'}
years = range(2014,2019)

root = "https://cadastre.data.gouv.fr/data/etalab-dvf/latest/csv/"
data = pd.DataFrame({})

for ville in villes.keys() :
    for year in years :
        file_addr = '/'.join([root,str(year),'communes',villes[ville][:2],villes[ville]+'.csv'])
        print(file_addr)
        tmp = pd.read_csv(file_addr)
        data =pd.concat([data,tmp])

This is the second portion, where I try to drop some rows. As said, the results are very different depending on the chosen strategy (data_1 vs data_2). data_2, obtained by dropna() is the expected results.

print(data.shape)
undefined_surface = data.index[data.surface_reelle_bati.isna()]
print(undefined_surface)
data_1 = data.drop(undefined_surface)
data_2 = data.dropna(subset=['surface_reelle_bati'])
print(data_1.shape)
print(data_2.shape)

Using dropna() is totally fine for me, but I would like to understand what I was doing wrong with drop() since I got a very silly result compared to my expectation and I would like to be aware of that in the future... Thanks in advance for help.

fmarm · Accepted Answer

This is because your index is not unique, look for example for index 0, you have forty rows with this index

data_idx0 = data.iloc[0]
data_idx0.shape
# (40,)

If at least one of the rows with index 0 has surface_reelle_bati missing, all the forty rows will disappear from data_1. This is why you drop more rows when creating data_1 than when creating data_2.

To solve this, use reset_index()to get index go from 0 to the number of rows of data

data = data.reset_index()
undefined_surface = data.index[data.surface_reelle_bati.isna()].tolist()
data_1 = data.drop(undefined_surface)
print(data_1.shape)
# (21449, 41)
data_2 = data.dropna(subset=['surface_reelle_bati'])
print(data_2.shape)
# (21449, 41)

I do not understand the behavior of pandas.drop, since I get different results from dropna (too many rows are dropped)

Answers (1)

Related Questions