subsetting pandas dataframe

Question

I have found an inconsistency (at least to me) in the following two approaches:

For a dataframe defined as (with numpy imported as np and pandas as pd):

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])

I would like to access the element in the 1st row, 4th column (counting from 0). I either do this:

df[4][1]
Out[94]: 5.0

Or this:

df.iloc[1,4]
Out[95]: 5.0

Am I right that in the first approach I need to give the column first and then the row, and the other way around when using iloc? I just want to make sure that I use both approaches correctly going forward.

EDIT: Some of the answers below have pointed out that the first approach is not as reliable, and I see now that this is why:

df.index = ['7','88']
df[4][1]
Out[101]: 5.0

I still get the correct result. But with an integer index instead, it raises an exception if the corresponding number is no longer among the index labels:

df.index = [7,88]
df[4][1]
KeyError: 1

Also, changing the column names:

df.columns = ['4','5','6','1','5']
df['4'][1]
Out[108]: 8

Gives me a different result. So overall, I should stick to iloc or loc to avoid these issues.

FatihAkici · Accepted Answer

You should think of a DataFrame as a collection of columns. When you do df[4], you get the column labeled 4 (here also the column at position 4), which is a pandas Series. When you then do df[4][1], you get the element of that Series labeled 1, which corresponds to the entry in row 1, column 4 of the DataFrame. That is exactly what df.iloc[1,4] returns.
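As a rough illustration of the two-step lookup described above, the sketch below rebuilds the question's DataFrame and compares the chained access with iloc and loc. The variable names (col, chained, positional, by_label) are only for illustration, and np.nan stands in for the question's np.NaN.

```python
import numpy as np
import pandas as pd

# Same data as in the question, with default 0..n-1 row and column labels
df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])

# Step 1: df[4] selects the column labeled 4 and returns a Series
col = df[4]
print(type(col).__name__)

# Step 2: index that Series by the row label 1
chained = col[1]

# Single-step equivalents: iloc takes (row position, column position),
# loc takes (row label, column label)
positional = df.iloc[1, 4]
by_label = df.loc[1, 4]

print(chained, positional, by_label)
```

All three lookups agree here only because the default labels happen to equal the positions.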

So there is no inconsistency at all, but beware: this works only if you don't have any column names, or if your column names are [0,1,2,3,4] (and likewise for the row index). Otherwise it will either fail or give you a wrong result. Hence, for positional indexing you should stick with iloc, and with loc for label-based indexing.
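A minimal sketch of that caveat, assuming the row labels [7, 88] from the question and hypothetical column names "a" to "e" (not the ones used in the question): once the labels no longer match the positions, iloc and loc still work as intended, while the chained integer lookup fails.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])
df.index = [7, 88]                       # row labels from the question
df.columns = ["a", "b", "c", "d", "e"]   # hypothetical column names

# Positional: second row, fifth column, regardless of labels
by_position = df.iloc[1, 4]

# Label-based: row labeled 88, column labeled "e"
by_label = df.loc[88, "e"]

# The chained lookup with the old integer keys now raises KeyError,
# because there is no column labeled 4 any more
try:
    df[4][1]
    chained_failed = False
except KeyError:
    chained_failed = True

print(by_position, by_label, chained_failed)
```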
