Niccola Tartaglia

Reputation: 1667

subsetting pandas dataframe

I have found what looks to me like an inconsistency between the following two approaches:

For a dataframe defined as:

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])

I would like to access the element in the 1st row, 4th column (counting from 0). I either do this:

df[4][1]
Out[94]: 5.0

Or this:

df.iloc[1,4]
Out[95]: 5.0

Am I correctly understanding that in the first approach I need to use the column first and then the rows, and vice versa when using iloc? I just want to make sure that I use both approaches correctly going forward.

EDIT: Some of the answers below have pointed out that the first approach is not as reliable, and I see now that this is why:

df.index = ['7','88']
df[4][1]
Out[101]: 5.0

I still get the correct result. But with an integer index, an exception is raised if the requested number is no longer among the row labels:

df.index = [7,88]
df[4][1]   
KeyError: 1

Also, changing the column names:

df.columns = ['4','5','6','1','5']
df['4'][1]
Out[108]: 8

Gives me a different result. So overall, I should stick to iloc or loc to avoid these issues.

Upvotes: 1

Views: 245

Answers (2)

gepcel

Reputation: 1356

Unfortunately, you are not using them correctly. It's just a coincidence that you get the same result.

df.loc[i, j] means the element in df with the row named i and the column named j

Besides many other differences, df[j] means the column named j, and df[j][i] means the element named i (a row label here) within the column named j.

df.iloc[i, j] means the element in the i-th row and the j-th column, counting from 0.

So, df.loc selects data by label (a string, an int, or any other type; int in this case), while df.iloc selects data by position. It's just a coincidence that in your example the i-th row is also named i.
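
To make the label/position distinction concrete, here is a minimal sketch. It reuses the values from the question, but the string labels ("a", "b", "v" to "z") are made up for illustration so that labels and positions cannot be confused:

```python
import numpy as np
import pandas as pd

# Same values as in the question, with hypothetical string labels.
df = pd.DataFrame(
    [[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]],
    index=["a", "b"],
    columns=["v", "w", "x", "y", "z"],
)

by_label = df.loc["b", "z"]   # row labeled "b", column labeled "z"
by_position = df.iloc[1, 4]   # row at position 1, column at position 4

print(by_label, by_position)
```

Both lookups reach the same cell here, but only because "b" happens to sit at position 1 and "z" at position 4.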

For more details you should read the doc

Update:

Think of df[4][1] as a convenience shortcut. There is some logic behind it, so under most circumstances you'll get what you want.

In fact

df.index = ['7', '88']
df[4][1]

works because the index holds strings. You pass the int 1, which cannot match any string label, so the lookup falls back to positional indexing. If you run:

df.index = [7, 88]
df[4][1]

it will raise an error (a KeyError, because no row is labeled 1). And

df.index = [1, 0]
df[4][1]

still won't give the element you expect, because the result is not the row at position 1 (counting from 0). It is the row with the name 1.
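
The last case can be checked with a short sketch. It assumes the DataFrame from the question and only uses an integer index, so the chained lookup is a plain label lookup:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])
df.index = [1, 0]  # the first row is now labeled 1, the second row 0

chained = df[4][1]           # column labeled 4, then the row labeled 1
by_label = df.loc[1, 4]      # row labeled 1 is the first row
by_position = df.iloc[1, 4]  # row at position 1 is the second row

print(chained, by_label, by_position)
```

Here df[4][1] and df.loc[1, 4] both return the NaN from the first row, while df.iloc[1, 4] returns 5.0 from the second row.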

Upvotes: 2

FatihAkici

Reputation: 5109

You should think of a DataFrame as a collection of columns. So when you do df[4], you get the 4th column of df, which is a pandas Series. After that, df[4][1] gives you the 1st element of this Series, which corresponds to the 1st row and 4th column entry of the DataFrame. That is exactly what df.iloc[1,4] returns.

So there is no inconsistency at all, but beware: this works only if you don't have any column names, or if your column names are [0,1,2,3,4]. Otherwise, it will either fail or give you a wrong result. Hence, use iloc for positional indexing and loc for label-based indexing.
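
A rough sketch of that failure mode follows. The column names "a" to "e" are hypothetical, and the index [7, 88] is borrowed from the question's edit:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])
df.index = [7, 88]
df.columns = ["a", "b", "c", "d", "e"]  # hypothetical column names

# Positional access ignores the labels entirely.
by_position = df.iloc[1, 4]

# Label access needs the actual row and column names.
by_label = df.loc[88, "e"]

# The old default labels 1 and 4 no longer exist, so the lookup fails.
try:
    df.loc[1, 4]
    old_labels_found = True
except KeyError:
    old_labels_found = False

print(by_position, by_label, old_labels_found)
```

Once the labels change, iloc keeps working unchanged, while any label-based lookup has to use the new names.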

Upvotes: 2
