Reputation: 8954
I had thought that a Pandas DataFrame was basically represented as a collection of columns. That is, I thought the following two lines of code would produce the same list of Series (for some arbitrary DataFrame df):
list1 = [item for item in df]
list2 = [df[col_name] for col_name in df.columns]
But apparently they're very different: treating df as an iterable and stepping through it is exactly the same as stepping through df.columns, which of course is just the column names:
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5], 'col_2': [5, 6, 7, 8, 9]})
for a, b in zip(df, df.columns):
    print(a, b, type(a), type(b), a == b)
outputs:
col_1 col_1 <class 'str'> <class 'str'> True
col_2 col_2 <class 'str'> <class 'str'> True
Why is this? This seems very unintuitive to me.
(To be clear: I'm not asking how to get a list of the columns in a DataFrame, or how to step through the columns of a DataFrame.)
Upvotes: 1
Views: 90
Reputation: 19947
When you iterate over a DataFrame directly, as in:
[item for item in df]
you are calling df.__iter__(), which returns an iterator over df._info_axis. That attribute is looked up by name through df._info_axis_name, which for a DataFrame is "columns", so you end up iterating over df.columns, i.e. the column labels.
By contrast, df[col_name] selects that column from the DataFrame and returns it as a Series.
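One way to make this feel less surprising is to think of a DataFrame as dict-like, with column labels as keys and Series as values: iterating a dict also gives you its keys, not its values. The sketch below is only an illustration of that analogy using the example DataFrame from the question; the variable names (labels, series_list, pairs) are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5], 'col_2': [5, 6, 7, 8, 9]})

# Iterating the DataFrame yields column labels, like iterating a dict's keys.
labels = list(df)

# These are the same labels that df.columns holds.
same_as_columns = labels == list(df.columns)

# Indexing by label returns each column as a Series.
series_list = [df[name] for name in df]

# items() yields (label, Series) pairs, much like dict.items().
pairs = [(name, type(col).__name__) for name, col in df.items()]

print(labels)
print(pairs)
```

So if you want the Series themselves while looping, df.items() gives you the label and the column together.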
Upvotes: 1