Bjc51192

Reputation: 327

Subset list of dataframes by column name

I have a list of dataframes, df_list = [df1, df2, df3], and a list of the column headers I am interested in, col_list = ['Fire', 'Water', 'Wind', 'Hail'].

I want to loop through each dataframe in df_list and create a new dataframe with only the columns in col_list. The issue is that if one of the elements in col_list is not in a given dataframe, I still want the new dataframe to be created, just without that column.

What I tried doing is,

for data_frame in df_list:
    try:
        data_frame = data_frame[['Fire', 'Water', 'Wind', 'Hail']]
    except:
        continue

However, this does not give the result I am looking for.

Upvotes: 2

Views: 58

Answers (2)

Andy Hayden

Reputation: 375475

You should use a list comprehension:

[data_frame[['Fire','Water','Wind','Hail']] for data_frame in df_list]

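If some dataframes do not have all the columns, you can use reindex instead. Note that reindex keeps every requested column and fills any missing one with NaN rather than dropping it: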

[data_frame.reindex(columns=['Fire','Water','Wind','Hail']) for data_frame in df_list]
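To make the reindex behaviour concrete, here is a minimal sketch. The sample frame and its 'Rain' column are made up for illustration and are not from the question:

```python
import pandas as pd

col_list = ['Fire', 'Water', 'Wind', 'Hail']

# Hypothetical frame: has two of the wanted columns plus an unrelated one.
df = pd.DataFrame({'Fire': [1, 2], 'Water': [3, 4], 'Rain': [5, 6]})

# reindex returns exactly the requested columns, in the requested order.
# 'Rain' is dropped; 'Wind' and 'Hail' are created and filled with NaN.
reindexed = df.reindex(columns=col_list)
print(reindexed)
```

If the missing columns should be left out entirely instead of appearing as NaN, selecting only the columns that actually exist is probably the better fit.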

Inside the for loop:

data_frame=data_frame[['Fire','Water','Wind','Hail']]

rebinds the loop variable data_frame to a new object but does not update the i-th item of df_list.
This is equivalent to the following code:

In [11]: a = [1, 2, 3]

In [12]: for i in a:
    ...:     i = i + 1
    ...:

In [13]: a
Out[13]: [1, 2, 3]
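If you do want the loop itself to change the list, one option is to assign back by position. A small sketch with the same toy list (the same pattern should apply to df_list, assigning the subset dataframe to df_list[idx]):

```python
a = [1, 2, 3]

# enumerate yields (position, value) pairs, so the result can be
# written back into the list instead of only rebinding the loop variable.
for idx, value in enumerate(a):
    a[idx] = value + 1

print(a)  # [2, 3, 4]
```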

Upvotes: 1

aiguofer

Reputation: 2137

You could use a list comprehension to get the subset of columns that are in col_list. However, while you're iterating, the data_frame variable only holds a reference to the object, so reassigning it won't change the element in the list. Instead, you could keep another list with the "sub dataframes".

sub_df_list = []
for data_frame in df_list:
    sub_df_list.append(
        data_frame[[col for col in data_frame.columns if col in col_list]]
    )

Edit:

As pointed out in another answer, you could do this as a single list comprehension... which is a bit hard on the eyes:

sub_df_list = [
    data_frame[[col for col in data_frame.columns if col in col_list]]
    for data_frame in df_list
]

Edit 2:

A DataFrame's columns are a pandas Index object, which supports set operations such as intersection. The easiest way to do what you're after is:

sub_df_list = [
    data_frame[data_frame.columns.intersection(col_list)] for data_frame in df_list
]
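Here is a self-contained sketch of the intersection approach. The two sample frames and the 'Rain' column are invented for illustration:

```python
import pandas as pd

col_list = ['Fire', 'Water', 'Wind', 'Hail']

# Hypothetical inputs: df1 has every wanted column, df2 has only one.
df1 = pd.DataFrame({'Fire': [1], 'Water': [2], 'Wind': [3], 'Hail': [4], 'Rain': [5]})
df2 = pd.DataFrame({'Fire': [6], 'Rain': [7]})
df_list = [df1, df2]

# Keep only the columns that are both in the frame and in col_list.
sub_df_list = [
    data_frame[data_frame.columns.intersection(col_list)] for data_frame in df_list
]

for sub_df in sub_df_list:
    print(list(sub_df.columns))
```

Columns missing from a frame are simply skipped, so no KeyError is raised and no NaN columns are introduced. The order of the resulting columns may depend on the pandas version, so it is safer not to rely on it.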

Upvotes: 1
