Reputation: 23
I have a dataframe with more than 400 columns, and I'm trying to select a sub-dataframe with about half of them based on some conditions. I have already stored the filtered column names in a list, hoping to use a for loop to iterate through them and build the new dataframe, but I keep getting only the last column in the list.
My list has the 200 filtered columns. I used the following for loop:
for i in list:
    df1 = df[["col1", "col2"]]
    df2 = df[[i]]
    df1 = df1.join(df2)
My final result should consist of "col1", "col2" and the 200 filtered columns, but the output I keep getting has only 3 columns: "col1", "col2", and the last column in the list.
Upvotes: 1
Views: 597
Reputation: 26
This can be done by indexing in Pandas (here's the documentation). Specifically, you can filter the dataframe in one step, without having to loop through it.
The general format is
df[filter condition here]
As an example, let's say we want to find all columns in the following dataframe that contain a 5:
import pandas as pd

d = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 5, 9], 'col4': [10, 11, 12], 'col5': [13, 5, 15]}
df = pd.DataFrame(data=d)
df.head()
Then we apply the filter condition to the dataframe and look at the output:
filter_condition = df.columns[df.isin([5]).any()]
print(filter_condition)
new_df = df[filter_condition]
new_df.head()
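For what it's worth, the same selection can be collapsed into a single .loc call with a boolean mask over the columns; a minimal sketch, reusing the df from above:

import pandas as pd

d = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 5, 9], 'col4': [10, 11, 12], 'col5': [13, 5, 15]}
df = pd.DataFrame(data=d)

# df.isin([5]).any() is a boolean Series indexed by column name,
# True for every column that contains a 5
mask = df.isin([5]).any()

# .loc accepts that boolean mask on the column axis, so this selects
# col2, col3 and col5 without building the index first
print(df.loc[:, mask])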
If you know the positions of the start and end columns, you can use the : operator to choose all the columns between them. For example, to choose all the columns between the 1st and the 5th (that is, the 2nd through 4th), you can use
df.iloc[:, 1:4]
to get

   col2  col3  col4
0     4     7    10
1     5     5    11
2     6     9    12
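If you know the column labels rather than their positions, .loc slices by name instead; note that unlike .iloc, a .loc label slice includes the end label. A small sketch with the same df:

import pandas as pd

d = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 5, 9], 'col4': [10, 11, 12], 'col5': [13, 5, 15]}
df = pd.DataFrame(data=d)

# Label-based slicing: the end label 'col4' is included in the result
print(df.loc[:, 'col2':'col4'])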
Upvotes: 0
Reputation: 260300
You should never join columns repeatedly. This is inefficient and will fragment the DataFrame.
Assuming your list is named lst, you should just do:
out = df[['col1', 'col2']+lst]
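As a runnable sketch, with a tiny made-up frame and a two-element lst standing in for your 400-column dataframe and 200 filtered names:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
                   'colA': [5, 6], 'colB': [7, 8]})
lst = ['colA', 'colB']  # stand-in for your 200 filtered column names

# One list of labels, one indexing operation, no loop
out = df[['col1', 'col2'] + lst]
print(out.columns.tolist())  # ['col1', 'col2', 'colA', 'colB']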
Your code failed because you reassign df1 to just ["col1", "col2"] at the start of every iteration, throwing away the column joined in the previous pass, so only the last one survives. Moving that assignment out of the loop would have worked, but it is really not a good approach:
df1 = df[["col1", "col2"]]
for i in lst:
    df2 = df[[i]]
    df1 = df1.join(df2)
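If you ever do have to assemble the result from per-column pieces, collect them first and concatenate once with pd.concat, which avoids the repeated-join fragmentation; a sketch, reusing the hypothetical df and lst from above:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
                   'colA': [5, 6], 'colB': [7, 8]})
lst = ['colA', 'colB']

# Build all the pieces, then concatenate along the column axis in one call
pieces = [df[['col1', 'col2']]] + [df[[c]] for c in lst]
out = pd.concat(pieces, axis=1)
print(out.columns.tolist())  # ['col1', 'col2', 'colA', 'colB']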
Upvotes: 0