Reputation: 567
I would like to remove columns whose names have more than 3 characters, or that contain more than 2 identical characters in a row (e.g., aaasdfa, which starts with three consecutive 'a's).
I have tried with
df = df[df.columns[~df.columns.str.split().str.len().ge(4)]]
but it seems that 'Int64Index' object has no attribute 'ge'. I am thinking of using something like this
[x for x in column_name.split() if len(x) > 3]
But I do not have a clear idea of how I should handle the column names. It would also be helpful to select columns with no character repeated more than 3 times.
Upvotes: 0
Views: 129
Reputation: 365
You can use itertools.groupby, an iterator that returns consecutive keys and groups from the iterable (your column name is of type str and is therefore iterable):
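from itertools import groupby

def consecutive_ge(column_name, n):
    # Length of the longest run of one repeated character (0 for an empty name)
    runs = (sum(1 for _ in group) for _, group in groupby(column_name))
    max_cons_chars = max(runs, default=0)
    return max_cons_chars >= n

# n=2 drops names with any doubled character; use n=3 to drop only runs of three or more
valid_cols = [not consecutive_ge(col, n=2) for col in df.columns]
df = df[df.columns[valid_cols]]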
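If you also want the length condition from the question, both checks can go in one helper. This is only a rough sketch: keep_column, max_len and max_run are names I made up for illustration, the sample column names are invented, and I am assuming "more than 3 characters" refers to the total length of the name. The run check uses a regex backreference instead of groupby, and with max_run=2 it behaves like n=3 in the function above.

```python
import re

def keep_column(name, max_len=3, max_run=2):
    """Hypothetical helper: True if the column name passes both checks."""
    name = str(name)  # column labels are not always strings
    too_long = len(name) > max_len
    # (.) captures one character; \1{max_run,} needs at least max_run more
    # copies of it, i.e. a run of max_run + 1 identical characters or longer.
    has_long_run = re.search(r"(.)\1{%d,}" % max_run, name) is not None
    return not (too_long or has_long_run)

columns = ["a", "abc", "abcd", "aaa", "aab"]  # made-up names for illustration
kept = [c for c in columns if keep_column(c)]
print(kept)  # ['a', 'abc', 'aab']
```

On a real frame, the same filter would be df = df[[c for c in df.columns if keep_column(c)]].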
Upvotes: 2