Data cleaning: column names with a certain number of characters

Question

I would like to remove columns whose names have more than 3 characters, or in which the same character is repeated more than 2 times in a row (e.g., aaasdfa, which has three consecutive 'a's).

I have tried with

df = df[df.columns[~df.columns.str.split().str.len().ge(4)]]

but it fails with the error: 'Int64Index' object has no attribute 'ge'. I am thinking of using something like this:

[x for x in column_name.split() if len(x) > 3]

But I do not have a clear idea of how to handle the column names. This would also help me select columns with no character repeated more than 3 times.

Yulian · Accepted Answer

You can use itertools.groupby, an iterator that returns consecutive keys and groups from an iterable (your column name is a str and is therefore iterable).
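To make the grouping behaviour concrete, here is a minimal sketch of what groupby yields for the example name from the question. The variable names (name, runs, longest_run) are only illustrative and are not part of the original answer.

```python
from itertools import groupby

name = "aaasdfa"  # example column name taken from the question

# Each run of identical consecutive characters becomes one (key, group) pair.
runs = [(char, len(list(group))) for char, group in groupby(name)]
print(runs)  # [('a', 3), ('s', 1), ('d', 1), ('f', 1), ('a', 1)]

# The longest run is what decides whether the name has "too many" repeats.
longest_run = max(length for _, length in runs)
print(longest_run)  # 3
```

The function below applies the same idea to every column name.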

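from itertools import groupby

def consecutive_ge(column_name, n):
    # Length of the longest run of identical consecutive characters;
    # default=0 avoids a ValueError on an empty column name.
    max_cons_chars = max(
        (sum(1 for _ in group) for _, group in groupby(column_name)),
        default=0,
    )
    return max_cons_chars >= n

# n=2 drops any name containing a doubled character; use n=3 to drop only
# names with three or more identical characters in a row (e.g. 'aaasdfa').
valid_cols = [not consecutive_ge(col, n=2) for col in df.columns]

# Keep only the columns whose names passed the check.
df = df[df.columns[valid_cols]]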
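The answer above covers only the repeated-character rule. As a rough sketch of how both rules from the question could be combined (name longer than 3 characters, or a run of 3 or more identical characters), the following uses a regular expression instead of groupby. The helper keep_column, the constant TRIPLE_RUN, and the sample column names are hypothetical and chosen only for illustration.

```python
import re

# Matches any character followed by two or more copies of itself (a run of 3+).
TRIPLE_RUN = re.compile(r"(.)\1{2,}")

def keep_column(name, max_len=3, pattern=TRIPLE_RUN):
    """Return True if the name is short enough and contains no long run."""
    name = str(name)  # column labels are not always strings
    return len(name) <= max_len and pattern.search(name) is None

columns = ["abc", "aaa", "aaasdfa", "id", "xy1"]  # hypothetical column names
kept = [col for col in columns if keep_column(col)]
print(kept)  # ['abc', 'id', 'xy1']
```

On a DataFrame, the same predicate can serve as a boolean mask, for example df.loc[:, [keep_column(c) for c in df.columns]]. As for the attempt in the question, str.split() splits on whitespace, so the following .str.len() counts words and not characters; df.columns.str.len() > 3 is likely closer to what was intended for the length check.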
