Pandas count number of rows in columns that start with upper case of are of specific length

Question

I have a dataframe:

data = [['male', 'A', NaN, something, something], ['Female', 'C', something, something, NaN], ['Male', 'B', something, something, something], ['female', 'Ca', something, something, NaN]] 

df = pd.DataFrame(data, columns = ['Gender', 'Group', 'X', 'Y', 'Z'])

I want to be able to create another data frame which calculates 4 things across both the Gender and Group columns and 0 for other columns: the number of missing values, the percentage of those missing values, the number of columns that have an uppercase gender or a group longer than one letter, and the percentage of those.

I can do the first part:

missing_values = df.isnull().sum()
percent_missing = df.isnull().sum() * 100 / len(df)

missing_values_df = pd.DataFrame({'Features': df.columns,
                                  'Missing Values (MV)': missing_values,
                                  '% of MV (MV/n)': percent_missing})

How would I calculate how many are incorrectly upper case gender and have more than one letter as a group?

I know that the following calculates the number of rows that have a gender that starts with an upper case, but I can't seem to include that into my missing_values_df.

df['Uppercase'] = df['Gender'].str.findall(r'[A-Z]').str.len()

My expected output based on this example would be:

Features, Missing Values, % of missing values, wrong case or group length, % of wrong data

gender         0                    0                         2            0.5
group          0                    0                         1            0.25
X              1                    0.25                      0            0
Y              0                    0                         0            0
Z              2                    0.5                       0            0

Edit - more things I've tried:

missing_values_df['wrong case or group length'] = (titanic_data.Gender.apply(lambda x: x.islower() == False).astype(int).sum())

This calculates the sum for the gender column but includes it as a row for every column in the missing values df. Is there a way to only associate this with the gender row and everything else stay as 0?

Pandas count number of rows in columns that start with upper case of are of specific length

Answers (1)

Related Questions