user47467
user47467

Reputation: 1093

Pandas count number of rows in columns that start with upper case of are of specific length

I have a dataframe:

data = [['male', 'A', NaN, something, something], ['Female', 'C', something, something, NaN], ['Male', 'B', something, something, something], ['female', 'Ca', something, something, NaN]] 

df = pd.DataFrame(data, columns = ['Gender', 'Group', 'X', 'Y', 'Z']) 

I want to be able to create another data frame which calculates 4 things across both the Gender and Group columns and 0 for other columns: the number of missing values, the percentage of those missing values, the number of columns that have an uppercase gender or a group longer than one letter, and the percentage of those.

I can do the first part:

missing_values = df.isnull().sum()
percent_missing = df.isnull().sum() * 100 / len(df)

missing_values_df = pd.DataFrame({'Features': df.columns,
                                  'Missing Values (MV)': missing_values,
                                  '% of MV (MV/n)': percent_missing})

How would I calculate how many are incorrectly upper case gender and have more than one letter as a group?

I know that the following calculates the number of rows that have a gender that starts with an upper case, but I can't seem to include that into my missing_values_df.

df['Uppercase'] = df['Gender'].str.findall(r'[A-Z]').str.len()

My expected output based on this example would be:

Features, Missing Values, % of missing values, wrong case or group length, % of wrong data

gender         0                    0                         2            0.5
group          0                    0                         1            0.25
X              1                    0.25                      0            0
Y              0                    0                         0            0
Z              2                    0.5                       0            0

Edit - more things I've tried:

missing_values_df['wrong case or group length'] = (titanic_data.Gender.apply(lambda x: x.islower() == False).astype(int).sum())

This calculates the sum for the gender column but includes it as a row for every column in the missing values df. Is there a way to only associate this with the gender row and everything else stay as 0?

Upvotes: 0

Views: 387

Answers (1)

CAPSLOCK
CAPSLOCK

Reputation: 6483

First thing first. I would use str.islower() instead of str.findall(r'[A-Z]').str.len() as the latter would return n in case you have n CAPS in one string.
That said, below is a step by step solution that will make it easier to follow the logic. You can easily wrap it in fewer lines of code:

data = [['male', 'A'], ['Female', 'C'], ['Male', 'B'], ['female', 'Ca']] 

df = pd.DataFrame(data, columns = ['Gender', 'Group']) 

#df['Uppercase'] = (df['Gender'].str.islower()==False).astype(int) #in case you need this
#df['length']= (df['Group'].str.len()>1).astype(int)               #in case you need this

#Here we follow your logic and simply complete the df with the last two columns

missing_values = df.isnull().sum()
percent_missing = df.isnull().sum() * 100 / len(df)

missing_values_df = pd.DataFrame({'Features': df.columns.tolist(),
                                     'Missing Values (MV)': missing_values,
                                     '% of MV (MV/n)': percent_missing,
                                     'wrong case or group length':0,
                                     '% of wrong data':0})


#create a list
new_col=[(df['Gender'].str.islower()==False).astype(int).sum()]
#add an element to the list
new_col.extend([(df['Group'].str.len()>1).astype(int).sum()])

# the final column is simply the previous list divided by len(df)
new_col_ratio= [x / len(df) for x in new_col]

# add the columns to the missing_values_df dataframe
missing_values_df['wrong case or group length']=new_col
missing_values_df['% of wrong data']=new_col_ratio

The following modification works with OP updated MWE:

missing_values_df.loc[missing_values_df['Features']=='Gender','wrong case or group length']=(df['Gender'].str.islower()==False).astype(int).sum()

missing_values_df.loc[missing_values_df['Features']=='Group','wrong case or group length']=(df['Group'].str.len()>1).astype(int).sum()

new_col_ratio= [x / len(df) for x in missing_values_df['wrong case or group length']]

missing_values_df['% of wrong data']=new_col_ratio

Upvotes: 1

Related Questions