Reputation: 1093
I have a dataframe:
data = [['male', 'A', NaN, something, something], ['Female', 'C', something, something, NaN], ['Male', 'B', something, something, something], ['female', 'Ca', something, something, NaN]]
df = pd.DataFrame(data, columns = ['Gender', 'Group', 'X', 'Y', 'Z'])
I want to be able to create another data frame which calculates 4 things across both the Gender and Group columns and 0 for other columns: the number of missing values, the percentage of those missing values, the number of columns that have an uppercase gender or a group longer than one letter, and the percentage of those.
I can do the first part:
missing_values = df.isnull().sum()
percent_missing = df.isnull().sum() * 100 / len(df)
missing_values_df = pd.DataFrame({'Features': df.columns,
'Missing Values (MV)': missing_values,
'% of MV (MV/n)': percent_missing})
How would I calculate how many are incorrectly upper case gender and have more than one letter as a group?
I know that the following calculates the number of rows that have a gender that starts with an upper case, but I can't seem to include that into my missing_values_df
.
df['Uppercase'] = df['Gender'].str.findall(r'[A-Z]').str.len()
My expected output based on this example would be:
Features, Missing Values, % of missing values, wrong case or group length, % of wrong data
gender 0 0 2 0.5
group 0 0 1 0.25
X 1 0.25 0 0
Y 0 0 0 0
Z 2 0.5 0 0
Edit - more things I've tried:
missing_values_df['wrong case or group length'] = (titanic_data.Gender.apply(lambda x: x.islower() == False).astype(int).sum())
This calculates the sum for the gender column but includes it as a row for every column in the missing values df. Is there a way to only associate this with the gender row and everything else stay as 0?
Upvotes: 0
Views: 387
Reputation: 6483
First thing first.
I would use str.islower()
instead of str.findall(r'[A-Z]').str.len()
as the latter would return n
in case you have n
CAPS in one string.
That said, below is a step by step solution that will make it easier to follow the logic. You can easily wrap it in fewer lines of code:
data = [['male', 'A'], ['Female', 'C'], ['Male', 'B'], ['female', 'Ca']]
df = pd.DataFrame(data, columns = ['Gender', 'Group'])
#df['Uppercase'] = (df['Gender'].str.islower()==False).astype(int) #in case you need this
#df['length']= (df['Group'].str.len()>1).astype(int) #in case you need this
#Here we follow your logic and simply complete the df with the last two columns
missing_values = df.isnull().sum()
percent_missing = df.isnull().sum() * 100 / len(df)
missing_values_df = pd.DataFrame({'Features': df.columns.tolist(),
'Missing Values (MV)': missing_values,
'% of MV (MV/n)': percent_missing,
'wrong case or group length':0,
'% of wrong data':0})
#create a list
new_col=[(df['Gender'].str.islower()==False).astype(int).sum()]
#add an element to the list
new_col.extend([(df['Group'].str.len()>1).astype(int).sum()])
# the final column is simply the previous list divided by len(df)
new_col_ratio= [x / len(df) for x in new_col]
# add the columns to the missing_values_df dataframe
missing_values_df['wrong case or group length']=new_col
missing_values_df['% of wrong data']=new_col_ratio
The following modification works with OP updated MWE:
missing_values_df.loc[missing_values_df['Features']=='Gender','wrong case or group length']=(df['Gender'].str.islower()==False).astype(int).sum()
missing_values_df.loc[missing_values_df['Features']=='Group','wrong case or group length']=(df['Group'].str.len()>1).astype(int).sum()
new_col_ratio= [x / len(df) for x in missing_values_df['wrong case or group length']]
missing_values_df['% of wrong data']=new_col_ratio
Upvotes: 1