Reputation: 123
I am having trouble creating a pivot table in Python 3.7.3 of counts of strings found within a dataframe (df1) and aligning the counts to columns of the string values in another dataframe (df2). How do I go about filling in my second dataframe with the total counts of the column headers (the strings) based on the values in df1?
I have a dataframe (df1) of 145 columns. They are named 'Unnamed: 0' - 'Unnamed: 144'. Within 'Unnamed: 0' is a non-unique list of names and then every column beyond contains string values that repeat all throughout the dataframe. I've been able to create a second dataframe (df2) that contains a 'name' column with unique name values from 'Unnamed: 0'. It also contains about 80 other columns which are the unique values found in 'Unnamed: 1' - 'Unnamed: 144' of df1. Below you will see my steps as well as sample dataframes (since I'm not going post a 145 column df) of what I'm working with and my expected output. Any help is always appreciated.
#imported df1 using pandas
df1 = pd.read_excel('path/to/file')
## Creation of df2
#created a temporary df
tmp_df = df1.drop(columns = ['Unnamed: 0'])
array = tmp_df['Unnamed: 1'] #created array to find unique values from df1
for col in tmp_df.columns:
array = array.append(tmp_df[col]).reset_index(drop = True)
#created unique value arrays for name and the df1 values
array2 = array.unique()
names = df['Unnamed: 0'].unique()
#Create temp dfs to build df2
names_df = pd.DataFrame(names)
names.columns = ['name']
strvals_df = pd.DataFrame(columns = array2)
df2 = pd.concat([names_df.reset_index(drop=True), strvals_df], axis=1)
df2 = df2.loc[:, df2.columns.notnull()]
Example dataframes:
df1
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
name1 str1 str1 str2 str2 str3
name1 str4 NaN str3 str1 str4
name1 str1 str1 str2 str2 str3
name2 str4 NaN str3 str1 str4
name3 str1 str1 str2 str2 str3
name3 str4 NaN str3 str1 str4
name4 str1 str1 str2 str2 str3
df2 (current)
name str1 str2 str3 str4
name1 NaN NaN NaN NaN
name2 NaN NaN NaN NaN
name3 NaN NaN NaN NaN
name4 NaN NaN NaN NaN
df2 (desired output, counts of str values from df1 in columns)
name str1 str2 str3 str4
name1 5 4 3 2
name2 1 0 1 2
name3 3 2 2 2
name4 2 2 1 0
I've tried looping through the dataframes in conjunction with nested for loops but that didn't work. I also tried the pivot_table command for pandas but I keep getting a table that only counts the number of times a name shows up. I've even tried making two huge repeating arrays in index order and using the pivot_table function but that didn't work either.
Upvotes: 1
Views: 1334
Reputation: 25259
It is simple with str.get_dummies
:
df.set_index('Unnamed: 0').stack().str.get_dummies().sum(level=0)
Out[1515]:
str1 str2 str3 str4
Unnamed: 0
name1 5 4 3 2
name2 1 0 1 2
name3 3 2 2 2
name4 2 2 1 0
Upvotes: 0
Reputation: 51165
stack
first, you don't need df2
m = (df1.set_index('Unnamed: 0').stack()
.rename_axis(['names', 'values']).rename('columns').reset_index())
m.pivot_table('values', 'names', 'columns', aggfunc='count', fill_value=0)
columns str1 str2 str3 str4
names
name1 5 4 3 2
name2 1 0 1 2
name3 3 2 2 2
name4 2 2 1 0
Upvotes: 1