statsguy
statsguy

Reputation: 123

How to generate a Python Pivot Table of counts of strings in the Pandas cells?

I am having trouble creating a pivot table in Python 3.7.3 of counts of strings found within a dataframe (df1) and aligning the counts to columns of the string values in another dataframe (df2). How do I go about filling in my second dataframe with the total counts of the column headers (the strings) based on the values in df1?

I have a dataframe (df1) of 145 columns. They are named 'Unnamed: 0' - 'Unnamed: 144'. Within 'Unnamed: 0' is a non-unique list of names and then every column beyond contains string values that repeat all throughout the dataframe. I've been able to create a second dataframe (df2) that contains a 'name' column with unique name values from 'Unnamed: 0'. It also contains about 80 other columns which are the unique values found in 'Unnamed: 1' - 'Unnamed: 144' of df1. Below you will see my steps as well as sample dataframes (since I'm not going post a 145 column df) of what I'm working with and my expected output. Any help is always appreciated.

#imported df1 using pandas
df1 = pd.read_excel('path/to/file')

## Creation of df2
#created a temporary df
tmp_df = df1.drop(columns = ['Unnamed: 0'])

array = tmp_df['Unnamed: 1'] #created array to find unique values from df1
for col in tmp_df.columns:
    array = array.append(tmp_df[col]).reset_index(drop = True)

#created unique value arrays for name and the df1 values
array2 = array.unique()
names = df['Unnamed: 0'].unique()

#Create temp dfs to build df2
names_df = pd.DataFrame(names)
names.columns = ['name']
strvals_df = pd.DataFrame(columns = array2)
df2 = pd.concat([names_df.reset_index(drop=True), strvals_df], axis=1)
df2 = df2.loc[:, df2.columns.notnull()]

Example dataframes:

df1
Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  Unnamed: 5
name1       str1        str1        str2        str2        str3
name1       str4        NaN         str3        str1        str4
name1       str1        str1        str2        str2        str3
name2       str4        NaN         str3        str1        str4
name3       str1        str1        str2        str2        str3
name3       str4        NaN         str3        str1        str4
name4       str1        str1        str2        str2        str3

df2 (current)
name    str1    str2    str3    str4
name1   NaN     NaN     NaN     NaN
name2   NaN     NaN     NaN     NaN
name3   NaN     NaN     NaN     NaN   
name4   NaN     NaN     NaN     NaN 

df2 (desired output, counts of str values from df1 in columns)
name    str1    str2    str3    str4
name1   5       4       3       2
name2   1       0       1       2
name3   3       2       2       2   
name4   2       2       1       0

I've tried looping through the dataframes in conjunction with nested for loops but that didn't work. I also tried the pivot_table command for pandas but I keep getting a table that only counts the number of times a name shows up. I've even tried making two huge repeating arrays in index order and using the pivot_table function but that didn't work either.

Upvotes: 1

Views: 1334

Answers (2)

Andy L.
Andy L.

Reputation: 25259

It is simple with str.get_dummies:

df.set_index('Unnamed: 0').stack().str.get_dummies().sum(level=0)


Out[1515]:
                str1  str2  str3  str4
Unnamed: 0
name1          5     4     3     2
name2          1     0     1     2
name3          3     2     2     2
name4          2     2     1     0

Upvotes: 0

user3483203
user3483203

Reputation: 51165

stack first, you don't need df2


m = (df1.set_index('Unnamed: 0').stack()
        .rename_axis(['names', 'values']).rename('columns').reset_index())

m.pivot_table('values', 'names', 'columns', aggfunc='count', fill_value=0)

columns  str1  str2  str3  str4
names
name1       5     4     3     2
name2       1     0     1     2
name3       3     2     2     2
name4       2     2     1     0

Upvotes: 1

Related Questions