SMS
SMS

Reputation: 382

Count combination of index pairs in a pivot table

I have a following question, hoping you can support me. Suppose you have following frame

df

  Name  Number  File_name  Frequency
0    A  item 1     path 1          2
1    A  item 1     path 2          2
2    A  item 2     path 1          4
3    A  item 2     path 2          4
4    A  item 3     path 1          1
5    A  item 2     path 3          4
6    A  item 2     path 4          4

my goal is to create a pivot table with following shape

df= Name      Number    File_name      Frequency   
    A         item 1     path 1            2 
                         path 2
              item 2     path 1            4 
                         path 2
                         path 3
                         path 4
              item 3     path 1            1

I tried the following way, since I do not know how to count the combination of an index pair for a multi-dimensional pivot table

df=pd.read_excel(path)
df["Unique ID"]=df["Name"]+" "+df["Number"] #Create an additional column to calculate frequency
df['frequency'] = df['Unique ID'].map(df['Unique ID'].value_counts())

pivot=pd.pivot_table(df,index=["Name",
                                "Number","File_name"],
                     values="frequency",fill_value=0)

print(pivot)

pivot= Name      Number    File_name      Frequency   
    A         item 1     path 1            2 
                         path 2            2
              item 2     path 1            4 
                         path 2            4
                         path 3            4
                         path 4            4
              item 3     path 1            1

My question is: How can I, should I change my code in order to get the desired outcome, i.e. not to have the frequency values as duplicates, but rather also as an index with unique values

Upvotes: 0

Views: 182

Answers (3)

SomeDude
SomeDude

Reputation: 14238

You could use:

df = df.sort_values(df.columns.drop('Frequency').tolist(), ignore_index=True)
df2 = (df == df.shift()).replace(True, '')
df2 = df2.mask(df2 != '').fillna(df)

print(df2):

  Name  Number File_name Frequency
0    A  item 1    path 1         2
1                 path 2          
2       item 2    path 1         4
3                 path 2          
4                 path 3          
5                 path 4          
6       item 3    path 1         1

Explanation:

a) df2 = (df == df.shift()) which gives:

    Name  Number  File_name  Frequency
0  False   False      False      False
1   True    True      False       True
2   True   False      False      False
3   True    True      False       True
4   True    True      False       True
5   True    True      False       True
6   True   False      False      False

b) Replace True values with empty string ''

df2 = df2.replace(True, '') which gives:

    Name Number  File_name Frequency
0  False  False      False     False
1                    False          
2         False      False     False
3                    False          
4                    False          
5                    False          
6         False      False     False

c) Mask non-empty values with NaN

df2.mask(df2 != '')

  Name Number  File_name Frequency
0  NaN    NaN        NaN       NaN
1                    NaN          
2         NaN        NaN       NaN
3                    NaN          
4                    NaN          
5                    NaN          
6         NaN        NaN       NaN

d) fillna of this df with the original one

df2.fillna(df)

  Name  Number File_name Frequency
0    A  item 1    path 1         2
1                 path 2          
2       item 2    path 1         4
3                 path 2          
4                 path 3          
5                 path 4          
6       item 3    path 1         1

Upvotes: 0

BeRT2me
BeRT2me

Reputation: 13242

What you want doesn't make a lot of sense, if you want something to act like an index... make it an index, otherwise you're just manually creating a non-functional, visually-pleasing, output. (Which Mustafa does a good job of executing.)

df['blank'] = np.nan
df = df.set_index(['Name', 'Number', 'Frequency', 'File_name']).sort_index()
print(df)

# Output:
                                 blank
Name Number Frequency File_name
A    item 1 2         path 1       NaN
                      path 2       NaN
     item 2 4         path 1       NaN
                      path 2       NaN
                      path 3       NaN
                      path 4       NaN
     item 3 1         path 1       NaN

Upvotes: 1

Mustafa Aydın
Mustafa Aydın

Reputation: 18315

i start with this frame

In [485]: df
Out[485]:
  Name Number Filename
0    A  item1    path1
1    A  item1    path2
2    A  item2    path1
3    A  item2    path2
4    A  item3    path1
5    A  item2    path3
6    A  item2    path4

The desired frequency is groupby & transform("count"), similar to what you did

In [491]: g = df.groupby(["Name", "Number"])["Filename"]

In [492]: g.transform("count")
Out[492]:
0    2
1    2
2    4
3    4
4    1
5    4
6    4
Name: Filename, dtype: int64

Cool but need to mask the dupes here. And not globally but locally and only retain the first. So here's a trick: if i do g.cumcount(), it will give 0, 1, ... per group. Then I use 0's specialness to get the mask

In [493]: g.cumcount()
Out[493]:
0    0
1    1
2    0
3    1
4    0
5    2
6    3
dtype: int64

In [494]: ~g.cumcount().astype(bool)
Out[494]:
0     True
1    False
2     True
3    False
4     True
5    False
6    False
dtype: bool

Then i can multiply the counts with this, replace 0s (arising due to False's in multiplication) with empty string, set a multiindex & sort it and done:

In [496]: counts = g.transform("count")

In [497]: first_per_gr_mask = ~g.cumcount().astype(bool)

In [498]: freq = counts.mul(first_per_gr_mask).replace(0, "")

In [499]: freq
Out[499]:
0    2
1
2    4
3
4    1
5
6
dtype: object

In [500]: (df.assign(Frequency=freq)
     ...:    .set_index(["Name", "Number", "Filename"])
     ...:    .sort_index())
Out[500]:
                     Frequency
Name Number Filename
A    item1  path1            2
            path2
     item2  path1            4
            path2
            path3
            path4
     item3  path1            1

Upvotes: 2

Related Questions