Reputation: 382
I have the following question and hope you can support me. Suppose you have the following frame
df
Name Number File_name Frequency
0 A item 1 path 1 2
1 A item 1 path 2 2
2 A item 2 path 1 4
3 A item 2 path 2 4
4 A item 3 path 1 1
5 A item 2 path 3 4
6 A item 2 path 4 4
My goal is to create a pivot table with the following shape:
df= Name Number File_name Frequency
A item 1 path 1 2
path 2
item 2 path 1 4
path 2
path 3
path 4
item 3 path 1 1
I tried the following approach, since I do not know how to count the combinations of an index pair for a multi-dimensional pivot table:
import pandas as pd

df = pd.read_excel(path)
# Create an additional column to calculate the frequency of each Name/Number pair
df["Unique ID"] = df["Name"] + " " + df["Number"]
df["Frequency"] = df["Unique ID"].map(df["Unique ID"].value_counts())
pivot = pd.pivot_table(df, index=["Name", "Number", "File_name"],
                       values="Frequency", fill_value=0)
print(pivot)
pivot= Name Number File_name Frequency
A item 1 path 1 2
path 2 2
item 2 path 1 4
path 2 4
path 3 4
path 4 4
item 3 path 1 1
My question is: how should I change my code to get the desired outcome, i.e. not to have the frequency values duplicated on every row, but rather shown only once per unique (Name, Number) combination, as if it were part of the index?
Upvotes: 0
Views: 182
Reputation: 14238
You could use:
df = df.sort_values(df.columns.drop('Frequency').tolist(), ignore_index=True)
df2 = (df == df.shift()).replace(True, '')
df2 = df2.mask(df2 != '').fillna(df)
print(df2) gives:
Name Number File_name Frequency
0 A item 1 path 1 2
1 path 2
2 item 2 path 1 4
3 path 2
4 path 3
5 path 4
6 item 3 path 1 1
Explanation:
a) df2 = (df == df.shift())
which gives:
Name Number File_name Frequency
0 False False False False
1 True True False True
2 True False False False
3 True True False True
4 True True False True
5 True True False True
6 True False False False
b) Replace the True values with the empty string ''
df2 = df2.replace(True, '')
which gives:
Name Number File_name Frequency
0 False False False False
1 False
2 False False False
3 False
4 False
5 False
6 False False False
c) Mask non-empty values with NaN
df2.mask(df2 != '')
Name Number File_name Frequency
0 NaN NaN NaN NaN
1 NaN
2 NaN NaN NaN
3 NaN
4 NaN
5 NaN
6 NaN NaN NaN
d) Fill the NaN values of this frame with the original one
df2.fillna(df)
Name Number File_name Frequency
0 A item 1 path 1 2
1 path 2
2 item 2 path 1 4
3 path 2
4 path 3
5 path 4
6 item 3 path 1 1
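For reference, here is a minimal self-contained sketch of the full approach, with the sample frame reconstructed from the question (the column names and values are taken from there):

import pandas as pd

df = pd.DataFrame({
    "Name": ["A"] * 7,
    "Number": ["item 1", "item 1", "item 2", "item 2", "item 3", "item 2", "item 2"],
    "File_name": ["path 1", "path 2", "path 1", "path 2", "path 1", "path 3", "path 4"],
    "Frequency": [2, 2, 4, 4, 1, 4, 4],
})

# Sort so that rows with repeated values sit directly below each other
df = df.sort_values(df.columns.drop('Frequency').tolist(), ignore_index=True)

# Blank out every cell that repeats the cell directly above it
df2 = (df == df.shift()).replace(True, '')
df2 = df2.mask(df2 != '').fillna(df)
print(df2)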
Upvotes: 0
Reputation: 13242
What you want doesn't make a lot of sense. If you want something to act like an index, make it an index; otherwise you're just manually creating a non-functional, visually pleasing output. (Which Mustafa does a good job of executing.)
import numpy as np

df['blank'] = np.nan  # dummy value column so every real column can go into the index
df = df.set_index(['Name', 'Number', 'Frequency', 'File_name']).sort_index()
print(df)
# Output:
blank
Name Number Frequency File_name
A item 1 2 path 1 NaN
path 2 NaN
item 2 4 path 1 NaN
path 2 NaN
path 3 NaN
path 4 NaN
item 3 1 path 1 NaN
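If Frequency is not already a column in the frame, a sketch under that assumption (the count is derived with groupby/transform, mirroring what the question computes via value_counts):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A"] * 7,
    "Number": ["item 1", "item 1", "item 2", "item 2", "item 3", "item 2", "item 2"],
    "File_name": ["path 1", "path 2", "path 1", "path 2", "path 1", "path 3", "path 4"],
})

# Count how many files each (Name, Number) pair has
df['Frequency'] = df.groupby(['Name', 'Number'])['File_name'].transform('count')

# Dummy value column, then move everything else into the index
df['blank'] = np.nan
df = df.set_index(['Name', 'Number', 'Frequency', 'File_name']).sort_index()
print(df)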
Upvotes: 1
Reputation: 18315
I start with this frame:
In [485]: df
Out[485]:
Name Number Filename
0 A item1 path1
1 A item1 path2
2 A item2 path1
3 A item2 path2
4 A item3 path1
5 A item2 path3
6 A item2 path4
The desired frequency is a groupby plus transform("count"), similar to what you did:
In [491]: g = df.groupby(["Name", "Number"])["Filename"]
In [492]: g.transform("count")
Out[492]:
0 2
1 2
2 4
3 4
4 1
5 4
6 4
Name: Filename, dtype: int64
Cool, but we need to mask the dupes here, and not globally but locally within each group, retaining only the first occurrence. So here's a trick: if I do g.cumcount(), it will give 0, 1, ... per group. Then I use 0's specialness (it is the only falsy value) to get the mask:
In [493]: g.cumcount()
Out[493]:
0 0
1 1
2 0
3 1
4 0
5 2
6 3
dtype: int64
In [494]: ~g.cumcount().astype(bool)
Out[494]:
0 True
1 False
2 True
3 False
4 True
5 False
6 False
dtype: bool
Then I can multiply the counts with this mask, replace the 0s (arising from the Falses in the multiplication) with the empty string, set a MultiIndex, sort it, and done:
In [496]: counts = g.transform("count")
In [497]: first_per_gr_mask = ~g.cumcount().astype(bool)
In [498]: freq = counts.mul(first_per_gr_mask).replace(0, "")
In [499]: freq
Out[499]:
0 2
1
2 4
3
4 1
5
6
dtype: object
In [500]: (df.assign(Frequency=freq)
...: .set_index(["Name", "Number", "Filename"])
...: .sort_index())
Out[500]:
Frequency
Name Number Filename
A item1 path1 2
path2
item2 path1 4
path2
path3
path4
item3 path1 1
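Put together, the same steps as one compact pipeline; a sketch assuming the frame from the start of this answer (note the column is spelled Filename here, not File_name):

import pandas as pd

df = pd.DataFrame({
    "Name": ["A"] * 7,
    "Number": ["item1", "item1", "item2", "item2", "item3", "item2", "item2"],
    "Filename": ["path1", "path2", "path1", "path2", "path1", "path3", "path4"],
})

g = df.groupby(["Name", "Number"])["Filename"]

# Per-group counts, kept only on the first row of each group, blank elsewhere
freq = g.transform("count").mul(~g.cumcount().astype(bool)).replace(0, "")

result = (df.assign(Frequency=freq)
            .set_index(["Name", "Number", "Filename"])
            .sort_index())
print(result)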
Upvotes: 2