Reputation: 18810
I have following data frame. The data frame is constructed by reading a csv file. Its a large data set but for this question purpose I have used 15 rows from the data set as an example.
user_id contrib_count total_min_length group_space expert_level
0 23720 108 1112696 0 l-2
1 23720 13 442059 1 l-2
2 23720 12 32180 2 l-2
3 23720 2 20177 3 l-2
4 23720 1 1608 10 l-2
5 1265184 71 260186 0 l-G
6 1265184 10 3466 2 l-G
7 1265184 1 12081 4 l-G
8 513380 112 1049311 0 l-4
9 513380 1 97 1 l-4
10 513380 113 361980 2 l-4
11 513380 19 1198323 3 l-4
12 513380 2 88301 4 l-4
13 20251 705 17372707 0 l-G
14 20251 103 2327178 1 l-G
Expected Results After pivot what I want is following data frame:
group_space 0 1 2 3 4 5 6 7 8 9 10 expert_level
user_id
20251 705 103 68 24 18 2 6 NaN NaN 5 22 l-G
23720 108 13 12 2 NaN NaN NaN NaN NaN NaN 1 l-2
Reason I am doing this is once I do this I can use this for a prediction task where expert_level
as label data.
So far I have done following to to build the above matrix but I am unable to get the expert_level
column as shown after the pivot.
This is what I have done:
class GroupAnalysis():
def __init__(self):
self.df = None
self.filelocation = '~/somelocation/x.csv'
def pivot_dataframe(self):
raw_df = pd.read_csv(self.filelocation)
self.df = raw_df[(raw_df['group_space'] < 11)]
self.df.set_index(['user_id', 'group_space'], inplace=True)
self.df = self.df['contrib_count'].unstack()
By doing this I get:
group_space 0 1 2 3 4 5 6 7 8 9 10
user_id
20251 705 103 68 24 18 2 6 NaN NaN 5 22
23720 108 13 12 2 NaN NaN NaN NaN NaN NaN 1
As you can see I am missing the expert_level
column at the end. So the question is How can I get above data frame with the expert_level as I shown in my "Expected Results"?
Upvotes: 6
Views: 8419
Reputation: 571
When you unstacked, you were only unstacking a series contrib_count
- expert_level
and total_min_length
were already gone at that point.
Instead of setting index and unstacking, you can just use .pivot()
pivoted = df.pivot('user_id', 'group_space', 'contrib_count')
Then, create a frame with user_id
as the index and expert_level
as a column, getting rid of duplicates:
lookup = df.drop_duplicates('user_id')[['user_id', 'expert_level']]
lookup.set_index(['user_id'], inplace=True)
Then join your pivot
and lookup
result = pivoted.join(lookup)
EDIT:
If you also want to include total_min_length
, you can do a second pivot:
pivoted2 = df.pivot('user_id', 'group_space', 'total_min_length')
and join all three instead of two:
result = pivoted.join(lookup).join(pivoted2, lsuffix="_contrib_count", rsuffix="_total_min_length")
Note that lsuffix
and rsuffix
are required to disambiguate columns, as both pivots have 0, 1, 2, 3, 4
, and 10
columns from your example data.
Upvotes: 3