Reputation: 33410
Suppose I have the following dataframe:
name tags
0 abc (1990) AB|A|BC
1 def (2000) BC|AB
2 yz (1990) A|AB
The values in tags
column are pipe-separated. Further, the format of the values in column name
is like description (year)
. I want to count the number of tags for each year to get a Series
object like this:
year
1990 A 2
AB 2
BC 1
2000 AB 1
BC 1
Or its equivalent as a DataFrame
:
year tags count
0 1990 A 2
1 1990 AB 2
2 1990 BC 1
3 2000 AB 1
4 2000 BC 1
I have a solution for this however since it involves defining a custom function to pass to apply
method, I was wondering whether more compact or efficient solution exists?
Here is my current solution:
years = df['name'].str.slice(start=-5, stop=-1).rename('year')
new_df = df['tags'].str.split('|', expand=True).join(years)
def count_tags(g):
return g.drop(columns=['year']).stack().value_counts()
new_df.groupby('year').apply(count_tags)
which gives:
year
1990 A 2
AB 2
BC 1
2000 AB 1
BC 1
dtype: int64
P.S. It does not matter for me whether the year
is stored as string or integer in the result.
Upvotes: 1
Views: 125
Reputation: 862611
Use:
new_df = (df.assign(year=lambda x: x['name'].str[-5:-1])
.set_index('year')['tags']
.str.split('|', expand=True)
.stack()
.reset_index(name='tags')
.groupby(['year','tags'])
.size()
.reset_index(name='count'))
print (new_df)
year tags count
0 1990 A 2
1 1990 AB 2
2 1990 BC 1
3 2000 AB 1
4 2000 BC 1
Explanation:
assign
for new column with slicingyear
use set_index
split
for DataFrame and reshape by stack
for Series
with MultiIndex
reset_index
groupby
and aggregate size
, last reset_index
for column count
Another solution:
from itertools import chain
tags = df['tags'].str.split('|')
df1 = pd.DataFrame({
'tags' : list(chain.from_iterable(tags.values.tolist())),
'year' : df['name'].str[-5:-1].repeat(tags.str.len())
})
print (df1)
tags year
0 AB 1990
1 A 1990
2 BC 1990
3 BC 2000
4 AB 2000
5 A 1990
6 AB 1990
df2 = df1.groupby(['year','tags']).size().reset_index(name='count')
print (df2)
year tags count
0 1990 A 2
1 1990 AB 2
2 1990 BC 1
3 2000 AB 1
4 2000 BC 1
Explanation:
split
len
repeat
columns and flatteninggroupby
and aggregate size
Upvotes: 2