Reputation: 385
I have a DataFrame with below data-points. Here count
represents number of times articleTag is read. articleTag
is the tag
for articleId; i.e. for articleId 590020
there are four tags A,B,C,D
which is represented as one string.
articleId articleTag count
0 590020 A,B,C,D 2
1 466322 A,B,E 3
2 466322 B 2
3 466322 A 1
I need to find out tag distribution
i.e. for each tag how many number of times it appeared in articles and how many number of times its read.
like in above sample Dataframe.
Tag Present Read
A 3 6
B 3 7
C 1 2
D 1 2
E 1 3
Pl help.
Upvotes: 1
Views: 48
Reputation: 551
df = pd.DataFrame([{"articleId": 590020, "articleTag": "A,B,C,D ", "count": 2},
{"articleId": 590021, "articleTag": "A,B,E", "count": 3},
{"articleId": 466322, "articleTag": "B", "count": 2},
{"articleId": 466322, "articleTag": "A", "count": 1}])
articles = []
for val in df['articleTag'].values:
articles.extend(val.split(','))
unique_articles = list(set(articles))
final_dict = {}
final_dict['article'] = unique_articles
final_count = []
final_read = []
for article in unique_articles:
l = [val for val in df['articleTag'].values if article in val]
l2 = [val[1] for val in zip(df['articleTag'].values,df['count'].values) if article in val[0]]
final_read.append(sum(l2))
final_count.append(len(l))
final_dict['Present'] = final_count
final_dict['Read'] = final_read
pd.DataFrame(final_dict)
Upvotes: 1
Reputation: 7835
You could do something like this:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([{"articleId": 590020, "articleTag": "A,B,C,D ", "count": 6}, {"articleId": 590021, "articleTag": "A,B,E", "count": 3}])
In [3]: df[df.articleTag.str.contains("A")]['count'].sum()
Out: 9
In [4]: len(df[df.articleTag.str.contains("A")])
Out[4]: 2
The first is your "Read" value and the second is your "Present" value.
To find all the various tags, I'd probably do something like this:
In [5]: tag_df = df.articleTag.str.split(',', expand=True)
In [6]: for column in tag_df.columns:
...: print(tag_df[column].unique())
...:
...:
['A']
['B']
['C' 'E']
['D ' None]
Instead of printing them you can add them to a set
and collect all the tags you need to look for that way.
In [7]: unique_tags = set()
In [8]: for column in tag_df.columns:
...: unique_tags |= set(tag_df[column].unique())
...:
...:
...:
...:
In [9]: unique_tags
Out[9]: {'B', 'A', 'C', 'E', None, 'D '}
Of course you'll have to pull out the None vals.
Upvotes: 1