Seema Mudgil
Seema Mudgil

Reputation: 385

Counting number of times A part of String is present is the DataFrame

I have a DataFrame with below data-points. Here count represents number of times articleTag is read. articleTag is the tag for articleId; i.e. for articleId 590020 there are four tags A,B,C,D which is represented as one string.

      articleId     articleTag       count  
  0     590020      A,B,C,D             2   
  1     466322      A,B,E               3   
  2     466322      B                   2   
  3     466322      A                   1   

I need to find out tag distribution i.e. for each tag how many number of times it appeared in articles and how many number of times its read.

like in above sample Dataframe.

Tag       Present       Read
A           3            6
B           3            7
C           1            2
D           1            2
E           1           3

Pl help.

Upvotes: 1

Views: 48

Answers (2)

Sunnysinh Solanki
Sunnysinh Solanki

Reputation: 551

df = pd.DataFrame([{"articleId": 590020, "articleTag": "A,B,C,D ", "count": 2}, 
                   {"articleId": 590021, "articleTag": "A,B,E", "count": 3},
                  {"articleId": 466322, "articleTag": "B", "count": 2},
                   {"articleId": 466322, "articleTag": "A", "count": 1}])
articles = []
for val in df['articleTag'].values:
    articles.extend(val.split(','))
unique_articles = list(set(articles))

final_dict = {}
final_dict['article'] = unique_articles
final_count = []
final_read = []
for article in unique_articles:
        l = [val for val in df['articleTag'].values if article in val]
        l2 = [val[1] for val in zip(df['articleTag'].values,df['count'].values) if article in val[0]]
        final_read.append(sum(l2))
        final_count.append(len(l))
final_dict['Present'] = final_count
final_dict['Read'] = final_read
pd.DataFrame(final_dict)

Upvotes: 1

erewok
erewok

Reputation: 7835

You could do something like this:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([{"articleId": 590020, "articleTag": "A,B,C,D ", "count": 6}, {"articleId": 590021, "articleTag": "A,B,E", "count": 3}])

In [3]: df[df.articleTag.str.contains("A")]['count'].sum()
Out: 9

In [4]: len(df[df.articleTag.str.contains("A")])
Out[4]: 2

The first is your "Read" value and the second is your "Present" value.

To find all the various tags, I'd probably do something like this:

In [5]: tag_df = df.articleTag.str.split(',', expand=True)

In [6]: for column in tag_df.columns:
...:     print(tag_df[column].unique())
...:     
...:     
['A']
['B']
['C' 'E']
['D ' None]

Instead of printing them you can add them to a set and collect all the tags you need to look for that way.

In [7]: unique_tags = set()

In [8]: for column in tag_df.columns:
    ...:     unique_tags |= set(tag_df[column].unique())
    ...:     
    ...:     
    ...:     
    ...:     

In [9]: unique_tags
Out[9]: {'B', 'A', 'C', 'E', None, 'D '}

Of course you'll have to pull out the None vals.

Upvotes: 1

Related Questions