Jenny Jing Yu
Jenny Jing Yu

Reputation: 195

Remove duplicated and count value in the same cell of a da

I had a sample Dataframe with column a containing the duplicated values like below:

        a
0   1089, 1089, 1089
1   10A3, 10A3
2   10A3, 10A4, 10A4
3   TEL, TV
4   EZ, EZ
5   ABC Co., ABC Co.

I would like to remove the duplicates and count the value of each cell:

      a               count
0   1089                1
1   10A3                1
2   10A3, 10A4          2
3   TEL, TV             2
4   EZ                  1
5   ABC Co.             1

Upvotes: 2

Views: 75

Answers (3)

Parijat Bhatt
Parijat Bhatt

Reputation: 674

Try this


def f(x):

    l = x.split(',')

    d = {}

    for key in l:
        if key.rstrip() not in d:
            d[key.rstrip()] = 0
        d[key.rstrip()]+=1

    return ','.join(list(d.keys()))
df['a_new'] = df['a'].apply(lambda x:f(x))
print(df)
df['count'] = df['a_new'].apply(lambda x: len(x.split(',')))

Upvotes: 0

rafaelc
rafaelc

Reputation: 59274

Use str.get_dummies and sum across axis=1

df['count'] = df.a.str.get_dummies(sep=', ').sum(1)

To remove the duplicates, use explode

s = df.assign(a=df.a.str.split(', ')).explode('a').drop_duplicates()

         a  count
0     1089      1
1     10A3      1
2     10A3      2
2     10A4      2
3      TEL      2
3       TV      2
4       EZ      1
5  ABC Co.      1

If you really need it all in the same line...

s.groupby(s.index).agg({'a': ', '.join, 'count': 'first'})

          a  count
0        1089      1
1        10A3      1
2  10A3, 10A4      2
3     TEL, TV      2
4          EZ      1
5     ABC Co.      1

Or simply use @WeNYoBen ingenious solution ;)

s=df.a.str.get_dummies(sep=', ')
df['a']=s.dot(s.columns+',').str[:-1]
df['count']=s.sum(1)

Upvotes: 4

harpan
harpan

Reputation: 8631

You need to define your own method and apply it on your dataframe.

def list_count(x):
    l=pd.Series(x.split(',')).str.strip().drop_duplicates().tolist()
    return pd.Series([', '.join(l), len(l)])

df['a'].apply(lambda x: list_count(x)).rename(columns={0:'a', 1:'count'})

Output:

            a  count
0        1089      1
1        10A3      1
2  10A3, 10A4      2
3     TEL, TV      2
4          EZ      1
5     ABC Co.      1

Upvotes: 2

Related Questions