Reputation: 329

Python - Pandas - Count the number of character occurrence in a string and replace the string value

EDITED :

I have the following dataframe

Name        Code
Cedric      AMMMM
Joe         A
Mickael     AMMCX
Dupond      MMMMMMM
Jean        AMMMCMC

I want to count the occurrences of each character in the values of the Code column, and replace each value with the concatenation of each count and its character.

My expected result is the following:

Name        Code
Cedric      1A4M
Joe         1A
Mickael     1A2M1C1X
Dupond      7M
Jean        1A3M1C1M1C

I have tried the following method:

for index, row in df.iterrows():
    val = ""
    # set() drops duplicates but does not keep the original character order
    for i in "".join(set(row.Code)):
        num = row.Code.count(i)
        val = val + str(num) + i
    df.loc[index, "Code"] = val

In reality I have a huge dataframe with more than 800,000 rows, and this code takes too long to run.

I'm looking for a faster solution.

Edited: I added a last example to my dataframe. The previous answers don't handle this example, and I want to handle this use case.

Thanks for your help.

Upvotes: 4

Answers (5)

Clement Ros

Reputation: 329

Thanks all,

Here is a comparison of two methods:

from itertools import groupby

%time df['Code'] = [''.join(f"{len(''.join(group))}{key}" for key, group in groupby(x)) for x in df['Code']]

CPU times: user 511 µs, sys: 7 µs, total: 518 µs
Wall time: 524 µs

and

def encode(code):
    cpt=1 
    n=len(code)
    res=''
    for i in range(n):
        if i == n-1 or code[i] != code[i+1]:
            res += str(cpt)+code[i]
            cpt=1
        else: cpt+=1
    return res

%time result['CDSCENARIO']=result.CDSCENARIO.apply(encode)

CPU times: user 855 µs, sys: 10 µs, total: 865 µs
Wall time: 871 µs

The first method (groupby) is faster than the second (the encode function).
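The timings above are in microseconds, so the gap may look different on 800,000 rows. A minimal sketch for repeating the comparison with the standard-library timeit module, assuming the codes are held in a plain list. The names rle_groupby, rle_loop and codes are illustrative and not taken from the answers, and the input size is arbitrary:

```python
from itertools import groupby
import timeit

def rle_groupby(code):
    # groupby yields one (key, group) pair per run of identical characters
    return ''.join(f"{sum(1 for _ in group)}{key}" for key, group in groupby(code))

def rle_loop(code):
    # Same logic as the encode function: emit the count when a run ends
    res, cpt, n = '', 1, len(code)
    for i in range(n):
        if i == n - 1 or code[i] != code[i + 1]:
            res += str(cpt) + code[i]
            cpt = 1
        else:
            cpt += 1
    return res

# Sample codes from the question, repeated to get a larger input
codes = ['AMMMM', 'A', 'AMMCX', 'MMMMMMM', 'AMMMCMC'] * 2000

# Both functions must agree before their timings are worth comparing
assert [rle_groupby(c) for c in codes] == [rle_loop(c) for c in codes]

for fn in (rle_groupby, rle_loop):
    seconds = timeit.timeit(lambda: [fn(c) for c in codes], number=3)
    print(f"{fn.__name__}: {seconds / 3:.3f} s per pass")
```

On a DataFrame, the result of the list comprehension can be assigned back with df['Code'] = [rle_groupby(c) for c in df['Code']].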

Upvotes: 0

B. M.

Reputation: 18628

The counting must handle non-consecutive duplicates.

First, a function that encodes a code:

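def encode(code):
    cpt = 1  # length of the current run
    n = len(code)
    res = ''
    for i in range(n):
        # A run ends at the last character or when the next character differs
        if i == n - 1 or code[i] != code[i + 1]:
            res += str(cpt) + code[i]
            cpt = 1
        else:
            cpt += 1
    return res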

Example: encode('AABBCA') -> '2A2B1C1A'.

Then just apply it with df['Code']=df.Code.apply(encode), which gives:

      Name       Code
0   Cedric       1A4M
1      Joe         1A
2  Mickael   1A2M1C1X
3   Dupond         7M
4     Jean 1A3M1C1M1C
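To see why consecutive runs matter, it can help to look at the runs themselves before they are joined into a string. A small sketch, where the generator runs is a hypothetical helper and not part of the answer above:

```python
def runs(code):
    """Yield (count, character) for each run of identical consecutive characters."""
    count = 0
    previous = None
    for ch in code:
        if ch == previous:
            count += 1
        else:
            if previous is not None:
                yield count, previous  # the previous run has just ended
            previous, count = ch, 1
    if previous is not None:
        yield count, previous  # flush the last run

print(list(runs('AMMMCMC')))
# [(1, 'A'), (3, 'M'), (1, 'C'), (1, 'M'), (1, 'C')]
print(''.join(f"{n}{ch}" for n, ch in runs('AMMMCMC')))
# 1A3M1C1M1C
```

The second M and the second C start new runs, which is why they appear separately in the result instead of being added to the earlier counts.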

Upvotes: 1

Shrey

Reputation: 1260

You can use Counter from collections to count the occurrences, then join the key and value pairs. Apply this to the column with the apply method of a pandas Series:

from collections import Counter as ctr
df['Code'] = df['Code'].apply(lambda x: ''.join([''.join(map(str, val[::-1])) for val in ctr(x).items()]))

Here I am using val[::-1] to reverse each (character, count) pair, so that the output matches your expected format.

    Name      Code  
0   Cedric    1A4M
1   Joe       1A    
2   Mickael   1A1X1C2M  
3   Dupond    7M
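Note that Counter counts every occurrence of a character across the whole string, so it merges runs that are not adjacent. A quick check on the last row added to the question, with the lambda wrapped in an illustrative function named counter_encode:

```python
from collections import Counter

def counter_encode(code):
    # Same expression as the lambda above: reverse each (char, count) pair and join
    return ''.join(''.join(map(str, val[::-1])) for val in Counter(code).items())

# On Python 3.7+, Counter keeps the order in which characters were first seen
print(counter_encode('AMMCX'))    # 1A2M1C1X
print(counter_encode('AMMMCMC'))  # 1A4M2C, whereas the question expects 1A3M1C1M1C
```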

Upvotes: 0

jezrael

Reputation: 862406

Use a list comprehension with f-strings (Python 3.6+), sorting the unique characters by their first index so that the original order is preserved:

df['Code'] = [''.join(f'{x.count(i)}{i}' for i in sorted(set(x),key=x.index)) for x in df['Code']]

Or use Counter:

from collections import Counter

df['Code'] = [''.join(f'{j}{i}' for i, j in Counter(x).items()) for x in df['Code']]


print (df)
      Name      Code
0   Cedric      1A4M
1      Joe        1A
2  Mickael  1A2M1C1X
3   Dupond        7M
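The sorted(set(x), key=x.index) step in the first variant is what restores the order of first appearance, since a set has no defined order. A small illustration on a made-up string (the value 'XAMMA' is only an example):

```python
x = 'XAMMA'

# x.index(ch) is the position of the first occurrence of ch,
# so sorting by it lists the unique characters in order of first appearance
unique_in_order = sorted(set(x), key=x.index)
print(unique_in_order)  # ['X', 'A', 'M']

print(''.join(f'{x.count(i)}{i}' for i in unique_in_order))  # 1X2A2M

# On Python 3.7+, dict.fromkeys gives the same ordering without sorting
print(list(dict.fromkeys(x)))  # ['X', 'A', 'M']
```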

Performance:

#[40000 rows x 2 columns]
df = pd.concat([df] * 10000, ignore_index=True)

In [119]: %timeit df['Code'] = [''.join(f'{j}{i}' for i, j in Counter(x).items()) for x in df['Code']]
276 ms ± 9.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [120]: %timeit df['Code'] = [''.join(f'{x.count(i)}{i}' for i in sorted(set(x),key=x.index)) for x in df['Code']]
262 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#U9-Forward solution
In [124]: %timeit df['Code']=df['Code'].apply(lambda x: ''.join([''.join(map(str,i)) for i in Counter(x).items()]))
339 ms ± 51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
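If many rows share the same code, as in the benchmark frame above that repeats four rows 10,000 times, one possible shortcut is to encode each distinct value once and look the result up for every row. A sketch under that assumption, shown on a plain list with an illustrative groupby-based function named rle; whether it helps depends on how often codes actually repeat in the real data:

```python
from itertools import groupby

def rle(code):
    # One "<count><char>" piece per run of identical consecutive characters
    return ''.join(f"{sum(1 for _ in g)}{k}" for k, g in groupby(code))

codes = ['AMMMM', 'A', 'AMMCX', 'MMMMMMM', 'AMMMCMC'] * 10000

# Encode each distinct code once, then reuse the result for every row
mapping = {code: rle(code) for code in set(codes)}
encoded = [mapping[code] for code in codes]

print(len(mapping))  # 5
print(encoded[:5])   # ['1A4M', '1A', '1A2M1C1X', '7M', '1A3M1C1M1C']
```

With pandas, the same idea would be to build the dictionary from df['Code'].unique() and apply it with df['Code'].map(mapping).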

Upvotes: 2

U13-Forward

Reputation: 71560

Maybe use collections.Counter inside apply, with a double ''.join to build a string from the dictionary:

from collections import Counter
df['Code']=df['Code'].apply(lambda x: ''.join([''.join(map(str,i)) for i in Counter(x).items()]))

And now:

print(df)

Is:

      Name      Code
0   Cedric      A1M4
1      Joe        A1
2  Mickael  A1M2C1X1
3   Dupond        M7

Upvotes: 1
