Reputation: 329
EDITED:
I have the following dataframe:
Name Code
Cedric AMMMM
Joe A
Mickael AMMCX
Dupond MMMMMMM
Jean AMMMCMC
I want to count the occurrences of each character in the values of the Code column, and replace each value with the concatenation of the count and the character.
My expected result is the following:
Name Code
Cedric 1A4M
Joe 1A
Mickael 1A2M1C1X
Dupond 7M
Jean 1A3M1C1M1C
I have tried the following method:
for index, row in df.iterrows():
    val = ""
    for i in "".join(set(row.Code)):
        num = row.Code.count(i)
        val = val + str(num) + i
    df.loc[index, "Code"] = val
But in reality I have a huge dataframe of more than 800,000 rows, and when I execute this code the process takes too long.
I'm searching for a better solution.
Edited: I added a last example to my dataframe. The previous responses don't handle this example, and I want to handle this use case.
Thanks for your help.
Upvotes: 4
Views: 2102
Reputation: 329
Thanks all,
Here is a comparison of the two methods:
from itertools import groupby
%time df['Code'] = [''.join(f"{len(''.join(group))}{key}" for key, group in groupby(x)) for x in df['Code']]
CPU times: user 511 µs, sys: 7 µs, total: 518 µs
Wall time: 524 µs
and
def encode(code):
    cpt = 1
    n = len(code)
    res = ''
    for i in range(n):
        if i == n - 1 or code[i] != code[i+1]:
            res += str(cpt) + code[i]
            cpt = 1
        else:
            cpt += 1
    return res
%time df['Code'] = df.Code.apply(encode)
CPU times: user 855 µs, sys: 10 µs, total: 865 µs
Wall time: 871 µs
The first method is faster than the second.
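As a quick sanity check (a minimal sketch wrapping the groupby one-liner above in a hypothetical rle_groupby helper), both approaches perform a run-length encoding, so they also handle the non-consecutive case from the edited question:
from itertools import groupby

def rle_groupby(x):
    # run-length encode a string: each run of identical characters
    # becomes "<run length><character>"
    return ''.join(f"{len(''.join(group))}{key}" for key, group in groupby(x))

print(rle_groupby('AMMMCMC'))  # 1A3M1C1M1C, same as encode('AMMMCMC')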
Upvotes: 0
Reputation: 18628
Counting must take non-consecutive duplicates into account.
First, a function which encodes a code:
def encode(code):
    cpt = 1
    n = len(code)
    res = ''
    for i in range(n):
        if i == n - 1 or code[i] != code[i+1]:
            res += str(cpt) + code[i]
            cpt = 1
        else:
            cpt += 1
    return res
Example: encode('AABBCA')
-> '2A2B1C1A'
Then just apply: df['Code'] = df.Code.apply(encode)
which gives:
Name Code
0 Cedric 1A4M
1 Joe 1A
2 Mickael 1A2M1C1X
3 Dupond 7M
4 Jean 1A3M1C1M1C
Upvotes: 1
Reputation: 1260
You can use Counter from collections in order to count the occurrences, then join the key and value pairs. You can apply this to the column with the DataFrame's apply function:
from collections import Counter as ctr
df['Code'] = df['Code'].apply(lambda x: ''.join([''.join(map(str, val[::-1])) for val in ctr(x).items()]))
Here I am using val[::-1] so that the output matches your expected format (count before character).
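As a minimal illustration of what the lambda does on a single value (the ordering of Counter.items() may vary across Python versions):
from collections import Counter

x = 'AMMCX'
pairs = Counter(x).items()  # e.g. dict_items([('A', 1), ('M', 2), ('C', 1), ('X', 1)])
# val[::-1] flips ('A', 1) into (1, 'A') so the count comes before the character
print(''.join(''.join(map(str, val[::-1])) for val in pairs))  # e.g. 1A2M1C1X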
Name Code
0 Cedric 1A4M
1 Joe 1A
2 Mickael 1A1X1C2M
3 Dupond 7M
Upvotes: 0
Reputation: 862406
Use a list comprehension with f-strings (working for Python 3.6+), and also add sorted by index so the original ordering is not changed:
df['Code'] = [''.join(f'{x.count(i)}{i}' for i in sorted(set(x),key=x.index)) for x in df['Code']]
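To see why the sorted is needed (a minimal sketch): set(x) loses the original character order, and key=x.index restores first-occurrence order before counting:
x = 'AMMCX'
print(set(x))                       # order is arbitrary, e.g. {'X', 'C', 'A', 'M'}
print(sorted(set(x), key=x.index))  # ['A', 'M', 'C', 'X'] -- first-occurrence order
print(''.join(f'{x.count(i)}{i}' for i in sorted(set(x), key=x.index)))  # 1A2M1C1X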
Or use Counter:
from collections import Counter
df['Code'] = [''.join(f'{j}{i}' for i, j in Counter(x).items()) for x in df['Code']]
print (df)
Name Code
0 Cedric 1A4M
1 Joe 1A
2 Mickael 1A2M1C1X
3 Dupond 7M
Performance:
#[40000 rows x 2 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [119]: %timeit df['Code'] = [''.join(f'{j}{i}' for i, j in Counter(x).items()) for x in df['Code']]
276 ms ± 9.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [120]: %timeit df['Code'] = [''.join(f'{x.count(i)}{i}' for i in sorted(set(x),key=x.index)) for x in df['Code']]
262 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#U9-Forward solution
In [124]: %timeit df['Code']=df['Code'].apply(lambda x: ''.join([''.join(map(str,i)) for i in Counter(x).items()]))
339 ms ± 51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 2
Reputation: 71560
Maybe use collections.Counter inside an apply, and also use a double ''.join for making a string from a dictionary:
from collections import Counter
df['Code']=df['Code'].apply(lambda x: ''.join([''.join(map(str,i)) for i in Counter(x).items()]))
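A minimal sketch of the double ''.join on a single value: the inner join turns each (character, count) pair into a string, and the outer join concatenates the pairs (item ordering may vary across Python versions):
from collections import Counter

x = 'AMMCX'
pairs = Counter(x).items()                     # e.g. dict_items([('A', 1), ('M', 2), ('C', 1), ('X', 1)])
inner = [''.join(map(str, i)) for i in pairs]  # e.g. ['A1', 'M2', 'C1', 'X1']
print(''.join(inner))                          # e.g. A1M2C1X1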
And now, print(df) gives:
Name Code
0 Cedric A1M4
1 Joe A1
2 Mickael A1M2C1X1
3 Dupond M7
Upvotes: 1