Lisa
Lisa

Reputation: 337

Python: How to change same numbers in a Series/Column to other values?

I am trying to change the values of a very long column (about 1mio entries) in a data frame. I have something like

####ID_Orig
3452  
3452  
3452  
6543  
6543
...

I want something like

####ID_new
0  
0  
0  
1  
1  
...

At the moment I'm doing this:

j=0
for i in range(0,1199531): 
    if data.ID_orig[i]==data.ID_orig[i+1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j=j+1

Which takes about ages... Is there a faster way to do this? I don't know what values ID_orig has and how often a single value comes up.

Upvotes: 1

Views: 236

Answers (3)

Clock Slave
Clock Slave

Reputation: 7957

You can use the following. In the following implementation duplicate ids in the original id will get same ids. The implementation is based on dropping duplicates from the column and assigning a different number to each unique id to form the enw ids. These new ids are then merged into the original dataset

import numpy as np
import pandas as pd
from time import time

num_rows = 119953
input_data = np.random.randint(1199531, size=(num_rows,1))
data = pd.DataFrame(input_data)
data.columns = ["ID_orig"]

data2 = pd.DataFrame(input_data)
data2.columns = ["ID_orig"]

t0 = time()
j=0
for i in range(0,num_rows-1): 
    if data.ID_orig[i]==data.ID_orig[i+1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j=j+1

t1 = time()
id_new = data2.loc[:,"ID_orig"].drop_duplicates().reset_index().drop("index", axis=1)
id_new.reset_index(inplace=True)
id_new.columns = ["id_new"] + id_new.columns[1:].values.tolist() 
data2 = data2.merge(id_new, on="ID_orig")

t2 = time()

print("Previous: ", round(t1-t0, 2), " seconds")
print("Current : ", round(t2-t1, 2), " seconds")

The output of the above program using only 119k rows is

Previous:  12.16 seconds
Current :  0.06 seconds

The runtime difference increases even more as the number of rows are increased.

EDIT Using the same number of rows:

>>> print("Previous: ", round(t1-t0, 2))
Previous:  11.7
>>> print("Current : ", round(t2-t1, 2))
Current :  0.06
>>> print("jezrael's answer : ", round(t3-t2, 2))
jezrael's answer :  0.02

Upvotes: 0

jezrael
jezrael

Reputation: 862481

Use factorize, but if duplicated groups then output values are set to same number.

Another solution with comparing by ne (!=) of shifted values with cumsum is more general - create always new values, also if repeating group values:

df['ID_new1'] = pd.factorize(df['ID_Orig'])[0]
df['ID_new2'] = df['ID_Orig'].ne(df['ID_Orig'].shift()).cumsum() - 1
print (df)
   ID_Orig  ID_new1  ID_new2
0     3452        0        0
1     3452        0        0
2     3452        0        0
3     6543        1        1
4     6543        1        1
5      100        2        2
6      100        2        2
7     6543        1        3 <-repeating group
8     6543        1        3 <-repeating group

Upvotes: 1

Red Cricket
Red Cricket

Reputation: 10460

You can do this …

import collections


l1 = [3452, 3452, 3452, 6543, 6543]
c = collections.Counter(l1)
l2 = list(c.items())
l3 = []

for i, t in enumerate(l2):
    for x in range(t[1]):
        l3.append(i)

for x in l3:
    print(x)

This is the output:

0
0
0
1
1

Upvotes: 0

Related Questions