Reputation: 31
Original Dataframe is
column_one
1
1
1
45
45
55
55
56
Expected Output
column-new
i_1
i_1
i_1
i_2
i_2
i_3
i_3
i_4
I want to add a new column to my dataframe based on column_one. Wherever there are consecutive equal values, the new column should contain 'i_' followed by the same index for each of them. Thank you in advance.
Upvotes: 2
Views: 85
Reputation: 81
Depending on how fast this needs to perform, you can look into using categoricals (dtype="category"), as they should be incredibly fast with large data sets.
If you import your data as a category data type, this will already determine which are the unique values.
df["col1"] = df["col1"].astype('category')
From here, you can then implement the leading "i_" and output the category value, remembering to scale up so that you begin with 1 rather than 0:
df['newcol1'] = "i_" + (df["col1"].cat.codes + 1).astype(str)
Output
col1 newcol1
0 1 i_1
1 1 i_1
2 1 i_1
3 45 i_2
4 45 i_2
5 55 i_3
6 55 i_3
7 56 i_4
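As a self-contained sketch of the steps above, using the sample values from the question: the helper name label_by_category is made up for illustration and is not part of pandas. One caveat worth keeping in mind is that category codes follow the sorted order of the categories rather than the order of first appearance, so the labels match the expected output here only because the sample column is already ascending.

```python
import pandas as pd


def label_by_category(values, prefix="i_"):
    # Hypothetical helper (not a pandas API): convert to a categorical,
    # then turn the 0-based category codes into 1-based string labels.
    col = pd.Series(values).astype("category")
    return prefix + (col.cat.codes + 1).astype(str)


df = pd.DataFrame({"col1": [1, 1, 1, 45, 45, 55, 55, 56]})
df["newcol1"] = label_by_category(df["col1"])
print(df)

# Codes follow the sorted categories (1, 45), so 45 is labelled i_2
# even though it appears first in this unsorted input.
unsorted_labels = label_by_category([45, 1, 45])
print(unsorted_labels.tolist())
```

If the labels need to follow order of first appearance on unsorted data, the category codes alone will not give that.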
Timings
As the code is simply reading the category codes, timing the category lookup against the factorize function for a column of 10,000,000 values between 0 and 1000 shows the category approach to be far faster. This is because the codes are already stored with the categorical column, so reading them does no further computation, whereas factorize has to process the values on every call.
It should be noted that there is an initial setup overhead for the conversion (also shown for completeness), so the factorize function would be better if you are only performing this once.
Categoricals: 0 ms
Factorize: 2092 ms
Categoricals Conversion: 3253 ms
Timings Code:
import time

import numpy as np
import pandas as pd


def timing(label, fn):
    # Run fn once and print the elapsed wall-clock time in milliseconds.
    t0 = time.time()
    fn()
    t1 = time.time()
    print('%s: %d ms' % (label, int((t1 - t0) * 1000)))


# One integer column of random values between 0 and 999.
df = pd.DataFrame(np.random.randint(low=0, high=1000, size=(100000000, 1)), columns=["col1"])
df["col1"] = df["col1"].astype('category')

timing('Categoricals', lambda: (df.col1.cat.codes))
timing('Factorize', lambda: (df.col1.factorize()))
Upvotes: 0
Reputation: 88226
You could use pd.factorize. From the docs:
Useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.
So it will encode each new value it encounters as an enumerated type. Afterwards you can simply add the 'i_' prefix to the new_col:
df['new_col'] = (df.col1.factorize()[0] + 1).astype(str)
df['new_col'] = 'i_' + df.new_col
Output
col1 new_col
0 1 i_1
1 1 i_1
2 1 i_1
3 45 i_2
4 45 i_2
5 55 i_3
6 55 i_3
7 56 i_4
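For reference, here is a runnable sketch of the same idea on the question's sample values. Note that factorize numbers distinct values, so a value that reappears later keeps its earlier label. The question mentions consecutive values, so the second half shows an assumed variant (a shift/cumsum comparison, which is a different technique from factorize) that starts a new label at every change instead. The small series s is illustrative data, not from the question.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 1, 45, 45, 55, 55, 56]})

# factorize returns 0-based codes in order of first appearance.
codes, uniques = pd.factorize(df["col1"])
df["new_col"] = ["i_" + str(code + 1) for code in codes]
print(df)

# Assumed variant: label runs of consecutive equal values.
# A new run starts wherever a value differs from the one before it.
s = pd.Series([1, 1, 45, 1])  # illustrative data with a repeated value
run_id = (s != s.shift()).cumsum()
run_labels = ("i_" + run_id.astype(str)).tolist()

# For comparison, factorize gives the trailing 1 its original label.
by_value = ["i_" + str(code + 1) for code in pd.factorize(s)[0]]

print(run_labels)
print(by_value)
```

On the question's sample data both approaches give the same labels, since no value reappears after a different one.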
Upvotes: 2