Reputation: 37
I have a column of integers; some values are unique and some repeat. I want to add a column of random floats between 0 and 1, one per row, where every row with the same integer gets the same float.
The code below produces a column of ints and a second column of random floats, but I need rows that share an int (like 1, 1, and 1, or 6 and 6) to share one float, with that float still randomly generated. The real ints are 8 digits long and the data set is about 500,000 rows, so I am trying to be as efficient as possible.
I have a working solution that creates the random column and then iterates through the data frame checking for matching ints, but it takes a long time. Is there a more efficient method?
import numpy as np
import pandas as pd
col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
col2 = np.random.uniform(0,1,12)
data = np.array([col1, col2])
df1 = pd.DataFrame(data=data)
df1 = df1.transpose()
Upvotes: 1
Views: 2009
Reputation: 30609
Use transform after a groupby:
col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
df = pd.DataFrame(col1, columns=['Col1'])
df['Col2'] = df.groupby('Col1')['Col1'].transform(lambda x: np.random.rand())
Result:
Col1 Col2
0 1 0.304472
1 1 0.304472
2 1 0.304472
3 2 0.883114
4 3 0.381417
5 3 0.381417
6 3 0.381417
7 4 0.668433
8 5 0.365895
9 6 0.484803
10 6 0.484803
11 7 0.403913
This takes about 200 ms for 600K rows on my old laptop computer.
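If the per-group lambda call becomes the bottleneck, one alternative is to draw one float per distinct integer and index into that array. The following is only a sketch, not a benchmarked replacement: it swaps in pd.factorize and NumPy's default_rng (neither appears above), and it assumes Col1 has no missing values, since factorize codes those as -1.

```python
import numpy as np
import pandas as pd

col1 = [1, 1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 7]
df = pd.DataFrame({'Col1': col1})

# codes[i] is the position of row i's integer within uniques
codes, uniques = pd.factorize(df['Col1'])

# Draw one float in [0, 1) per distinct integer, then look it up per row
rng = np.random.default_rng()
floats_per_int = rng.random(len(uniques))
df['Col2'] = floats_per_int[codes]
```

This avoids calling a Python function once per group, which may matter when there are many distinct 8-digit integers.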
Upvotes: 1
Reputation: 2333
Create a dictionary with a random float for each integer key, and then map Column 1 through the dictionary to fill Column 2.
For integers already in Column1, start by making the dictionary:
import random
myInts = df.Column1.unique().tolist()
# one random float per distinct integer
myFloats = [random.uniform(0, 1) for _ in myInts]
myDictionary = dict(zip(myInts, myFloats))
This will give you:
{0: 0.7361124230574458,
1: 0.8039650720388128,
2: 0.7474880952026456,
3: 0.06792890878546265,
4: 0.4765215518349696,
5: 0.8058550699163101,
6: 0.8865969467094966,
7: 0.251791893958454,
8: 0.42261798056239686,
9: 0.03972320851777933,
....
}
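Then map Column 1 through the dictionary so that each identical integer gets the same float. Use bracket assignment, since attribute assignment (df.Column2 = ...) does not create a new column. Something like:
df['Column2'] = df.Column1.map(myDictionary)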
More info on how to map a series to a dictionary is here:
Using if/else in pandas series to create new series based on conditions
In this way you can get the desired results without rearranging your dataframe or iterating through it.
Cheers!
Upvotes: 0
Reputation: 320
This isn't totally iteration-free, but you're still only iterating over groups rather than every single row, so it's a touch better:
col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
col2 = np.random.uniform(0,1,len(set(col1)))
data = np.array([col1])
df1 = pd.DataFrame(data=data)
df1 = df1.transpose()
df2 = df1.groupby(0)
counter = 0
pieces = []
for key, item in df2:
    temp_df = item.copy()  # copy so the assignment below does not touch a view of df1
    temp_df[1] = col2[counter]  # same float for every row in this group
    counter += 1
    pieces.append(temp_df)
final_df = pd.concat(pieces)  # DataFrame.append was removed in pandas 2.0
final_df should be the result you're looking for.
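Whichever approach you use, a quick sanity check is to confirm that every integer is paired with exactly one float. Here is a small sketch; the helper name one_float_per_int and the two toy frames are made up for illustration.

```python
import pandas as pd

def one_float_per_int(frame, int_col, float_col):
    """Return True if each value in int_col maps to exactly one value in float_col."""
    return bool((frame.groupby(int_col)[float_col].nunique() == 1).all())

# Toy frames laid out like final_df (column 0 = ints, column 1 = floats)
good = pd.DataFrame({0: [1, 1, 2], 1: [0.25, 0.25, 0.75]})
bad = pd.DataFrame({0: [1, 1, 2], 1: [0.25, 0.5, 0.75]})

print(one_float_per_int(good, 0, 1))  # True
print(one_float_per_int(bad, 0, 1))   # False
```

Calling one_float_per_int(final_df, 0, 1) on the result above should return True.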
Upvotes: 0