bw1997

Reputation: 37

Adding column of random floats to data frame, but with equal values for equal data frame entries

I have a column of integers, some are unique and some are the same. I want to add a column of random floats between 0 and 1 per row, but I want all of the floats to be the same per integer.

The code I'm providing shows a column of ints and a second column of random floats, but I need the floats for the same ints, like 1, 1, and 1, or 6 and 6, to all be the same, while still having whatever the float assigned to that int randomly generated. The ints I'm working with, however, are 8 digits, and the data set I am using is about 500,000 lines, so I am trying to be as efficient as possible.

I have a working solution that iterates over the finished data frame, but creating the random column and then looping through it to overwrite the floats for matching ints takes a long time. Is there a more efficient method?

import numpy as np
import pandas as pd

col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
col2 = np.random.uniform(0,1,12)

data = np.array([col1, col2])

df1 = pd.DataFrame(data=data)
df1 = df1.transpose()

Upvotes: 1

Views: 2009

Answers (3)

Stef

Reputation: 30609

Use transform after a groupby:

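import numpy as np
import pandas as pd

col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
df = pd.DataFrame(col1, columns=['Col1'])

# one random float is drawn per group and broadcast to every row of that group
df['Col2'] = df.groupby('Col1')['Col1'].transform(lambda x: np.random.rand())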

Result:

    Col1      Col2
0      1  0.304472
1      1  0.304472
2      1  0.304472
3      2  0.883114
4      3  0.381417
5      3  0.381417
6      3  0.381417
7      4  0.668433
8      5  0.365895
9      6  0.484803
10     6  0.484803
11     7  0.403913

This takes about 200 ms for 600K rows on my old laptop computer.
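
If the groupby is still too slow, one possible variation is to draw one float per distinct integer and fan it out to the rows by position. This is a minimal sketch, assuming a column named Col1 as above; the synthetic 500,000-row frame, the pool of 50,000 ids, and the seed are made up for illustration, and no timing is claimed for it.

```python
import numpy as np
import pandas as pd

# Hypothetical data roughly the size described in the question:
# 500,000 rows drawn from a pool of 8-digit integers, so values repeat.
rng = np.random.default_rng(0)  # seed only makes the sketch repeatable
ids = rng.integers(10_000_000, 100_000_000, size=50_000)
df = pd.DataFrame({'Col1': rng.choice(ids, size=500_000)})

# codes[i] is the position of row i's integer within uniques
codes, uniques = pd.factorize(df['Col1'])

# draw one float per distinct integer, then index by code to expand to all rows
df['Col2'] = rng.random(len(uniques))[codes]
```

Rows that share an integer share a code, so they pick up the same float without any Python-level loop or lambda.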

Upvotes: 1

johnDanger

Reputation: 2333

Create a dictionary with a random float for each integer key, and then map Column 1 through the dictionary to build Column 2.

For integers already in Column1, start by making the dictionary:

import random

myInts = df['Column1'].unique().tolist()
myFloats = [random.uniform(0, 1) for _ in myInts]

myDictionary = dict(zip(myInts, myFloats))

This will give you:

{0: 0.7361124230574458,
 1: 0.8039650720388128,
 2: 0.7474880952026456,
 3: 0.06792890878546265,
 4: 0.4765215518349696,
 5: 0.8058550699163101,
 6: 0.8865969467094966,
 7: 0.251791893958454,
 8: 0.42261798056239686,
 9: 0.03972320851777933,
....
}

Then map Column 1 through the dictionary so that each identical integer gets the same float. Use bracket assignment, because attribute-style df.Column2 = ... does not create a new column:

df['Column2'] = df['Column1'].map(myDictionary)

More info on how to map a series to a dictionary is here:

Using if/else in pandas series to create new series based on conditions

In this way you can get the desired results without rearranging your dataframe or iterating through it.
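
One side benefit of keeping the dictionary around is that it can be reused if more rows arrive later. A rough sketch, assuming that is something you need; assign_floats is a hypothetical helper name, not a pandas function, and the second frame is invented for the example.

```python
import random
import pandas as pd

def assign_floats(series, mapping):
    """Map each integer to its float, drawing a new float only for unseen integers."""
    for value in series.unique():
        if value not in mapping:
            mapping[value] = random.uniform(0, 1)
    return series.map(mapping)

myDictionary = {}
df = pd.DataFrame({'Column1': [1, 1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 7]})
df['Column2'] = assign_floats(df['Column1'], myDictionary)

# A later batch reuses the floats already assigned to 6 and 7 and adds one for 8.
more = pd.DataFrame({'Column1': [6, 7, 8, 8]})
more['Column2'] = assign_floats(more['Column1'], myDictionary)
```

The loop runs once per distinct integer, not once per row, so it stays cheap even when the column is long.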

Cheers!

Upvotes: 0

Alexandre Daly

Reputation: 320

This isn't totally iteration-free, but you're still only iterating over groups rather than every single row, so it's a touch better:

import numpy as np
import pandas as pd

col1 = [1,1,1,2,3,3,3,4,5,6,6,7]
# one random float per distinct integer
col2 = np.random.uniform(0, 1, len(set(col1)))

df1 = pd.DataFrame({0: col1})

# iterate over the groups, collect the pieces, and concatenate once at the end
pieces = []
for counter, (key, group) in enumerate(df1.groupby(0)):
    temp_df = group.copy()       # copy so the assignment does not touch a view
    temp_df[1] = col2[counter]   # scalar is broadcast to every row in the group
    pieces.append(temp_df)

final_df = pd.concat(pieces)

final_df should be the result you're looking for.
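
To double-check a result like this, one option is to count the distinct floats per integer. A small sketch; one_float_per_int is a hypothetical helper name, and the two tiny frames are made up to show a passing and a failing case with the same 0/1 column labels as above.

```python
import pandas as pd

def one_float_per_int(frame, int_col, float_col):
    """Return True if every integer in int_col is paired with exactly one float."""
    return bool(frame.groupby(int_col)[float_col].nunique().eq(1).all())

good = pd.DataFrame({0: [1, 1, 2], 1: [0.25, 0.25, 0.75]})
bad = pd.DataFrame({0: [1, 1, 2], 1: [0.25, 0.50, 0.75]})

print(one_float_per_int(good, 0, 1))  # True
print(one_float_per_int(bad, 0, 1))   # False
```

Calling one_float_per_int(final_df, 0, 1) on the frame built above would run the same check.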

Upvotes: 0
