Rodrigo Bonadia

Reputation: 125

Pandas: select value from random column on each row

Suppose I have the following Pandas DataFrame:

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
})

    a   b   c
0   1   4   7
1   2   5   8
2   3   6   9

I want to generate a new pandas.Series so that the values of this series are selected, row by row, from a random column in the DataFrame. So, a possible output for that would be the series:

0    7
1    2
2    9
dtype: int64

(where in row 0 it randomly chose 'c', in row 1 it randomly chose 'a' and in row 2 it randomly chose 'c' again).

I know this can be done by iterating over the rows and using random.choice to pick a value from each row, but iterating over the rows not only performs badly but is also "unpandonic", so to speak. Also, df.sample(axis=1) would choose whole columns, so every value would come from the same column, which is not what I want. Is there a better way to do this with vectorized pandas methods?
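For reference, here is a minimal sketch of the two approaches ruled out above: the explicit row loop with random.choice, and df.sample(axis=1). The names `picked`, `baseline` and `one_column` are illustrative only.

```python
import random

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
})

# Row-by-row baseline: pick one value from each row with random.choice.
picked = []
for _, row in df.iterrows():
    picked.append(random.choice(row.tolist()))
baseline = pd.Series(picked, index=df.index)

# df.sample(axis=1) samples whole columns, so every row
# ends up with a value from the same column.
one_column = df.sample(n=1, axis=1)
```

The loop gives the desired result but runs in Python for every row, while `one_column` is a single-column DataFrame and so never mixes columns across rows.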

Upvotes: 5

Views: 4737

Answers (5)

jfaccioni

Reputation: 7509

You're probably still going to need to iterate through each row while selecting a random value in each row - whether you do it explicitly with a for loop or implicitly with whatever function you decide to call.

You can, however, simplify the code to a single line using a list comprehension, if it suits your style:

import random
result = pd.Series([random.choice(df.iloc[i].tolist()) for i in range(len(df))])

Upvotes: 1

mujjiga

Reputation: 16866

pd.DataFrame(
    df.values[
        range(df.shape[0]),
        np.random.randint(0, df.shape[1], size=df.shape[0]),
    ]
)

Output:

    0
0   4
1   5
2   9
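
Since the question asks for a Series rather than a DataFrame, the same indexing can be wrapped in pd.Series so the original index is kept. This is only a sketch, assuming all columns share a numeric dtype. The names `rng`, `cols` and `picked` are illustrative, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
})

# Seeded generator so the draw is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# One random column position per row.
cols = rng.integers(0, df.shape[1], size=df.shape[0])

# Pair each row position with its column position, then keep df's index.
picked = pd.Series(df.to_numpy()[np.arange(df.shape[0]), cols], index=df.index)
```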

Upvotes: 1

sjw

Reputation: 6543

Here is a fully vectorized solution. Note however that it does not use Pandas methods, but rather involves operations on the underlying numpy array.

import numpy as np

indices = np.random.choice(np.arange(len(df.columns)), len(df), replace=True)

Example output is [1, 2, 1] which corresponds to ['b', 'c', 'b'].

Then use this to slice the numpy array:

df['random'] = df.to_numpy()[np.arange(len(df)), indices]

Results (from a separate draw, in which indices was [2, 1, 2]):

   a  b  c  random
0  1  4  7       7
1  2  5  8       5
2  3  6  9       9
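
To see which column each value came from, the integer positions can be mapped back to labels with df.columns. The sketch below assumes the same df as in the question. The names `rng`, `chosen_cols`, `values` and `check` are illustrative, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
})

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# One column position per row, drawn with replacement.
indices = rng.integers(len(df.columns), size=len(df))

# Map positions back to column labels.
chosen_cols = df.columns[indices]

# Fancy indexing: row i is paired with column indices[i].
values = df.to_numpy()[np.arange(len(df)), indices]

# Cross-check against label-based lookups.
check = [df.loc[i, c] for i, c in zip(df.index, chosen_cols)]
```

Note that once a `random` column has been added to df, running the snippet again would include that column in the pool, so in that case select the source columns explicitly first.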

Upvotes: 5

Valentino

Reputation: 7361

This does the job (using the built-in module random):

ddf = df.apply(lambda row : random.choice(row.tolist()), axis=1)

or using pandas sample (taking .iloc[0] so that each row yields a scalar rather than a one-element Series):

ddf = df.apply(lambda row : row.sample().iloc[0], axis=1)

Both have the same behaviour. ddf is your Series.
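
It is worth checking the return type of each variant. In the sketch below (the names `via_choice`, `via_sample` and `expanded` are illustrative), a bare row.sample() returns a one-element Series per row, which apply expands into a DataFrame, while extracting the scalar with .iloc[0] gives a Series.

```python
import random

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
})

# Scalar per row, so apply returns a Series.
via_choice = df.apply(lambda row: random.choice(row.tolist()), axis=1)

# row.sample() is a one-element Series; .iloc[0] extracts the scalar.
via_sample = df.apply(lambda row: row.sample().iloc[0], axis=1)

# Without .iloc[0], each row returns a Series and apply expands
# the results into a DataFrame instead of a Series.
expanded = df.apply(lambda row: row.sample(), axis=1)
```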

Upvotes: 2

anky

Reputation: 75080

Maybe something like:

pd.Series([np.random.choice(i,1)[0] for i in df.values])

Upvotes: 5
