João Fernandes
João Fernandes

Reputation: 505

Repeat dataframe rows n times according to the unique column values and to each row repeat create a new column with different values

As the title specifies, I'm trying to obtain a dataframe that is row repeated. The factor that decides the N repeats perform are based on the length of the unique values of a specific column present in the original dataframe. After the repeat process is performed, I would like to create a new column that applies all the same unique values of the specific column of the original dataframe to each new row created.

I know that this is a bit confusing, but I can´t try to expose my doubt in a better way. So, to facilitate your understanding of my desired approach, were goes a breif example of my dataframe and the desired output dataframe:

 >> Original Dataframe

       Samp     Age     Cs
 1       A      51      msi
 2       B      62      cin
 3       C      55      msi
 4       D      70      ebv
 5       E      56      gs
 ....

As you can see, my Cs column has 4 unique values (which may not always be the same for different dataframes). So, my goal is to obtain a dataframe with the following structure:

 >> Desired Dataframe

       Samp     Age     Cs
 1       A      51      msi
 1       A      51      cin
 1       A      51      ebv
 1       A      51      gs
 2       B      62      cin
 2       B      62      msi
 2       B      62      gs
 2       B      62      ebv
 3       C      55      msi
 3       C      55      cin
 3       C      55      ebv
 3       C      55      gs
 4       D      70      ebv
 4       D      70      cin
 4       D      70      msi
 4       D      70      gs
 5       E      56      gs
 5       E      56      cin
 5       E      56      msi
 5       E      56      ebv
 ....

As you can see, in my desired dataframe, all the rows were repeated 4 times (which is equal to the number of unique Cs column values) in exception the Cs column (which applies all its unique values in different rows).

Upvotes: 2

Views: 1843

Answers (3)

user3483203
user3483203

Reputation: 51155

Create another DataFrame using numpy.unique and perform a merge, which will produce the cartesian product of the two frames.

s = pd.Series(np.unique(df.Cs.values)).rename('Cs').to_frame()

pd.merge(
    df.iloc[:, :2].assign(key=0),
    s.assign(key=0),
    on='key'
).drop('key', 1)

   Samp  Age   Cs
0     A   51  cin
1     A   51  ebv
2     A   51   gs
3     A   51  msi
4     B   62  cin
5     B   62  ebv
6     B   62   gs
7     B   62  msi
8     C   55  cin
9     C   55  ebv
10    C   55   gs
11    C   55  msi
12    D   70  cin
13    D   70  ebv
14    D   70   gs
15    D   70  msi
16    E   56  cin
17    E   56  ebv
18    E   56   gs
19    E   56  msi

Timings

tmp = pd.DataFrame({
    'Samp': np.arange(10000),
    'Age': np.arange(10000),
    'Cs': np.repeat(df.Cs, 2000)
})

In [90]: %%timeit
    ...: s = pd.Series(np.unique(tmp.Cs.values)).rename('Cs').to_frame()
    ...: pd.merge(
    ...:     tmp.iloc[:, :2].assign(key=0),
    ...:     s.assign(key=0),
    ...:     on='key'
    ...: ).drop('key', 1)
    ...:
10.3 ms ± 92.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [91]: %%timeit
    ...: tmp['Cs'] = tmp['Cs'].astype('category')
    ...:
    ...: res = tmp.groupby(['Samp', 'Cs']).first().reset_index()
    ...: res['Age'] = res.groupby('Samp')['Age'].transform('first').astype(int)
    ...:
    ...:
51.5 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

jpp
jpp

Reputation: 164673

One solution is to convert 'Cs' to a Categorical. Then use GroupBy + first:

df['Cs'] = df['Cs'].astype('category')

res = df.groupby(['Samp', 'Cs']).first().reset_index()
res['Age'] = res.groupby('Samp')['Age'].transform('first').astype(int)

Result

   Samp   Cs  Age
0     A  cin   51
1     A  ebv   51
2     A   gs   51
3     A  msi   51
4     B  cin   62
5     B  ebv   62
6     B   gs   62
7     B  msi   62
8     C  cin   55
9     C  ebv   55
10    C   gs   55
11    C  msi   55
12    D  cin   70
13    D  ebv   70
14    D   gs   70
15    D  msi   70
16    E  cin   56
17    E  ebv   56
18    E   gs   56
19    E  msi   56

Upvotes: 2

Alex
Alex

Reputation: 826

You could use the pivot function and then fillna both ways to compute your result:

df.pivot('Cs', 'Samp', 'Age').fillna(method='ffill').fillna(method='bfill').unstack().to_frame('Age').reset_index()

Upvotes: 1

Related Questions