Reputation: 505
As the title says, I'm trying to obtain a dataframe whose rows are repeated. The number of repeats is determined by the number of unique values in a specific column of the original dataframe. After the repeat, each new row should take one of that column's unique values, so that every original row ends up paired with every unique value.
I know this is a bit confusing, and I can't find a better way to state my doubt. So, to make my desired approach easier to understand, here is a brief example of my dataframe and the desired output dataframe:
>> Original Dataframe
Samp Age Cs
1 A 51 msi
2 B 62 cin
3 C 55 msi
4 D 70 ebv
5 E 56 gs
....
As you can see, my Cs column has 4 unique values (which may not always be the same for different dataframes). So, my goal is to obtain a dataframe with the following structure:
>> Desired Dataframe
Samp Age Cs
1 A 51 msi
1 A 51 cin
1 A 51 ebv
1 A 51 gs
2 B 62 cin
2 B 62 msi
2 B 62 gs
2 B 62 ebv
3 C 55 msi
3 C 55 cin
3 C 55 ebv
3 C 55 gs
4 D 70 ebv
4 D 70 cin
4 D 70 msi
4 D 70 gs
5 E 56 gs
5 E 56 cin
5 E 56 msi
5 E 56 ebv
....
As you can see, in my desired dataframe every row is repeated 4 times (the number of unique values in the Cs column), except that the Cs column itself takes each of its unique values across the repeated rows.
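For completeness, a minimal sketch reconstructing the example dataframe above (values taken from the table; the dtypes are an assumption):

```python
import pandas as pd

# Hypothetical reconstruction of the question's example dataframe
df = pd.DataFrame({
    'Samp': ['A', 'B', 'C', 'D', 'E'],
    'Age': [51, 62, 55, 70, 56],
    'Cs': ['msi', 'cin', 'msi', 'ebv', 'gs'],
})

# The Cs column has 4 unique values, so each row should be repeated 4 times
print(df['Cs'].nunique())  # 4
```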
Upvotes: 2
Views: 1843
Reputation: 51155
Create another DataFrame using numpy.unique and perform a merge, which will produce the cartesian product of the two frames.
# One-column frame of the unique Cs values
s = pd.Series(np.unique(df.Cs.values)).rename('Cs').to_frame()

# A constant merge key on both sides yields the cartesian product
pd.merge(
    df.iloc[:, :2].assign(key=0),
    s.assign(key=0),
    on='key'
).drop(columns='key')  # positional axis (.drop('key', 1)) was removed in pandas 2.0
Samp Age Cs
0 A 51 cin
1 A 51 ebv
2 A 51 gs
3 A 51 msi
4 B 62 cin
5 B 62 ebv
6 B 62 gs
7 B 62 msi
8 C 55 cin
9 C 55 ebv
10 C 55 gs
11 C 55 msi
12 D 70 cin
13 D 70 ebv
14 D 70 gs
15 D 70 msi
16 E 56 cin
17 E 56 ebv
18 E 56 gs
19 E 56 msi
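On pandas 1.2+, the dummy key column can be avoided entirely: merge(how='cross') produces the cartesian product directly. A sketch assuming the example data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Samp': ['A', 'B', 'C', 'D', 'E'],
    'Age': [51, 62, 55, 70, 56],
    'Cs': ['msi', 'cin', 'msi', 'ebv', 'gs'],
})

# One-column frame of the unique Cs values
s = pd.DataFrame({'Cs': np.unique(df['Cs'].values)})

# how='cross' (pandas >= 1.2) yields the cartesian product without a dummy key
out = df[['Samp', 'Age']].merge(s, how='cross')
print(out.shape)  # (20, 3): 5 rows x 4 unique Cs values
```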
Timings
tmp = pd.DataFrame({
    'Samp': np.arange(10000),
    'Age': np.arange(10000),
    'Cs': np.repeat(df.Cs.values, 2000)  # .values avoids index alignment on the repeated Series
})
In [90]: %%timeit
...: s = pd.Series(np.unique(tmp.Cs.values)).rename('Cs').to_frame()
...: pd.merge(
...: tmp.iloc[:, :2].assign(key=0),
...: s.assign(key=0),
...: on='key'
...: ).drop(columns='key')
...:
10.3 ms ± 92.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [91]: %%timeit
...: tmp['Cs'] = tmp['Cs'].astype('category')
...:
...: res = tmp.groupby(['Samp', 'Cs']).first().reset_index()
...: res['Age'] = res.groupby('Samp')['Age'].transform('first').astype(int)
...:
...:
51.5 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1
Reputation: 164673
One solution is to convert 'Cs' to a Categorical, then use GroupBy + first:
df['Cs'] = df['Cs'].astype('category')

# observed=False (the pre-2.x default) keeps every (Samp, Cs) combination
res = df.groupby(['Samp', 'Cs'], observed=False).first().reset_index()

# Fill the Age of the combinations that were absent from the original rows
res['Age'] = res.groupby('Samp')['Age'].transform('first').astype(int)
Result
Samp Cs Age
0 A cin 51
1 A ebv 51
2 A gs 51
3 A msi 51
4 B cin 62
5 B ebv 62
6 B gs 62
7 B msi 62
8 C cin 55
9 C ebv 55
10 C gs 55
11 C msi 55
12 D cin 70
13 D ebv 70
14 D gs 70
15 D msi 70
16 E cin 56
17 E ebv 56
18 E gs 56
19 E msi 56
Upvotes: 2
Reputation: 826
You could use the pivot function and then fill NaNs in both directions to compute your result:
(df.pivot(index='Cs', columns='Samp', values='Age')  # keyword args are required since pandas 2.0
   .ffill().bfill()                                  # fillna(method=...) is deprecated
   .unstack().to_frame('Age').reset_index())
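A sketch of the intermediate steps, assuming the question's example data: the pivot builds a Cs-by-Samp grid of Ages with NaN wherever a pair was absent, and the forward/backward fills propagate each Samp's single Age down and up its column before flattening back:

```python
import pandas as pd

df = pd.DataFrame({
    'Samp': ['A', 'B', 'C', 'D', 'E'],
    'Age': [51, 62, 55, 70, 56],
    'Cs': ['msi', 'cin', 'msi', 'ebv', 'gs'],
})

# Cs x Samp grid; NaN where a (Cs, Samp) pair did not occur
grid = df.pivot(index='Cs', columns='Samp', values='Age')

# Each Samp column holds one Age; ffill/bfill copy it to every Cs row
out = grid.ffill().bfill().unstack().to_frame('Age').reset_index()
print(out.shape)  # (20, 3)
```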
Upvotes: 1