Peaceful

Reputation: 5480

Replicate the data in pandas dataframe

I have some data in a dataframe df of length n, and I am creating a larger dataframe dg of length, say, 10n. I want to copy the data from df into dg so that the rows of dg are filled periodically with the rows of df. I tried the following:

n = len(df)
dg = pd.DataFrame(index=range(10*n), columns=df.columns)

# copy df into dg one block of n rows at a time, column by column
for i in range(0, 10*n, n):
    for col in df.columns:
        dg.loc[i : i+n-1, col] = df[col].values

However, this is extremely slow. Is there a faster way to achieve the same result? Ideally, I would love to see a solution in which I can simply take df and extend its length to 10n, so that all the data is copied periodically.

Upvotes: 1

Views: 1334

Answers (2)

piRSquared

Reputation: 294508

Consider the dataframe df

import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
df

          a         b         c         d         e
0  0.444939  0.407554  0.460148  0.465239  0.462691
1  0.016545  0.850445  0.817744  0.777962  0.757983
2  0.934829  0.831104  0.879891  0.926879  0.721535
3  0.117642  0.145906  0.199844  0.437564  0.100702

pandas

Using iloc

r = np.arange(len(df)).repeat(3)
df.iloc[r].reset_index(drop=True)

           a         b         c         d         e
0   0.444939  0.407554  0.460148  0.465239  0.462691
1   0.444939  0.407554  0.460148  0.465239  0.462691
2   0.444939  0.407554  0.460148  0.465239  0.462691
3   0.016545  0.850445  0.817744  0.777962  0.757983
4   0.016545  0.850445  0.817744  0.777962  0.757983
5   0.016545  0.850445  0.817744  0.777962  0.757983
6   0.934829  0.831104  0.879891  0.926879  0.721535
7   0.934829  0.831104  0.879891  0.926879  0.721535
8   0.934829  0.831104  0.879891  0.926879  0.721535
9   0.117642  0.145906  0.199844  0.437564  0.100702
10  0.117642  0.145906  0.199844  0.437564  0.100702
11  0.117642  0.145906  0.199844  0.437564  0.100702

numpy

r = np.arange(len(df)).repeat(3)
pd.DataFrame(df.values[r], columns=df.columns)

           a         b         c         d         e
0   0.444939  0.407554  0.460148  0.465239  0.462691
1   0.444939  0.407554  0.460148  0.465239  0.462691
2   0.444939  0.407554  0.460148  0.465239  0.462691
3   0.016545  0.850445  0.817744  0.777962  0.757983
4   0.016545  0.850445  0.817744  0.777962  0.757983
5   0.016545  0.850445  0.817744  0.777962  0.757983
6   0.934829  0.831104  0.879891  0.926879  0.721535
7   0.934829  0.831104  0.879891  0.926879  0.721535
8   0.934829  0.831104  0.879891  0.926879  0.721535
9   0.117642  0.145906  0.199844  0.437564  0.100702
10  0.117642  0.145906  0.199844  0.437564  0.100702
11  0.117642  0.145906  0.199844  0.437564  0.100702
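Note that repeat places the copies of each row next to each other (0, 0, 0, 1, 1, 1, ...). For the periodic layout described in the question (0, 1, 2, 0, 1, 2, ...), a minimal sketch would be to build the positions with np.tile instead. The small frame and the name k below are illustrative assumptions, not part of the data above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, not the frame shown above)
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
k = 4  # number of periodic copies (hypothetical name)

# np.tile yields positions 0, 1, 2, 0, 1, 2, ... instead of 0, 0, 0, 1, 1, 1, ...
r = np.tile(np.arange(len(df)), k)

# Positional indexing, then a fresh 0..k*n-1 index
dg = df.iloc[r].reset_index(drop=True)
```

Row i of dg is then row i % len(df) of df.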

time test

[image: timing results]
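To run a timing comparison locally, a minimal sketch using the standard-library timeit module might look like this. The frame size (1000 rows), the number of runs, and the function names are arbitrary assumptions; results will vary with machine and pandas version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary test frame; the size is an assumption for illustration
df = pd.DataFrame(np.random.rand(1000, 5), columns=list("abcde"))
r = np.arange(len(df)).repeat(3)

def with_iloc():
    # pandas approach: positional indexing, then reset the index
    return df.iloc[r].reset_index(drop=True)

def with_numpy():
    # numpy approach: index the underlying array, rebuild the frame
    return pd.DataFrame(df.values[r], columns=df.columns)

t_iloc = timeit.timeit(with_iloc, number=100)
t_numpy = timeit.timeit(with_numpy, number=100)
print(f"iloc:  {t_iloc:.4f} s for 100 runs")
print(f"numpy: {t_numpy:.4f} s for 100 runs")
```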

Upvotes: 0

Vikash Singh

Reputation: 14011

If you want the whole frame repeated end to end (1, 2, 1, 2, ...), this should work:

import pandas as pd
x = pd.DataFrame({"data": [1,2]})
df = pd.concat([x]*5, ignore_index=True)
df

output:

    data
0   1
1   2
2   1
3   2
4   1
... (the pattern continues through index 9)

If you instead want each row repeated consecutively (1, 1, 1, 2, 2, 2), you can go with this approach:

import numpy as np
df = x.loc[np.repeat(x.index.values, 3)]
df

output:

    data
0   1
0   1
0   1
1   2
1   2
1   2
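As the output shows, loc keeps the original labels, so the index contains duplicates (0, 0, 0, 1, 1, 1). If a fresh 0..n-1 index is wanted, a minimal sketch would be to chain reset_index(drop=True), assuming the old labels do not need to be kept:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({"data": [1, 2]})

# Repeat each row 3 times, then discard the duplicated labels
df = x.loc[np.repeat(x.index.values, 3)].reset_index(drop=True)
```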

Upvotes: 2
