Cristian FL

Reputation: 143

Optimum way to concatenate new metadata and replicate to all rows (Pandas)

I'm new to Pandas. This should be easy but my approach doesn't seem optimum.

I have a dummy data frame with several rows and columns. Next, I want to include new columns with some metadata that have single values (one row). I want to include this as new columns and I want to replicate these values to all existing rows (this will make much easier doing some later operations with the data).

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'LAT': ['LAT0'],
                    'LON': ['LON0'],
                    'TIME': ['T0']},
                   index=[0])

I want a single data frame with 4 rows and 7 columns (LAT,LON,TIME,A,B,C,D). I'd like the values from LAT, LON and TIME repeated for every single row.

My approach is:

df1.insert(loc=0, column='LAT', value=df2['LAT'][0])

and repeat for each variable in df2. That seems clumsy and not the right way to do this. Alternatively, I can try:

df2.append(df1)

But that does not replicate the values. It fills with NaNs.
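For illustration, the behavior above can be sketched with pd.concat (which replaced append in pandas 2.0; append behaved the same way):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
df2 = pd.DataFrame({'LAT': ['LAT0']}, index=[0])

# Concatenation stacks the rows instead of broadcasting the metadata:
# cells with no matching column are filled with NaN, not replicated.
result = pd.concat([df2, df1])
print(result)
```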

What is the optimum way to carry out this simple concatenation and replication of single-row data?

Upvotes: 3

Views: 176

Answers (2)

jezrael

Reputation: 863166

The simplest approach is assign with dictionary unpacking via ** to add the new columns; note the column names need to be strings usable as keyword arguments (single words):

df1 = df1.assign(**df2.iloc[0])
print (df1)
    A   B   C   D   LAT   LON TIME
0  A0  B0  C0  D0  LAT0  LON0   T0
1  A1  B1  C1  D1  LAT0  LON0   T0
2  A2  B2  C2  D2  LAT0  LON0   T0
3  A3  B3  C3  D3  LAT0  LON0   T0
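If the column names cannot be passed as keyword arguments (the caveat above), a plain loop over df2's columns achieves the same result; a minimal sketch with smaller frames than the question's:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'LAT': ['LAT0'], 'LON': ['LON0']})

# Assign each metadata value as a new column; pandas broadcasts
# the scalar to every row of df1 automatically.
for col in df2.columns:
    df1[col] = df2.at[0, col]

print(df1)
```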

Another solution, which prepends the columns, is reindex with join:

df1 = df2.iloc[[0]].reindex(df1.index, method='ffill').join(df1)
print (df1)

    LAT   LON TIME   A   B   C   D
0  LAT0  LON0   T0  A0  B0  C0  D0
1  LAT0  LON0   T0  A1  B1  C1  D1
2  LAT0  LON0   T0  A2  B2  C2  D2
3  LAT0  LON0   T0  A3  B3  C3  D3

A very similar approach uses the DataFrame constructor:

df3 = pd.DataFrame(df2.iloc[0].to_dict(), index=df1.index)
print (df3)
    LAT   LON TIME
0  LAT0  LON0   T0
1  LAT0  LON0   T0
2  LAT0  LON0   T0
3  LAT0  LON0   T0

df1 = df3.join(df1)
print (df1)
    LAT   LON TIME   A   B   C   D
0  LAT0  LON0   T0  A0  B0  C0  D0
1  LAT0  LON0   T0  A1  B1  C1  D1
2  LAT0  LON0   T0  A2  B2  C2  D2
3  LAT0  LON0   T0  A3  B3  C3  D3

Another NumPy solution uses numpy.broadcast_to. Be careful if the columns do not all share the same dtype (e.g. strings); some casting may need to be applied:

import numpy as np

df3 = pd.DataFrame(np.broadcast_to(df2.values, (len(df1),len(df2.columns))),
                   columns=df2.columns, index=df1.index)
print (df3)
    LAT   LON TIME
0  LAT0  LON0   T0
1  LAT0  LON0   T0
2  LAT0  LON0   T0
3  LAT0  LON0   T0

df1 = df3.join(df1)
print (df1)

    LAT   LON TIME   A   B   C   D
0  LAT0  LON0   T0  A0  B0  C0  D0
1  LAT0  LON0   T0  A1  B1  C1  D1
2  LAT0  LON0   T0  A2  B2  C2  D2
3  LAT0  LON0   T0  A3  B3  C3  D3

Performance:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

#[400000 rows x 4 columns]
df1 = pd.concat([df1] * 100000, ignore_index=True)

df2 = pd.DataFrame({'LAT': ['LAT0'],
                    'LON': ['LON0'],
                    'TIME': ['T0']},
                    index=[0])



In [286]: %timeit df1.assign(**df2.iloc[0])
23 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [287]: %timeit df2.iloc[[0]].reindex(df1.index, method='ffill').join(df1)
35.7 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [288]: %timeit pd.DataFrame(df2.iloc[0].to_dict(), index=df1.index).join(df1)
54.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [289]: %timeit pd.DataFrame(np.broadcast_to(df2.values, (len(df1),len(df2.columns))), columns=df2.columns, index=df1.index).join(df1)
27.8 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#bunji solution
In [290]: %timeit df1.join(df2, how='outer').fillna(method='ffill')
244 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 4

bunji

Reputation: 5223

Another option is:

df = df1.join(df2, how='outer').fillna(method='ffill')
print(df)

    A   B   C   D   LAT   LON TIME
0  A0  B0  C0  D0  LAT0  LON0   T0
1  A1  B1  C1  D1  LAT0  LON0   T0
2  A2  B2  C2  D2  LAT0  LON0   T0
3  A3  B3  C3  D3  LAT0  LON0   T0

Note that how='outer' is only really necessary if df1 has fewer rows than df2, since join does a left join by default.
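A note for newer pandas: fillna(method='ffill') is deprecated (since pandas 2.1), so the same idea can be written with DataFrame.ffill; a sketch assuming frames shaped like the question's:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3']}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({'LAT': ['LAT0'], 'TIME': ['T0']}, index=[0])

# The join leaves NaN in rows 1-3 for LAT and TIME; ffill() then
# copies row 0's metadata down to every later row.
df = df1.join(df2, how='outer').ffill()
print(df)
```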

Upvotes: 1
