Neal Kruis
Neal Kruis

Reputation: 2287

Conditional merge of other multiple Pandas DataFrames

I have four pandas DataFrames (A, B, C, and D). A has a series of timestamps and a single column which refers to one of the other DataFrames:

A

Timestamp    Source
-----------  ------
2012-4-3     B
2013-12-20   C
2012-3-5     C
2014-12-7    D
2012-7-10    B
...

The other DataFrames hold more data:

B

Timestamp   Foo  Bar
----------- ---- ----
2012-1-1    1.5  1.3
2012-1-2    2.3  5.6
2012-1-3    3.4  3.3
...
2014-3-31   0.8  2.1

C

Timestamp   Foo  Bar
----------- ---- ----
2012-1-1    9.2  5.6
2012-1-2    4.8  7.6
2012-1-3    2.7  6.4
...
2014-3-31   7.0  6.5

D

Timestamp   Foo  Bar
----------- ---- ----
2012-1-1    6.8  4.2
2012-1-2    4.2  9.3
2012-1-3    5.5  0.7
...
2014-3-31   6.3  2.0

I want to construct a single DataFrame from A, B, C, and D that has three columns (Timestamp, Foo, and Bar) where the values of Foo and Bar come from the corresponding Timestamp in the DataFrame listed as the Source in A.

Not all Timestamps in A appear in the other three DataFrames, in which case, I'd like the values of Foo and Bar to be np.nan. Not all timestamps in B, C, and D appear in A, and simply won't appear in the final DataFrame.

My current approach is to loop through each row in A and return the values from the corresponding Source DataFrame:

srcs = {'B': B, 'C': C, 'D': D}
A['Foo'] = np.nan
A['Bar'] = np.nan

for i in range(len(A)):
    ts = A.iloc[i].Timestamp
    src = A.iloc[i].Source
    A.iloc[i].Foo = srcs[src][srcs[src].Timestamp == ts].Foo
    A.iloc[i].Bar = srcs[src][srcs[src].Timestamp == ts].Bar

There has to be a more efficient, more Pandithic(?) way to perform this action?

Upvotes: 0

Views: 385

Answers (2)

piRSquared
piRSquared

Reputation: 294338

Setup

import pandas as pd
from StringIO import StringIO

texta = """Timestamp    Source
2012-4-3     B
2012-4-2     B
2013-12-20   C
2012-3-5     C
2014-12-7    D
2012-7-10    B"""

A = pd.read_csv(StringIO(texta), delim_whitespace=1, parse_dates=[0])

textb = """Timestamp   Foo  Bar
2012-1-1    1.5  1.3
2012-4-3    3.1  4.1
2012-1-2    2.3  5.6
2012-1-3    3.4  3.3
2014-3-31   0.8  2.1"""

B = pd.read_csv(StringIO(textb), delim_whitespace=1, parse_dates=[0])

textc = """Timestamp   Foo  Bar
2012-1-1    9.2  5.6
2012-3-5    4.8  7.6
2012-1-2    4.8  7.6
2012-1-3    2.7  6.4
2014-3-31   7.0  6.5"""

C = pd.read_csv(StringIO(textc), delim_whitespace=1, parse_dates=[0])

textd = """Timestamp   Foo  Bar
2012-1-1    6.8  4.2
2012-1-2    4.2  9.3
2012-1-3    5.5  0.7
2014-3-31   6.3  2.0"""

D = pd.read_csv(StringIO(textd), delim_whitespace=1, parse_dates=[0])

Then I combine with pd.concat just B C and D

bdf = pd.concat([B, C, D], keys=['B', 'C', 'D'])
bdf.reset_index(level=1, inplace=1, drop=1)
bdf.index.name = 'Source'
bdf.reset_index(inplace=1)

print bdf

It looks like this:

   Source  Timestamp  Foo  Bar
0       B 2012-01-01  1.5  1.3
1       B 2012-04-03  3.1  4.1
2       B 2012-01-02  2.3  5.6
3       B 2012-01-03  3.4  3.3
4       B 2014-03-31  0.8  2.1
5       C 2012-01-01  9.2  5.6
6       C 2012-03-05  4.8  7.6
7       C 2012-01-02  4.8  7.6
8       C 2012-01-03  2.7  6.4
9       C 2014-03-31  7.0  6.5
10      D 2012-01-01  6.8  4.2
11      D 2012-01-02  4.2  9.3
12      D 2012-01-03  5.5  0.7
13      D 2014-03-31  6.3  2.0

Finally

a simple merge

A.merge(bdf, how='left')

Looks like:

   Timestamp Source  Foo  Bar
0 2012-04-03      B  3.1  4.1
1 2012-04-02      B  NaN  NaN
2 2013-12-20      C  NaN  NaN
3 2012-03-05      C  4.8  7.6
4 2014-12-07      D  NaN  NaN
5 2012-07-10      B  NaN  NaN

Upvotes: 1

Brad Campbell
Brad Campbell

Reputation: 3071

It looks like you could use a multi-index to do this. Your index would be consist of the Timestamp and Source. You can so this with the set_index method on a DataFrame.

Here is some code to create some fake DataFrames, each with the MultiIndex.

# Imports for creating fake data
from random import random
from random import choice

# Setup the sample data
A = pd.DataFrame({'TimeStamp':range(20), 'Source':[choice(others) for i in range(20)]})
# Create the MultiIndex on A
A.set_index(['TimeStamp', 'Source'], inplace=True)
A['Bar'] = [np.nan] * len(A)
A['Foo'] = [np.nan] * len(A)

B = pd.DataFrame({'TimeStamp':range(5), 
                  'Foo':[random()*5+5 for i in range(5)], 
                  'Bar':[random()*5+5 for i in range(5)]})
C = pd.DataFrame({'TimeStamp':range(5,10), 
                  'Foo':[random()*5+5 for i in range(5)], 
                  'Bar':[random()*5+5 for i in range(5)]})
D = pd.DataFrame({'TimeStamp':range(10,15), 
                  'Foo':[random()*5+5 for i in range(5)], 
                  'Bar':[random()*5+5 for i in range(5)]})

sources = {'B':B, 'C':C, 'D':D}

# create the MultiIndex on the Source data sets
for s, df in sources.items():
    df['Source'] = [s]*len(df)
    df.set_index(['TimeStamp', 'Source'], inplace=True)

Now you can index the Source data sets (B, C, and D) using the index on A.

for s, df in sources.items():    

    temp = df.loc[A.index]  # the source data set indexed by A's index
                            # this will contain NaN's where df does not
                            # have corresponding index entries
    temp.dropna(inplace=True) # dropping the NaN values leaves you with 
                             # only the values in df matching the index in A
    if len(temp) > 0:
        A.loc[temp.index] = temp  # now assign the data to A

print(A)

The result looks like:

                       Bar       Foo
TimeStamp Source                    
0         D            NaN       NaN
1         C            NaN       NaN
2         D            NaN       NaN
3         B       7.927154  8.581380
4         B       7.638422  5.970348
5         D            NaN       NaN
6         C       6.938001  6.417248
7         B            NaN       NaN
8         C       5.131940  9.144621
9         B            NaN       NaN
10        D       9.186963  5.991877
11        D       8.070543  7.735040
12        C            NaN       NaN
13        B            NaN       NaN
14        C            NaN       NaN
15        D            NaN       NaN
16        C            NaN       NaN
17        C            NaN       NaN
18        C            NaN       NaN
19        B            NaN       NaN

Upvotes: 2

Related Questions