Eric Kim
Eric Kim

Reputation: 2698

Python Pandas Dataframe, merge df rows while skipping overlapping rows

The title doesn't make a clear sense, but it was the best I could describe.

I have DataFrames that look like these:

data1 = np.matrix([[4,75,2,5,84,2,6,5,554],[4,6,67,6,4,5,8,5,8]]).T
data2 = np.matrix([[3,46,4,555,556,557,558,559,560],[1,2,4,1,3,5,3,1,5]]).T

data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)

>>> data1
     0   1
0    4   4
1   75   6
2    2  67
3    5   6
4   84   4
5    2   5
6    6   8
7    5   5
8  554   8


>>> data2
     0  1
0    3  1
1   46  2
2    4  4
3  555  1
4  556  3
5  557  5
6  558  3
7  559  1
8  560  5

I want to append data2 to the bottom of data1. However, I want to append rows of data2 whose column1 values are greater than or equal to the 554, which is the last row of column1 of data1

Here is the output I want:

>>> merged_df
         0   1
    0    4   4
    1   75   6
    2    2  67
    3    5   6
    4   84   4
    5    2   5
    6    6   8
    7    5   5
    8  554   8
    9  555   1
   10  556   3
   11  557   5
   12  558   3
   13  559   1
   14  560   5

So, some of the first rows of data2 will be skipped when appending to data1.

It is assumed that the last row of data1 is its maximum, and the rows of data2 are sorted after the last value of data1, which is 554

Is there any elegant way to do this job using Pandas toolbox?

Upvotes: 3

Views: 719

Answers (1)

jezrael
jezrael

Reputation: 862541

Use concat with filter second DataFrame by boolean indexing:

print (data1.iloc[-1, 0])
554

df = pd.concat([data1, data2[data2[0] > data1.iloc[-1, 0]]], ignore_index=True)
print (df)
      0   1
0     4   4
1    75   6
2     2  67
3     5   6
4    84   4
5     2   5
6     6   8
7     5   5
8   554   8
9   555   1
10  556   3
11  557   5
12  558   3
13  559   1
14  560   5

Also for general solution compare by max value:

df = pd.concat([data1, data2[data2[0] > data1[0].max()]], ignore_index=True)

Solution if custom columns names:

data1 = pd.DataFrame(data1, columns=list('ab'))
data2 = pd.DataFrame(data2, columns=list('ab'))

print (data1.iloc[-1, data1.columns.get_loc('a')])
554

data22 = data2[data2['a'] > data1.iloc[-1, data1.columns.get_loc('a')]]
df = pd.concat([data1, data22], ignore_index=True)
print (df)
      a   b
0     4   4
1    75   6
2     2  67
3     5   6
4    84   4
5     2   5
6     6   8
7     5   5
8   554   8
9   555   1
10  556   3
11  557   5
12  558   3
13  559   1
14  560   5

Upvotes: 3

Related Questions