Reputation: 12509
I have one dataframe, A
, that looks like this:
+---+------------+-----------+-----------+-----+-------+
| | time | uid | o_uid | msg | count |
+---+------------+-----------+-----------+-----+-------+
| 0 | 1433131357 | 191470529 | 191159572 | eis | 1 |
| 1 | 1433131410 | 191458009 | 160429326 | eis | 1 |
| 2 | 1433131504 | 191470523 | 153734142 | eis | 1 |
| 3 | 1433131685 | 191470551 | 191470546 | eis | 1 |
| 4 | 1433131782 | 191470565 | 187367195 | eis | 1 |
+---+------------+-----------+-----------+-----+-------+
And another dataframe, B
, that looks like this:
+---+------------+-----------+-------+
| | time | uid | count |
+---+------------+-----------+-------+
| 0 | 1433131967 | 191470529 | 1 |
| 1 | 1433132503 | 191466638 | 1 |
| 2 | 1433139333 | 191451858 | 1 |
| 3 | 1433141249 | 191470551 | 1 |
| 4 | 1433143867 | 191471209 | 1 |
+---+------------+-----------+-------+
What I would like to do is take all that timestamps from B
and place them in a column in A
where the UIDs
match. Where there is no match, there should be a NaN
I tried this: df = pd.merge(A, B, left_on='uid', right_on='uid', how='outer')
But I think it only appended B
to the bottom of A
. It did not work as expected.
Upvotes: 0
Views: 56
Reputation: 24742
I think left_join
is most appropriate in your case. This can be achieved by setting how=left
import pandas as pd
# your data
# ============================
print(df_A)
Out[33]:
time uid o_uid msg count
0 1433131357 191470529 191159572 eis 1
1 1433131410 191458009 160429326 eis 1
2 1433131504 191470523 153734142 eis 1
3 1433131685 191470551 191470546 eis 1
4 1433131782 191470565 187367195 eis 1
print(df_B)
Out[35]:
time uid count
0 1433131967 191470529 1
1 1433132503 191466638 1
2 1433139333 191451858 1
3 1433141249 191470551 1
4 1433143867 191471209 1
# processing
# ============================
df = pd.merge(df_A, df_B, left_on='uid', right_on='uid',how='left', suffixes=['_A', '_B'])
Out[45]:
time_A uid o_uid msg count_A time_B count_B
0 1433131357 191470529 191159572 eis 1 1.4331e+09 1
1 1433131410 191458009 160429326 eis 1 NaN NaN
2 1433131504 191470523 153734142 eis 1 NaN NaN
3 1433131685 191470551 191470546 eis 1 1.4331e+09 1
4 1433131782 191470565 187367195 eis 1 NaN NaN
Upvotes: 2
Reputation: 109528
Both your time and count columns overlap in both DataFrames, so you need to supply suffixes
as a parameter. In the example below, I use a null suffix for df_a
and '_b'
as the suffix for df_b
.
import pandas as pd
df_a = pd.DataFrame({'count': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'msg': {0: 'eis', 1: 'eis', 2: 'eis', 3: 'eis', 4: 'eis'},
'o_uid': {0: 191159572, 1: 160429326, 2: 153734142, 3: 191470546, 4: 187367195},
'time': {0: 1433131357, 1: 1433131410, 2: 1433131504, 3: 1433131685, 4: 1433131782},
'uid': {0: 191470529, 1: 191458009, 2: 191470523, 3: 191470551, 4: 191470565}})
df_b = pd.DataFrame({'count': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'time': {0: 1433131967, 1: 1433132503, 2: 1433139333, 3: 1433141249, 4: 1433143867},
'uid': {0: 191470529, 1: 191466638, 2: 191451858, 3: 191470551, 4: 191471209}})
>>> df_a.merge(df_b, how='outer', on='uid', suffixes=['', '_b'])
count msg o_uid time uid count_b time_b
0 1 eis 191159572 1433131357 191470529 1 1433131967
1 1 eis 160429326 1433131410 191458009 NaN NaN
2 1 eis 153734142 1433131504 191470523 NaN NaN
3 1 eis 191470546 1433131685 191470551 1 1433141249
4 1 eis 187367195 1433131782 191470565 NaN NaN
5 NaN NaN NaN NaN 191466638 1 1433132503
6 NaN NaN NaN NaN 191451858 1 1433139333
7 NaN NaN NaN NaN 191471209 1 1433143867
Upvotes: 1