Reputation: 1679
I have 2 dataframes, they look like this:
x
start stop data word
0 6.73 7.26 0 Oh,
1 7.27 12.81 1 okay.
2 12.82 12.90 1 Go
3 12.91 14.73 0 ahead
4 14.74 15.03 0 Fred.
5 15.04 15.09 1 I
6 15.10 15.26 1 was,
7 15.27 15.37 1 I
8 15.38 15.51 1 don't
9 15.52 15.75 1 remember
10 15.76 15.87 1 you
11 15.88 16.02 1 want
12 16.03 16.08 1 to
13 16.09 17.10 1 go?
14 17.11 17.58 0 Um,
15 17.59 17.67 0 let
16 17.68 17.76 0 me
17 17.77 21.93 0 think.
18 21.94 22.20 1 Well,
19 22.21 22.56 1 most
y
word metadata
0 Oh, 0
1 okay. 1
2 Go 2
3 ahead 3
4 Fred. 3
5 I 4
6 was, 4
7 I 4
8 don't 4
9 remember 4
10 you 4
11 want 4
12 to 4
13 go? 4
14 Um, 5
15 let 5
16 me 5
17 think. 5
18 Well, 6
19 most 6
I want to merge them together on the column word
. To do so, I am using the following command:
pd.merge(x,y, on='word')
but I get this result:
start stop data word metadata
0 6.73 7.26 0 Oh, 0
1 7.27 12.81 1 okay. 1
2 12.82 12.90 1 Go 2
3 12.91 14.73 0 ahead 3
4 14.74 15.03 0 Fred. 3
5 15.04 15.09 1 I 4
6 15.04 15.09 1 I 4
7 15.27 15.37 1 I 4
8 15.27 15.37 1 I 4
9 15.10 15.26 1 was, 4
10 15.38 15.51 1 don't 4
11 15.52 15.75 1 remember 4
12 15.76 15.87 1 you 4
13 15.88 16.02 1 want 4
14 16.03 16.08 1 to 4
15 16.09 17.10 1 go? 4
16 17.11 17.58 0 Um, 5
17 17.59 17.67 0 let 5
18 17.68 17.76 0 me 5
19 17.77 21.93 0 think. 5
20 21.94 22.20 1 Well, 6
21 22.21 22.56 1 most 6
Why am I getting duplication of some rows? Specifically, look at rows 5:8:
5 15.04 15.09 1 I 4
6 15.04 15.09 1 I 4
7 15.27 15.37 1 I 4
8 15.27 15.37 1 I 4
I've been searching SO for solutions and seeing recommendations to use drop_duplicates(), but this doesn't solve my problem because it messes with the order, as you can see below:
print(pd.merge(x,y, on='word').drop_duplicates())
start stop data word metadata
0 6.73 7.26 0 Oh, 0
1 7.27 12.81 1 okay. 1
2 12.82 12.90 1 Go 2
3 12.91 14.73 0 ahead 3
4 14.74 15.03 0 Fred. 3
5 15.04 15.09 1 I 4
7 15.27 15.37 1 I 4
9 15.10 15.26 1 was, 4
10 15.38 15.51 1 don't 4
11 15.52 15.75 1 remember 4
12 15.76 15.87 1 you 4
13 15.88 16.02 1 want 4
14 16.03 16.08 1 to 4
15 16.09 17.10 1 go? 4
16 17.11 17.58 0 Um, 5
17 17.59 17.67 0 let 5
18 17.68 17.76 0 me 5
19 17.77 21.93 0 think. 5
20 21.94 22.20 1 Well, 6
21 22.21 22.56 1 most 6
How can I get this result below? What am I misunderstanding about pd.merge?
start stop data word metadata
0 6.73 7.26 0 Oh, 0
1 7.27 12.81 1 okay. 1
2 12.82 12.90 1 Go 2
3 12.91 14.73 0 ahead 3
4 14.74 15.03 0 Fred. 3
5 15.04 15.09 1 I 4
6 15.10 15.26 1 was, 4
7 15.27 15.37 1 I 4
8 15.38 15.51 1 don't 4
9 15.52 15.75 1 remember 4
10 15.76 15.87 1 you 4
11 15.88 16.02 1 want 4
12 16.03 16.08 1 to 4
13 16.09 17.10 1 go? 4
14 17.11 17.58 0 Um, 5
15 17.59 17.67 0 let 5
16 17.68 17.76 0 me 5
17 17.77 21.93 0 think. 5
18 21.94 22.20 1 Well, 6
19 22.21 22.56 1 most 6
@YOBEN_S solution almost works. However, when I tried extending this to the whole dataframe, the column metadata
became corrupted when a word was repeated. See below when I extend x and y.
x
start stop data word
5 15.040 15.090 1 I
6 15.100 15.260 1 was,
7 15.270 15.370 1 I
8 15.380 15.510 1 don't
9 15.520 15.750 1 remember
10 15.760 15.870 1 you
11 15.880 16.020 1 want
12 16.030 16.080 1 to
13 16.090 17.100 1 go?
14 17.110 17.580 0 Um,
15 17.590 17.670 0 let
16 17.680 17.760 0 me
17 17.770 21.930 0 think.
18 21.940 22.200 1 Well,
19 22.210 22.560 1 most
20 22.570 22.620 1 of
21 22.630 23.850 1 my
22 23.860 24.330 1 problem
23 24.340 24.540 1 with
24 24.550 25.350 1 this
25 25.360 27.660 1 latest
26 27.670 28.200 1 task
27 28.210 28.320 1 is
28 28.330 28.560 1 just
29 28.570 28.680 1 that
30 28.690 28.770 1 I
31 28.780 29.040 1 felt
y
word metadata
5 I 4
6 was, 4
7 I 4
8 don't 4
9 remember 4
10 you 4
11 want 4
12 to 4
13 go? 4
14 Um, 5
15 let 5
16 me 5
17 think. 5
18 Well, 6
19 most 6
20 of 6
21 my 6
22 problem 6
23 with 6
24 this 6
25 latest 6
26 task 6
27 is 6
28 just 6
29 that 6
30 I 6
31 felt 6
print(x.merge(y.drop_duplicates('word'),on='word',how='left'))
start stop data word m m. metadata
5 15.040 15.090 1 I 4
6 15.100 15.260 1 was, 4
7 15.270 15.370 1 I 4
8 15.380 15.510 1 don't 4
9 15.520 15.750 1 remember 4
10 15.760 15.870 1 you 4
11 15.880 16.020 1 want 4
12 16.030 16.080 1 to 4
13 16.090 17.100 1 go? 4
14 17.110 17.580 0 Um, 5
15 17.590 17.670 0 let 5
16 17.680 17.760 0 me 5
17 17.770 21.930 0 think. 5
18 21.940 22.200 1 Well, 6
19 22.210 22.560 1 most 6
20 22.570 22.620 1 of 6
21 22.630 23.850 1 my 6
22 23.860 24.330 1 problem 6
23 24.340 24.540 1 with 6
24 24.550 25.350 1 this 6
25 25.360 27.660 1 latest 6
26 27.670 28.200 1 task 6
27 28.210 28.320 1 is 6
28 28.330 28.560 1 just 6
29 28.570 28.680 1 that 6
30 28.690 28.770 1 I 4
31 28.780 29.040 1 felt 6
Notice how on row 30, metadata
is 4
when it should be 6
, probably because the first time I
was seen it was associated with metadata
4
. How do I resolve this?
This is what I want:
start stop data word m m. metadata
5 15.040 15.090 1 I 4
6 15.100 15.260 1 was, 4
7 15.270 15.370 1 I 4
8 15.380 15.510 1 don't 4
9 15.520 15.750 1 remember 4
10 15.760 15.870 1 you 4
11 15.880 16.020 1 want 4
12 16.030 16.080 1 to 4
13 16.090 17.100 1 go? 4
14 17.110 17.580 0 Um, 5
15 17.590 17.670 0 let 5
16 17.680 17.760 0 me 5
17 17.770 21.930 0 think. 5
18 21.940 22.200 1 Well, 6
19 22.210 22.560 1 most 6
20 22.570 22.620 1 of 6
21 22.630 23.850 1 my 6
22 23.860 24.330 1 problem 6
23 24.340 24.540 1 with 6
24 24.550 25.350 1 this 6
25 25.360 27.660 1 latest 6
26 27.670 28.200 1 task 6
27 28.210 28.320 1 is 6
28 28.330 28.560 1 just 6
29 28.570 28.680 1 that 6
30 28.690 28.770 1 I 6
31 28.780 29.040 1 felt 6
Upvotes: 0
Views: 83
Reputation: 113
Before performing the below merge you have to assure that primary key of y is 'word'
pd.merge(x,y, on='word')
Since, it is not the case you have to use .drop_duplicate()
y = y.drop_duplicate()
pd.merge(x,y, on='word')
If it doesn't work for the whole dataset then it's likely because one word has multiple values. use groupby and use either of those and delete the other data
Upvotes: 0
Reputation: 323396
You should drop_duplicates
before merge
x.merge(y.drop_duplicates('word'),on='word',how='left')
start stop data word metadata
0 6.73 7.26 0 Oh, 0
1 7.27 12.81 1 okay. 1
2 12.82 12.90 1 Go 2
3 12.91 14.73 0 ahead 3
4 14.74 15.03 0 Fred. 3
5 15.04 15.09 1 I 4
6 15.10 15.26 1 was, 4
7 15.27 15.37 1 I 4
8 15.38 15.51 1 don't 4
9 15.52 15.75 1 remember 4
10 15.76 15.87 1 you 4
11 15.88 16.02 1 want 4
12 16.03 16.08 1 to 4
13 16.09 17.10 1 go? 4
14 17.11 17.58 0 Um, 5
15 17.59 17.67 0 let 5
16 17.68 17.76 0 me 5
17 17.77 21.93 0 think. 5
18 21.94 22.20 1 Well, 6
19 22.21 22.56 1 most 6
Update create another key with cumcount
x.assign(key=x.groupby('word').cumcount()).merge(y.assign(key=y.groupby('word').cumcount()),on=['word','key'],how='left')
start stop data word key metadata
0 15.04 15.09 1 I 0 4
1 15.10 15.26 1 was, 0 4
2 15.27 15.37 1 I 1 4
3 15.38 15.51 1 don't 0 4
4 15.52 15.75 1 remember 0 4
5 15.76 15.87 1 you 0 4
6 15.88 16.02 1 want 0 4
7 16.03 16.08 1 to 0 4
8 16.09 17.10 1 go? 0 4
9 17.11 17.58 0 Um, 0 5
10 17.59 17.67 0 let 0 5
11 17.68 17.76 0 me 0 5
12 17.77 21.93 0 think. 0 5
13 21.94 22.20 1 Well, 0 6
14 22.21 22.56 1 most 0 6
15 22.57 22.62 1 of 0 6
16 22.63 23.85 1 my 0 6
17 23.86 24.33 1 problem 0 6
18 24.34 24.54 1 with 0 6
19 24.55 25.35 1 this 0 6
20 25.36 27.66 1 latest 0 6
21 27.67 28.20 1 task 0 6
22 28.21 28.32 1 is 0 6
23 28.33 28.56 1 just 0 6
24 28.57 28.68 1 that 0 6
25 28.69 28.77 1 I 2 6
26 28.78 29.04 1 felt 0 6
My thoughts maybe simple assign it ? :
x['New']=y.metadata # ?
Upvotes: 1