connor449

Reputation: 1679

Pandas merge causing unwanted duplicates

I have 2 dataframes, they look like this:

x

    start   stop           data      word
0    6.73   7.26              0       Oh,
1    7.27  12.81              1     okay.
2   12.82  12.90              1        Go
3   12.91  14.73              0     ahead
4   14.74  15.03              0     Fred.
5   15.04  15.09              1         I
6   15.10  15.26              1      was,
7   15.27  15.37              1         I
8   15.38  15.51              1     don't
9   15.52  15.75              1  remember
10  15.76  15.87              1       you
11  15.88  16.02              1      want
12  16.03  16.08              1        to
13  16.09  17.10              1       go?
14  17.11  17.58              0       Um,
15  17.59  17.67              0       let
16  17.68  17.76              0        me
17  17.77  21.93              0    think.
18  21.94  22.20              1     Well,
19  22.21  22.56              1      most
y

        word         metadata
0        Oh,                0
1      okay.                1
2         Go                2
3      ahead                3
4      Fred.                3
5          I                4
6       was,                4
7          I                4
8      don't                4
9   remember                4
10       you                4
11      want                4
12        to                4
13       go?                4
14       Um,                5
15       let                5
16        me                5
17    think.                5
18     Well,                6
19      most                6

I want to merge them together on the column word. To do so, I am using the following command:

pd.merge(x,y, on='word')

but I get this result:

    start   stop           data       word         metadata
0    6.73   7.26              0       Oh,                0
1    7.27  12.81              1     okay.                1
2   12.82  12.90              1        Go                2
3   12.91  14.73              0     ahead                3
4   14.74  15.03              0     Fred.                3
5   15.04  15.09              1         I                4
6   15.04  15.09              1         I                4
7   15.27  15.37              1         I                4
8   15.27  15.37              1         I                4
9   15.10  15.26              1      was,                4
10  15.38  15.51              1     don't                4
11  15.52  15.75              1  remember                4
12  15.76  15.87              1       you                4
13  15.88  16.02              1      want                4
14  16.03  16.08              1        to                4
15  16.09  17.10              1       go?                4
16  17.11  17.58              0       Um,                5
17  17.59  17.67              0       let                5
18  17.68  17.76              0        me                5
19  17.77  21.93              0    think.                5
20  21.94  22.20              1     Well,                6
21  22.21  22.56              1      most                6

Why am I getting duplication of some rows? Specifically, look at rows 5:8:

5   15.04  15.09              1         I                4
6   15.04  15.09              1         I                4
7   15.27  15.37              1         I                4
8   15.27  15.37              1         I                4

I've been searching SO for solutions and seeing recommendations to use drop_duplicates(), but this doesn't solve my problem because it messes with the order, as you can see below:

print(pd.merge(x,y, on='word').drop_duplicates())
    start   stop            data      word         metadata
0    6.73   7.26              0       Oh,                0
1    7.27  12.81              1     okay.                1
2   12.82  12.90              1        Go                2
3   12.91  14.73              0     ahead                3
4   14.74  15.03              0     Fred.                3
5   15.04  15.09              1         I                4
7   15.27  15.37              1         I                4
9   15.10  15.26              1      was,                4
10  15.38  15.51              1     don't                4
11  15.52  15.75              1  remember                4
12  15.76  15.87              1       you                4
13  15.88  16.02              1      want                4
14  16.03  16.08              1        to                4
15  16.09  17.10              1       go?                4
16  17.11  17.58              0       Um,                5
17  17.59  17.67              0       let                5
18  17.68  17.76              0        me                5
19  17.77  21.93              0    think.                5
20  21.94  22.20              1     Well,                6
21  22.21  22.56              1      most                6

How can I get this result below? What am I misunderstanding about pd.merge?

   start   stop         data      word   metadata
0    6.73   7.26              0       Oh,     0
1    7.27  12.81              1     okay.     1
2   12.82  12.90              1        Go     2
3   12.91  14.73              0     ahead     3
4   14.74  15.03              0     Fred.     3
5   15.04  15.09              1         I     4
6   15.10  15.26              1      was,     4
7   15.27  15.37              1         I     4
8   15.38  15.51              1     don't     4
9   15.52  15.75              1  remember     4
10  15.76  15.87              1       you     4
11  15.88  16.02              1      want     4
12  16.03  16.08              1        to     4
13  16.09  17.10              1       go?     4
14  17.11  17.58              0       Um,     5
15  17.59  17.67              0       let     5
16  17.68  17.76              0        me     5
17  17.77  21.93              0    think.     5
18  21.94  22.20              1     Well,     6
19  22.21  22.56              1      most     6

EDIT

@YOBEN_S's solution almost works. However, when I tried extending this to the whole dataframe, the metadata column became corrupted where a word was repeated. See below, where I extend x and y.

x
     start    stop           data       word
5   15.040  15.090              1          I
6   15.100  15.260              1       was,
7   15.270  15.370              1          I
8   15.380  15.510              1      don't
9   15.520  15.750              1   remember
10  15.760  15.870              1        you
11  15.880  16.020              1       want
12  16.030  16.080              1         to
13  16.090  17.100              1        go?
14  17.110  17.580              0        Um,
15  17.590  17.670              0        let
16  17.680  17.760              0         me
17  17.770  21.930              0     think.
18  21.940  22.200              1      Well,
19  22.210  22.560              1       most
20  22.570  22.620              1         of
21  22.630  23.850              1         my
22  23.860  24.330              1    problem
23  24.340  24.540              1       with
24  24.550  25.350              1       this
25  25.360  27.660              1     latest
26  27.670  28.200              1       task
27  28.210  28.320              1         is
28  28.330  28.560              1       just
29  28.570  28.680              1       that
30  28.690  28.770              1          I
31  28.780  29.040              1       felt
y
        word            metadata
5           I                4
6        was,                4
7           I                4
8       don't                4
9    remember                4
10        you                4
11       want                4
12         to                4
13        go?                4
14        Um,                5
15        let                5
16         me                5
17     think.                5
18      Well,                6
19       most                6
20         of                6
21         my                6
22    problem                6
23       with                6
24       this                6
25     latest                6
26       task                6
27         is                6
28       just                6
29       that                6
30          I                6
31       felt                6

print(x.merge(y.drop_duplicates('word'),on='word',how='left'))

    start    stop             data       word    metadata
5   15.040  15.090              1          I                4
6   15.100  15.260              1       was,                4
7   15.270  15.370              1          I                4
8   15.380  15.510              1      don't                4
9   15.520  15.750              1   remember                4
10  15.760  15.870              1        you                4
11  15.880  16.020              1       want                4
12  16.030  16.080              1         to                4
13  16.090  17.100              1        go?                4
14  17.110  17.580              0        Um,                5
15  17.590  17.670              0        let                5
16  17.680  17.760              0         me                5
17  17.770  21.930              0     think.                5
18  21.940  22.200              1      Well,                6
19  22.210  22.560              1       most                6
20  22.570  22.620              1         of                6
21  22.630  23.850              1         my                6
22  23.860  24.330              1    problem                6
23  24.340  24.540              1       with                6
24  24.550  25.350              1       this                6
25  25.360  27.660              1     latest                6
26  27.670  28.200              1       task                6
27  28.210  28.320              1         is                6
28  28.330  28.560              1       just                6
29  28.570  28.680              1       that                6
30  28.690  28.770              1          I                4
31  28.780  29.040              1       felt                6

Notice how on row 30, metadata is 4 when it should be 6, probably because the first time the word 'I' was seen it was associated with metadata 4. How do I resolve this?

This is what I want:

    start    stop             data       word    metadata
5   15.040  15.090              1          I                4
6   15.100  15.260              1       was,                4
7   15.270  15.370              1          I                4
8   15.380  15.510              1      don't                4
9   15.520  15.750              1   remember                4
10  15.760  15.870              1        you                4
11  15.880  16.020              1       want                4
12  16.030  16.080              1         to                4
13  16.090  17.100              1        go?                4
14  17.110  17.580              0        Um,                5
15  17.590  17.670              0        let                5
16  17.680  17.760              0         me                5
17  17.770  21.930              0     think.                5
18  21.940  22.200              1      Well,                6
19  22.210  22.560              1       most                6
20  22.570  22.620              1         of                6
21  22.630  23.850              1         my                6
22  23.860  24.330              1    problem                6
23  24.340  24.540              1       with                6
24  24.550  25.350              1       this                6
25  25.360  27.660              1     latest                6
26  27.670  28.200              1       task                6
27  28.210  28.320              1         is                6
28  28.330  28.560              1       just                6
29  28.570  28.680              1       that                6
30  28.690  28.770              1          I                6
31  28.780  29.040              1       felt                6

Upvotes: 0

Views: 83

Answers (2)

siddharth ranjan

Reputation: 113

Before performing the merge below, you have to make sure that 'word' is a primary key of y (i.e., each word appears only once):

pd.merge(x,y, on='word')

Since that is not the case, you have to use .drop_duplicates() first:

y = y.drop_duplicates()
pd.merge(x,y, on='word')

If it doesn't work for the whole dataset, it's likely because a single word maps to multiple metadata values. Use groupby, keep either of those values, and delete the other data.
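
A minimal sketch of that idea (keeping the first metadata value per word is an assumption made for illustration, not something the answer prescribes):

y_unique = y.groupby('word', as_index=False, sort=False)['metadata'].first()
pd.merge(x, y_unique, on='word', how='left')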

Upvotes: 0

BENY

Reputation: 323396

You should drop_duplicates before the merge: because the word 'I' appears more than once in y, a plain merge matches every 'I' in x with every 'I' in y (a many-to-many join), which is what creates the extra rows.

x.merge(y.drop_duplicates('word'),on='word',how='left')
    start   stop  data      word  metadata
0    6.73   7.26     0       Oh,         0
1    7.27  12.81     1     okay.         1
2   12.82  12.90     1        Go         2
3   12.91  14.73     0     ahead         3
4   14.74  15.03     0     Fred.         3
5   15.04  15.09     1         I         4
6   15.10  15.26     1      was,         4
7   15.27  15.37     1         I         4
8   15.38  15.51     1     don't         4
9   15.52  15.75     1  remember         4
10  15.76  15.87     1       you         4
11  15.88  16.02     1      want         4
12  16.03  16.08     1        to         4
13  16.09  17.10     1       go?         4
14  17.11  17.58     0       Um,         5
15  17.59  17.67     0       let         5
16  17.68  17.76     0        me         5
17  17.77  21.93     0    think.         5
18  21.94  22.20     1     Well,         6
19  22.21  22.56     1      most         6

Update: create another key with cumcount, so that the nth occurrence of a repeated word in x only matches the nth occurrence of that word in y:

x.assign(key=x.groupby('word').cumcount()).merge(y.assign(key=y.groupby('word').cumcount()),on=['word','key'],how='left')
    start   stop  data      word  key  metadata
0   15.04  15.09     1         I    0         4
1   15.10  15.26     1      was,    0         4
2   15.27  15.37     1         I    1         4
3   15.38  15.51     1     don't    0         4
4   15.52  15.75     1  remember    0         4
5   15.76  15.87     1       you    0         4
6   15.88  16.02     1      want    0         4
7   16.03  16.08     1        to    0         4
8   16.09  17.10     1       go?    0         4
9   17.11  17.58     0       Um,    0         5
10  17.59  17.67     0       let    0         5
11  17.68  17.76     0        me    0         5
12  17.77  21.93     0    think.    0         5
13  21.94  22.20     1     Well,    0         6
14  22.21  22.56     1      most    0         6
15  22.57  22.62     1        of    0         6
16  22.63  23.85     1        my    0         6
17  23.86  24.33     1   problem    0         6
18  24.34  24.54     1      with    0         6
19  24.55  25.35     1      this    0         6
20  25.36  27.66     1    latest    0         6
21  27.67  28.20     1      task    0         6
22  28.21  28.32     1        is    0         6
23  28.33  28.56     1      just    0         6
24  28.57  28.68     1      that    0         6
25  28.69  28.77     1         I    2         6
26  28.78  29.04     1      felt    0         6
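
If the helper key column isn't wanted in the output, it can be dropped afterwards (a small cleanup sketch):

(x.assign(key=x.groupby('word').cumcount())
  .merge(y.assign(key=y.groupby('word').cumcount()), on=['word','key'], how='left')
  .drop(columns='key'))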

My thought: maybe simply assign it?

x['New'] = y.metadata  # note: this aligns on the index, so it only works if x and y share the same index labels

Upvotes: 1
