Reputation: 1000
I am having a problem with a relatively simple task...
I have two dataframes:
df_sample
which I read from csv
+------+-----------+-------+-----------+
| key | Full Text | Date | Publisher |
+------+-----------+-------+-----------+
| abcd | foofoo | date1 | a |
| bcde | barbar | date2 | b |
| cdef | foobar | date3 | c |
+------+-----------+-------+-----------+
len(df_sample) = 20000
df_labels
which I read from excel
+------+----------+--------+--------+
| key | relevant | other | other2 |
+------+----------+--------+--------+
| abcd | yes | blabla | blabla |
| bcde | no | blabla | blabla |
| cdef | no | blabla | blabla |
| defg | yes | blabla | blabla |
+------+----------+--------+--------+
len(df_labels) = 219000
I would like to join both tables on key, assigning the relevant
value to each key from the first dataframe. The desired output would look like this:
+------+-----------+-------+-----------+----------+
| key | Full Text | Date | Publisher | relevant |
+------+-----------+-------+-----------+----------+
| abcd | foofoo | date1 | a | yes |
| bcde | barbar | date2 | b | no |
| cdef | foobar | date3 | c | no |
+------+-----------+-------+-----------+----------+
I seem to achieve that, but why does the following give me 27377 results instead of 20000 (the length of the original left table)?
df = pd.merge(left=df_sample, right=df_labels, on="key")
Upvotes: 1
Views: 36
Reputation: 394179
You're seeing additional rows because the keys are not unique across both dfs; in your case the duplicates are in the second df. A merge produces one output row for every matching pair of rows, so a key that appears twice on the right repeats the matching left row twice. You'll need to decide whether you want those repeated rows (the current behaviour) or want to drop the duplicates in the second df first:
df_labels = df_labels.drop_duplicates(subset='key')
By default this keeps just the first occurrence of each duplicated key; if you want alternate behaviour, such as keeping the last, you can pass keep='last'.
see the docs
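A minimal sketch of the problem and the fix, using toy data modelled on the tables in the question (the duplicate key "bcde" in the lookup table is my own invented example):

```python
import pandas as pd

# Toy versions of the two tables; df_labels contains "bcde" twice.
df_sample = pd.DataFrame({
    "key": ["abcd", "bcde", "cdef"],
    "Full Text": ["foofoo", "barbar", "foobar"],
})
df_labels = pd.DataFrame({
    "key": ["abcd", "bcde", "bcde", "cdef", "defg"],
    "relevant": ["yes", "no", "yes", "no", "yes"],
})

# Count how many keys in the lookup table are duplicated.
n_dupes = df_labels["key"].duplicated().sum()  # 1 here

# A plain merge repeats the left row once per matching right row,
# so it returns 4 rows instead of the 3 rows of df_sample.
inflated = pd.merge(left=df_sample, right=df_labels, on="key")

# Dropping duplicate keys first (keeping the first occurrence) makes the
# lookup one-to-one, so the result matches len(df_sample) again.
deduped = df_labels.drop_duplicates(subset="key")
df = pd.merge(left=df_sample, right=deduped, on="key")
```

Which row survives the dedup matters when duplicated keys carry conflicting relevant values, so check which occurrence you actually want before merging.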
Upvotes: 2