Reputation: 1000
I am having a problem with a relatively simple task...
I have two dataframes:
df_sample
which I read from csv
+------+-----------+-------+-----------+
| key | Full Text | Date | Publisher |
+------+-----------+-------+-----------+
| abcd | foofoo | date1 | a |
| bcde | barbar | date2 | b |
| cdef | foobar | date3 | c |
+------+-----------+-------+-----------+
len(df_sample) = 20000
df_labels
which I read from excel
+------+----------+--------+--------+
| key | relevant | other | other2 |
+------+----------+--------+--------+
| abcd | yes | blabla | blabla |
| bcde | no | blabla | blabla |
| cdef | no | blabla | blabla |
| defg | yes | blabla | blabla |
+------+----------+--------+--------+
len(df_labels) = 219000
I would like to join both tables on key, assigning the relevant
value to each key from the first dataframe. The desired output would look like this:
+------+-----------+-------+-----------+----------+
| key | Full Text | Date | Publisher | relevant |
+------+-----------+-------+-----------+----------+
| abcd | foofoo | date1 | a | yes |
| bcde | barbar | date2 | b | no |
| cdef | foobar | date3 | c | no |
+------+-----------+-------+-----------+----------+
I seem to achieve that, but why does the following give me 27377 results instead of 20000 (the length of the original left table)?
df = pd.merge(left=df_sample, right=df_labels, on="key")
Upvotes: 1
Views: 36
Reputation: 394179
You're seeing additional rows because the keys are not unique across both dfs; in your case the duplicates are in the second df. A merge produces one output row for every matching pair of rows, so a key that appears twice on the right repeats the matching left row twice. You'll need to decide whether you want those repeated rows (the current behaviour) or want to drop the duplicates in the second df first:
df_labels = df_labels.drop_duplicates(subset='key')
By default this keeps just the first occurrence of each duplicated key; if you want alternate behaviour, such as keeping the last, you can pass keep='last'.
see the docs
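A minimal sketch of the problem and the fix, using toy data modelled on the tables in the question (the duplicate key "bcde" in the lookup table is my own invented example):

```python
import pandas as pd

# Toy versions of the two tables; df_labels contains "bcde" twice.
df_sample = pd.DataFrame({
    "key": ["abcd", "bcde", "cdef"],
    "Full Text": ["foofoo", "barbar", "foobar"],
})
df_labels = pd.DataFrame({
    "key": ["abcd", "bcde", "bcde", "cdef", "defg"],
    "relevant": ["yes", "no", "yes", "no", "yes"],
})

# Count how many keys in the lookup table are duplicated.
n_dupes = df_labels["key"].duplicated().sum()  # 1 here

# A plain merge repeats the left row once per matching right row,
# so it returns 4 rows instead of the 3 rows of df_sample.
inflated = pd.merge(left=df_sample, right=df_labels, on="key")

# Dropping duplicate keys first (keeping the first occurrence) makes the
# lookup one-to-one, so the result matches len(df_sample) again.
deduped = df_labels.drop_duplicates(subset="key")
df = pd.merge(left=df_sample, right=deduped, on="key")
```

Which row survives the dedup matters when duplicated keys carry conflicting relevant values, so check which occurrence you actually want before merging.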
Upvotes: 2