Reputation: 794
I have two dataframes and want to merge them by two columns and retain the rest of the columns. Sometimes, Pandas just NaNs out all the other columns which previously had data in it. Thats something which I am not used to from SQL. Am I using the merge command wrong?
The datatypes and column names are: pdBewertungen
[[id]] int64
Matrikelnummer float64
Nachname object
Vorname object
Institution object
Übungsblätter gesamt (Punkte) object
Blatt1 object
Blatt2 object
Blatt3 object
Blatt4 object
Blatt5 object
Zuletzt aus diesem Kurs geladen int64
dtype: object
and pdGruppenBachelor
Vorname object
Nachname object
Matrikelnummer object
Mailadresse object
Gruppe object
Gruppenwahl float64
dtype: object
I would like to join by the pair [Vorname, Nachname] (sorry, that those names are German).
The result gives me a dataframe where Blatt1, Blatt2, ...are all NaN, even though they where integer before. The number of resulting columns is correct.
The command is:
dfBA = pd.merge(pdGruppenBachelor, pdBewertungen, how='left', on=['Vorname','Nachname'])
Sorry, I cannot give example code since the CSVs behind it are student data.
Is there anything I do wrong?
Upvotes: 0
Views: 241
Reputation: 1203
1st of all, it's not clear from your question what's each table name we can assume that the tables that contains the blats is the right one, if the join is left
There are a few things to check, in my opinion 1st of all, make an inner join, as per your example:
dfBA = pd.merge(pdGruppenBachelor, pdBewertungen, how='inner', on=['Vorname','Nachname'])
In that case, no NULLs shold appear, even if the number of rows will decrease. If that is the case, it will simply mean that there are not many (or few) matches between the tables... table1 holds people ['a','b','c'] while table2 holds ['a','c','d'] for example
Another check is look for NULLs for the keys in in each table:
pdGruppenBachelor[['Vorname','Nachname']].isna().sum()
pdBewertungen[['Vorname','Nachname']].isna().sum()
each NULL case, even partial (only missing Vorname) can affect the JOIN The last thing I would check, is if there aren't any spaces in the names, or upper/ lower cases: "John Smith" vs "joHn sMith "
so a possible solution should look like
pdGruppenBachelor['Vorname'] = pdGruppenBachelor['Vorname'].str.lower().strip()
pdBewertungen['Vorname'] = pdBewertungen['Vorname'].str.lower().strip()
pdGruppenBachelor['Nachname'] = pdGruppenBachelor['Nachname'].str.lower().strip()
pdBewertungen['Nachname'] = pdBewertungen['Nachname'].str.lower().strip()
then try and join
Upvotes: 2