Hemmelig
Hemmelig

Reputation: 794

Pandas Merge NaN in other columns

I have two dataframes and want to merge them by two columns and retain the rest of the columns. Sometimes, Pandas just NaNs out all the other columns which previously had data in it. Thats something which I am not used to from SQL. Am I using the merge command wrong?

The datatypes and column names are: pdBewertungen

[[id]]                               int64
Matrikelnummer                     float64
Nachname                            object
Vorname                             object
Institution                         object
Übungsblätter gesamt (Punkte)       object
Blatt1                              object
Blatt2                              object
Blatt3                              object
Blatt4                              object
Blatt5                              object
Zuletzt aus diesem Kurs geladen      int64
dtype: object

and pdGruppenBachelor

Vorname            object
Nachname           object
Matrikelnummer     object
Mailadresse        object
Gruppe             object
Gruppenwahl       float64
dtype: object

I would like to join by the pair [Vorname, Nachname] (sorry, that those names are German).

The result gives me a dataframe where Blatt1, Blatt2, ...are all NaN, even though they where integer before. The number of resulting columns is correct.

The command is:

dfBA = pd.merge(pdGruppenBachelor, pdBewertungen,  how='left', on=['Vorname','Nachname'])

Sorry, I cannot give example code since the CSVs behind it are student data.

Is there anything I do wrong?

Upvotes: 0

Views: 241

Answers (1)

Guy Louzon
Guy Louzon

Reputation: 1203

1st of all, it's not clear from your question what's each table name we can assume that the tables that contains the blats is the right one, if the join is left

There are a few things to check, in my opinion 1st of all, make an inner join, as per your example:

dfBA = pd.merge(pdGruppenBachelor, pdBewertungen,  how='inner', on=['Vorname','Nachname'])

In that case, no NULLs shold appear, even if the number of rows will decrease. If that is the case, it will simply mean that there are not many (or few) matches between the tables... table1 holds people ['a','b','c'] while table2 holds ['a','c','d'] for example

Another check is look for NULLs for the keys in in each table:

pdGruppenBachelor[['Vorname','Nachname']].isna().sum()
pdBewertungen[['Vorname','Nachname']].isna().sum()

each NULL case, even partial (only missing Vorname) can affect the JOIN The last thing I would check, is if there aren't any spaces in the names, or upper/ lower cases: "John Smith" vs "joHn sMith "

so a possible solution should look like

pdGruppenBachelor['Vorname'] = pdGruppenBachelor['Vorname'].str.lower().strip()
pdBewertungen['Vorname'] = pdBewertungen['Vorname'].str.lower().strip()

pdGruppenBachelor['Nachname'] = pdGruppenBachelor['Nachname'].str.lower().strip()
pdBewertungen['Nachname'] = pdBewertungen['Nachname'].str.lower().strip()

then try and join

Upvotes: 2

Related Questions