uncool
uncool

Reputation: 2703

dplyr::left_join produce NA values for new joined columns

I have two tables I wish to left_join through the dplyr package. The issue is that is produces NA values for all new columns (the ones I'm after).

As you can see below, the left_join procudes NA values for the new column of Incep.Price and DayCounter. Why does this happen, and how can this be resolved?

Update: Thanks to @akrun, using left_join(Avanza.XML, checkpoint, by = c('Firm' = 'Firm')) solves the issue and the columns are joined correctly.

However the warning message is sitll the same, could someone explain this behaviour? Why one must in this case explicitly specify the join columns, or otherwise produce NA values?

> head(Avanza.XML)
                      Firm Gain.Month.1 Last.Price Vol.Month.1
1     Stockwik Förvaltning       131.25      0.074   131264420
2                 Novestra        37.14      7.200      605330
3       Bactiguard Holding        29.55     14.250     2815572
4              MSC Group B        20.87      3.070      671855
5 NeuroVive Pharmaceutical        18.07      9.800     3280944
6      Shelton Petroleum B        16.21      3.800     2135798

> head(checkpoint)
                      Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1     Stockwik Förvaltning        87.50       0.06    91270090    0.032000 2016-01-25
2                 Novestra        38.10       7.25      604683    5.249819 2016-01-25
3       Bactiguard Holding        29.09      14.20     2784161   11.000077 2016-01-25
4              MSC Group B        27.56       3.24      657699    2.539981 2016-01-25
5      Shelton Petroleum B        19.27       3.90     1985305    3.269892 2016-01-25
6 NeuroVive Pharmaceutical        16.87       9.70     3220303    8.299820 2016-01-25

> head(left_join(Avanza.XML, checkpoint))
Joining by: c("Firm", "Gain.Month.1", "Last.Price", "Vol.Month.1")
                      Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1     Stockwik Förvaltning       131.25      0.074   131264420          NA       <NA>
2                 Novestra        37.14      7.200      605330          NA       <NA>
3       Bactiguard Holding        29.55     14.250     2815572          NA       <NA>
4              MSC Group B        20.87      3.070      671855          NA       <NA>
5 NeuroVive Pharmaceutical        18.07      9.800     3280944          NA       <NA>
6      Shelton Petroleum B        16.21      3.800     2135798          NA       <NA>
Warning message:
In left_join_impl(x, y, by$x, by$y) :
  joining factors with different levels, coercing to character vector

Upvotes: 7

Views: 19536

Answers (1)

akrun
akrun

Reputation: 887118

There are two problems.

  1. Not specifying the by argument in left_join: In this case, by default all the columns are used as the variables to join by. If we look at the columns - "Gain.Month.1", "Last.Price", "Vol.Month.1" - all numeric class and do not have a matching value in each of the datasets. So, it is better to join by "Firm"

    left_join(Avanza.XML, checkpoint, by = "Firm")
    
  2. The "Firm" column class - factor: We get warning when there is difference in the levels of the factor column (if it is the variable that we join by). In order to remove the warning, we can either convert the "Firm" column in both datasets to character class

    Avanza.XML$Firm <- as.character(Avanza.XML$Firm)
    checkpoint$Firm <- as.character(checkpoint$Firm)
    

Or if we still want to keep the columns as factor, then change the levels in the "Firm" to include all the levels in both the datasets

lvls <- sort(unique(c(levels(Avanza.XML$Firm), 
                          levels(checkpoint$Firm))))
Avanza.XML$Firm <- factor(Avanza.XML$Firm, levels=lvls)
checkpoint$Firm <- factor(checkpoint$Firm, levels=lvls)

and then do the left_join.

Upvotes: 7

Related Questions