Reputation: 2703
I have two tables I wish to left_join
through the dplyr
package. The issue is that is produces NA
values for all new columns (the ones I'm after).
As you can see below, the left_join
procudes NA
values for the new column of Incep.Price
and DayCounter
. Why does this happen, and how can this be resolved?
Update: Thanks to @akrun, using left_join(Avanza.XML, checkpoint, by = c('Firm' = 'Firm'))
solves the issue and the columns are joined correctly.
However the warning message is sitll the same, could someone explain this behaviour? Why one must in this case explicitly specify the join
columns, or otherwise produce NA
values?
> head(Avanza.XML)
Firm Gain.Month.1 Last.Price Vol.Month.1
1 Stockwik Förvaltning 131.25 0.074 131264420
2 Novestra 37.14 7.200 605330
3 Bactiguard Holding 29.55 14.250 2815572
4 MSC Group B 20.87 3.070 671855
5 NeuroVive Pharmaceutical 18.07 9.800 3280944
6 Shelton Petroleum B 16.21 3.800 2135798
> head(checkpoint)
Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1 Stockwik Förvaltning 87.50 0.06 91270090 0.032000 2016-01-25
2 Novestra 38.10 7.25 604683 5.249819 2016-01-25
3 Bactiguard Holding 29.09 14.20 2784161 11.000077 2016-01-25
4 MSC Group B 27.56 3.24 657699 2.539981 2016-01-25
5 Shelton Petroleum B 19.27 3.90 1985305 3.269892 2016-01-25
6 NeuroVive Pharmaceutical 16.87 9.70 3220303 8.299820 2016-01-25
> head(left_join(Avanza.XML, checkpoint))
Joining by: c("Firm", "Gain.Month.1", "Last.Price", "Vol.Month.1")
Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1 Stockwik Förvaltning 131.25 0.074 131264420 NA <NA>
2 Novestra 37.14 7.200 605330 NA <NA>
3 Bactiguard Holding 29.55 14.250 2815572 NA <NA>
4 MSC Group B 20.87 3.070 671855 NA <NA>
5 NeuroVive Pharmaceutical 18.07 9.800 3280944 NA <NA>
6 Shelton Petroleum B 16.21 3.800 2135798 NA <NA>
Warning message:
In left_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
Upvotes: 7
Views: 19536
Reputation: 887118
There are two problems.
Not specifying the by
argument in left_join
: In this case, by default all the columns are used as the variables to join by. If we look at the columns - "Gain.Month.1", "Last.Price", "Vol.Month.1" - all numeric
class and do not have a matching value in each of the datasets. So, it is better to join by "Firm"
left_join(Avanza.XML, checkpoint, by = "Firm")
The "Firm" column class - factor
: We get warning when there is difference in the levels
of the factor
column (if it is the variable that we join by). In order to remove the warning, we can either convert the "Firm" column in both datasets to character
class
Avanza.XML$Firm <- as.character(Avanza.XML$Firm)
checkpoint$Firm <- as.character(checkpoint$Firm)
Or if we still want to keep the columns as factor
, then change the levels
in the "Firm" to include all the levels
in both the datasets
lvls <- sort(unique(c(levels(Avanza.XML$Firm),
levels(checkpoint$Firm))))
Avanza.XML$Firm <- factor(Avanza.XML$Firm, levels=lvls)
checkpoint$Firm <- factor(checkpoint$Firm, levels=lvls)
and then do the left_join
.
Upvotes: 7