Reputation: 191
I have two dataframes.
a
= raw dataset with thousands of observations of different weather events. 6000 obs.
STATE EVTYPE
1 AL WINTER STORM
2 AL TORNADO
3 AL TSTM WIND
4 AL TSTM WIND
5 AL TSTM WIND
6 AL HAIL
7 AL HIGH WIND
8 AL TSTM WIND
9 AL TSTM WIND
10 AL TSTM WIND
b
= a dictionary table, which has a standard spelling for most of the weather events. 500 obs.
EVTYPE evmatch
71 DUST DEVEL DUST DEVIL
72 DUST DEVIL DUST DEVIL
73 DUST DEVIL DUST DEVIL
74 DUST STORM DUST STORM
75 EARLY FROST <NA>
76 EARLY RAIN <NA>
both are merged into df_new
by evtype
library(dplyr)
df_new <- left_join(a, b, by = c("EVTYPE"))
STATE EVTYPE evmatch
1 AL WINTER STORM WINTER STORM
2 AL TORNADO TORNADO
3 AL TSTM WIND THUNDERSTORM WIND
4 AL TSTM WIND THUNDERSTORM WIND
5 AL TSTM WIND THUNDERSTORM WIND
6 AL HAIL HAIL
7 AL HIGH WIND HIGH WIND
8 AL TSTM WIND THUNDERSTORM WIND
9 AL TSTM WIND THUNDERSTORM WIND
10 AL TSTM WIND THUNDERSTORM WIND
11 AL HEAVY RAIN HEAVY RAIN
12 AL FLASH FLOOD FLASH FLOOD
13 AL TSTM WIND THUNDERSTORM WIND
14 AL HEAVY RAIN HEAVY RAIN
15 AL TSTM WIND THUNDERSTORM WIND
When I join the two, I end up with a dataset that has more obs than the original.
I.e. a
has 6000 obs and df_new
has 69400.
How can I merge both datasets so that the resulting df is the same of the original dataframe.
The original dataset has multiple typos/spellings for the same weather event, i.e. dust devel
, dust devil
.
In order to run analysis, each observation needs standard spelling. For example, I am unable to run analysis on dust devil
because of the multiple observations of the same event have different spellings.
I've created a "dictionary" that has the official spelling for each event type, i.e. dust devel
, dust devil
are both dust devil
and tstm wind
is thunderstorm wind
I want to add a column onto the original dataset, showing the correct spelling for each weather event.
However, my resulting dataset has extra rows.
Upvotes: 1
Views: 4147
Reputation: 47320
Your b table has duplicates, replace b
by unique(b)
and you should be fine.
df_new <- left_join(a, unique(b))
"Left join" just means all rows from a will be used, even if they don't have matches in b
. If they have several matches in b
, you'll get additional lines in df_new
.
See this for further info:
https://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html#left_joinsuperheroes-publishers
Upvotes: 5