user3200293
user3200293

Reputation: 191

merging two dataframes creates extra rows in R

I have two dataframes.

a = raw dataset with thousands of observations of different weather events. 6000 obs.

   STATE       EVTYPE
1     AL WINTER STORM
2     AL      TORNADO
3     AL    TSTM WIND
4     AL    TSTM WIND
5     AL    TSTM WIND
6     AL         HAIL
7     AL    HIGH WIND
8     AL    TSTM WIND
9     AL    TSTM WIND
10    AL    TSTM WIND

b = a dictionary table, which has a standard spelling for most of the weather events. 500 obs.

    EVTYPE    evmatch
71         DUST DEVEL DUST DEVIL
72         DUST DEVIL DUST DEVIL
73         DUST DEVIL DUST DEVIL
74         DUST STORM DUST STORM
75        EARLY FROST       <NA>
76         EARLY RAIN       <NA>

both are merged into df_new by evtype

library(dplyr)
df_new <- left_join(a, b, by = c("EVTYPE"))
STATE       EVTYPE           evmatch
1     AL WINTER STORM      WINTER STORM
2     AL      TORNADO           TORNADO
3     AL    TSTM WIND THUNDERSTORM WIND
4     AL    TSTM WIND THUNDERSTORM WIND
5     AL    TSTM WIND THUNDERSTORM WIND
6     AL         HAIL              HAIL
7     AL    HIGH WIND         HIGH WIND
8     AL    TSTM WIND THUNDERSTORM WIND
9     AL    TSTM WIND THUNDERSTORM WIND
10    AL    TSTM WIND THUNDERSTORM WIND
11    AL   HEAVY RAIN        HEAVY RAIN
12    AL  FLASH FLOOD       FLASH FLOOD
13    AL    TSTM WIND THUNDERSTORM WIND
14    AL   HEAVY RAIN        HEAVY RAIN
15    AL    TSTM WIND THUNDERSTORM WIND

When I join the two, I end up with a dataset that has more obs than the original.

I.e. a has 6000 obs and df_new has 69400.

How can I merge both datasets so that the resulting df is the same of the original dataframe.

context:

The original dataset has multiple typos/spellings for the same weather event, i.e. dust devel, dust devil.

In order to run analysis, each observation needs standard spelling. For example, I am unable to run analysis on dust devil because of the multiple observations of the same event have different spellings.

I've created a "dictionary" that has the official spelling for each event type, i.e. dust devel, dust devil are both dust devil and tstm wind is thunderstorm wind

I want to add a column onto the original dataset, showing the correct spelling for each weather event.

However, my resulting dataset has extra rows.

Upvotes: 1

Views: 4147

Answers (1)

moodymudskipper
moodymudskipper

Reputation: 47320

Your b table has duplicates, replace b by unique(b) and you should be fine.

df_new <- left_join(a, unique(b))

"Left join" just means all rows from a will be used, even if they don't have matches in b. If they have several matches in b, you'll get additional lines in df_new.

See this for further info:

https://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html#left_joinsuperheroes-publishers

Upvotes: 5

Related Questions