Reputation: 23
I have two data sets, one with an address column and another containing names of localities and their corresponding latitudes and longitudes.
Data set for stores:
+--------------------+-----------+--------------------------------------------------+
| Store name | Postcodes | Address |
+--------------------+-----------+--------------------------------------------------+
| Floral showers | 2000 | Street 45, Level 9, Sydney, New South Wales 2000 |
| Cookie box | 4300 | Shop 3, Queensland 4300 |
| Mango troopers | 2010 | Aberdeen, Bankstown, NSW |
| Building AE44 | 4300 | 778/9 Goulburn Street, QLD |
| Floral showers Co. | 2230 | Steert 47 Cronulla, New South Wales 2230 |
| Vinci supplies | 2560 | West AIRDS, Mayfaille NSW |
+--------------------+-----------+--------------------------------------------------+
Dataset for lat long info:
+-------------------+-------+-------------+--------------+
| Locality | State | Latitude | Longitude |
+-------------------+-------+-------------+--------------+
| ABERDARE | NSW | 151.317476 | -32.977861 |
| ABERDEEN | NSW | 151.102917 | -32.14622 |
| ACACIA PLATEAU | NSW | 152.49765 | -28.36456 |
| AIRDS | NSW | 150.768408 | -34.194216 |
| ADAMINABY | NSW | 148.769744 | -35.997349 |
| ABERCROMBIE RIVER | NSW | 149.3476918 | -33.91030648 |
| CRONULLA | NSW | 151.136596 | -34.093213 |
| SYDNEY | NSW | 151.268071 | -33.794883 |
+-------------------+-------+-------------+--------------+
I want to create a new column to get the localities of each store from the address column and populate the lat and long from the other dataset. Since the address is not in a fixed format, I know I've to do a string search. However, I'm not sure how I can compare across two columns.
Here are the two sample dput outputs:
structure(list(Stores_names = c("Floral showers", "Cookie box",
"Mango troopers", "Building AE44", "Floral showers Co.", "Vinci supplies"
), Postcodes = c("2000", "4300", "2010", "4300", "2230", "2560"
), Address = c("Street 45, Level 9, Sydney, New South Wales 2000",
"Shop 3, Queensland 4300", "Aberdeen, Bankstown, NSW", "778/9 Goulburn Street, QLD",
"Steert 47 Cronulla, New South Wales 2230", "West AIRDS, Mayfaille NSW"
)), class = "data.frame", row.names = c(NA, -6L))
structure(list(Localities = c("ABERDARE", "ABERDEEN", "ACACIA PLATEAU",
"AIRDS", "ADAMINABY", "ABERCROMBIE RIVER", "CRONULLA", "SYDNEY"
), State = c("NSW", "NSW", "NSW", "NSW", "NSW", "NSW", "NSW",
"NSW"), lat = c("151.317476", "151.102917", "152.49765", "150.768408",
"148.769744", "149.3476918", "151.136596", "151.268071"), long = c("-32.977861",
"-32.14622", "-28.36456", "-34.194216", "-35.997349", "-33.91030648",
"-34.093213", "-33.794883")), class = "data.frame", row.names = c(NA,
-8L))
My final dataset should have three new columns: locality, latitude and longitude.
+--------------------+-----------+--------------------------------------------------+----------+------------+------------+
| Store name | Postcodes | Address | Locality | lat | long |
+--------------------+-----------+--------------------------------------------------+----------+------------+------------+
| Floral showers | 2000 | Street 45, Level 9, Sydney, New South Wales 2000 | Sydney | 151.268071 | -33.794883 |
| Cookie box | 4300 | Shop 3, Queensland 4300 | | | |
| Mango troopers | 2010 | Aberdeen, Bankstown, NSW | Aberdeen | 151.102917 | -32.14622 |
| Building AE44 | 4300 | 778/9 Goulburn Street, QLD | | | |
| Floral showers Co. | 2230 | Steert 47 Cronulla, New South Wales 2230 | Cronulla | 151.136596 | -34.093213 |
| Vinci supplies | 2560 | West AIRDS, Mayfaille NSW | AIRDS | 150.768408 | -34.194216 |
+--------------------+-----------+--------------------------------------------------+----------+------------+------------+
The ones that could not be found in the lat long set can remain blank, but I need all the data from the stores dataset.
Any help is appreciated!
Upvotes: 0
Views: 45
Reputation: 11584
Does this work:
library(stringr)
library(dplyr)
df %>% mutate(city = str_extract(toupper(Address),paste0(df1$Localities, collapse = '|'))) %>%
left_join(df1, by = c("city"="Localities"), keep = T) %>% select(-c(city,State))
Stores_names Postcodes Address Localities lat long
1 Floral showers 2000 Street 45, Level 9, Sydney, New South Wales 2000 SYDNEY 151.268071 -33.794883
2 Cookie box 4300 Shop 3, Queensland 4300 <NA> <NA> <NA>
3 Mango troopers 2010 Aberdeen, Bankstown, NSW ABERDEEN 151.102917 -32.14622
4 Building AE44 4300 778/9 Goulburn Street, QLD <NA> <NA> <NA>
5 Floral showers Co. 2230 Steert 47 Cronulla, New South Wales 2230 CRONULLA 151.136596 -34.093213
6 Vinci supplies 2560 West AIRDS, Mayfaille NSW AIRDS 150.768408 -34.194216
>
Upvotes: 0