IISsENII
IISsENII

Reputation: 23

Grepl matching between two columns

I have two data sets, one with an address column and another containing names of localities and their corresponding latitudes and longitudes.

Data set for stores:

+--------------------+-----------+--------------------------------------------------+
|     Store name     | Postcodes |                     Address                      |
+--------------------+-----------+--------------------------------------------------+
| Floral showers     |      2000 | Street 45, Level 9, Sydney, New South Wales 2000 |
| Cookie box         |      4300 | Shop 3, Queensland 4300                          |
| Mango troopers     |      2010 | Aberdeen, Bankstown, NSW                         |
| Building AE44      |      4300 | 778/9 Goulburn Street, QLD                       |
| Floral showers Co. |      2230 | Steert 47 Cronulla, New South Wales 2230         |
| Vinci supplies     |      2560 | West AIRDS, Mayfaille NSW                        |
+--------------------+-----------+--------------------------------------------------+

Dataset for lat long info:


+-------------------+-------+-------------+--------------+
|     Locality      | State |  Latitude   |  Longitude   |
+-------------------+-------+-------------+--------------+
| ABERDARE          | NSW   |  151.317476 |   -32.977861 |
| ABERDEEN          | NSW   |  151.102917 |    -32.14622 |
| ACACIA PLATEAU    | NSW   |   152.49765 |    -28.36456 |
| AIRDS             | NSW   |  150.768408 |   -34.194216 |
| ADAMINABY         | NSW   |  148.769744 |   -35.997349 |
| ABERCROMBIE RIVER | NSW   | 149.3476918 | -33.91030648 |
| CRONULLA          | NSW   |  151.136596 |   -34.093213 |
| SYDNEY            | NSW   |  151.268071 |   -33.794883 |
+-------------------+-------+-------------+--------------+

I want to create a new column to get the localities of each store from the address column and populate the lat and long from the other dataset. Since the address is not in a fixed format, I know I've to do a string search. However, I'm not sure how I can compare across two columns.

Here are the two sample dput outputs:

 structure(list(Stores_names = c("Floral showers", "Cookie box", 
"Mango troopers", "Building AE44", "Floral showers Co.", "Vinci supplies"
), Postcodes = c("2000", "4300", "2010", "4300", "2230", "2560"
), Address = c("Street 45, Level 9, Sydney, New South Wales 2000", 
"Shop 3, Queensland 4300", "Aberdeen, Bankstown, NSW", "778/9 Goulburn Street, QLD", 
"Steert 47 Cronulla, New South Wales 2230", "West AIRDS, Mayfaille NSW"
)), class = "data.frame", row.names = c(NA, -6L))

structure(list(Localities = c("ABERDARE", "ABERDEEN", "ACACIA PLATEAU", 
"AIRDS", "ADAMINABY", "ABERCROMBIE RIVER", "CRONULLA", "SYDNEY"
), State = c("NSW", "NSW", "NSW", "NSW", "NSW", "NSW", "NSW", 
"NSW"), lat = c("151.317476", "151.102917", "152.49765", "150.768408", 
"148.769744", "149.3476918", "151.136596", "151.268071"), long = c("-32.977861", 
"-32.14622", "-28.36456", "-34.194216", "-35.997349", "-33.91030648", 
"-34.093213", "-33.794883")), class = "data.frame", row.names = c(NA, 
-8L))

My final dataset should have three new columns: locality, latitude and longitude.


+--------------------+-----------+--------------------------------------------------+----------+------------+------------+
|     Store name     | Postcodes |                     Address                      | Locality |    lat     |    long    |
+--------------------+-----------+--------------------------------------------------+----------+------------+------------+
| Floral showers     |      2000 | Street 45, Level 9, Sydney, New South Wales 2000 | Sydney   | 151.268071 | -33.794883 |
| Cookie box         |      4300 | Shop 3, Queensland 4300                          |          |            |            |
| Mango troopers     |      2010 | Aberdeen, Bankstown, NSW                         | Aberdeen | 151.102917 |  -32.14622 |
| Building AE44      |      4300 | 778/9 Goulburn Street, QLD                       |          |            |            |
| Floral showers Co. |      2230 | Steert 47 Cronulla, New South Wales 2230         | Cronulla | 151.136596 | -34.093213 |
| Vinci supplies     |      2560 | West AIRDS, Mayfaille NSW                        | AIRDS    | 150.768408 | -34.194216 |
+--------------------+-----------+--------------------------------------------------+----------+------------+------------+

The ones that could not be found in the lat long set can remain blank, but I need all the data from the stores dataset.

Any help is appreciated!

Upvotes: 0

Views: 45

Answers (1)

Karthik S
Karthik S

Reputation: 11584

Does this work:

library(stringr)
library(dplyr)
df %>% mutate(city = str_extract(toupper(Address),paste0(df1$Localities, collapse = '|'))) %>% 
left_join(df1, by = c("city"="Localities"), keep = T) %>% select(-c(city,State))
        Stores_names Postcodes                                          Address Localities        lat       long
1     Floral showers      2000 Street 45, Level 9, Sydney, New South Wales 2000     SYDNEY 151.268071 -33.794883
2         Cookie box      4300                          Shop 3, Queensland 4300       <NA>       <NA>       <NA>
3     Mango troopers      2010                         Aberdeen, Bankstown, NSW   ABERDEEN 151.102917  -32.14622
4      Building AE44      4300                       778/9 Goulburn Street, QLD       <NA>       <NA>       <NA>
5 Floral showers Co.      2230         Steert 47 Cronulla, New South Wales 2230   CRONULLA 151.136596 -34.093213
6     Vinci supplies      2560                        West AIRDS, Mayfaille NSW      AIRDS 150.768408 -34.194216
> 

Upvotes: 0

Related Questions