Reputation: 85

How to split addresses (with uneven format) into various fields in R

I would like to split these addresses into their respective fields (street number, street name, city, state, and ZIP) so that I can ultimately check which addresses are the same. Can anyone suggest a basic approach for doing this in R?

    Company                          Address

 1. A                    1 NE 1 Street Miami,FL 33132
 2. B                     1 1st Street Miami,FL 33132
 3. C                      1 NE 1st St Miami,FL 33132
 4. D                     1 1st Street Miami,FL 33134
 5. E               100 Biscayne Blvd. Miami,FL 33132
 6. F               100 Biscayne Blvd Miami ,FL 33132
 7. G 100 Biscayne Boulevard Suite 604 Miami,FL 33132
 8. H     100 Biscayne Blvd. Suite 604 Miami,FL 33132
 9. I            100 N. Biscayne Blvd. Miami,FL 33132

Upvotes: 2

Answers (2)

user12864379

Reputation: 15

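You can also use the stringr package for this. The function to use is str_extract, which can pull out the city names by matching each address against a given database of city names.

To extract the street number, you could use gsub with a pattern built on "^[[:digit:]]+". The + matters: without it, only the first digit of the number is matched.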
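
As a rough sketch of that idea (not a tested solution for the full data set), the snippet below pulls the street number with gsub and the city with str_extract. The cities vector is a hypothetical stand-in for whatever database of city names you have, and the three sample addresses are copied from the question.

```r
library(stringr)

# Sample addresses taken from the question
addr <- c("1 NE 1 Street Miami,FL 33132",
          "100 Biscayne Blvd Miami ,FL 33132",
          "100 N. Biscayne Blvd. Miami,FL 33132")

# Hypothetical lookup list standing in for a real city database
cities <- c("Miami", "Orlando", "Tampa")

# Street number: keep only the leading run of digits
street_no <- gsub("^([[:digit:]]+).*$", "\\1", addr)

# City: first match of any known city name in each address
city <- str_extract(addr, paste(cities, collapse = "|"))

data.frame(street_no, city, stringsAsFactors = FALSE)
```

Note that str_extract returns NA for any address whose city is not in the lookup list, so the result is only as good as that list.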

Upvotes: 0

G. Grothendieck

Reputation: 269694

Try read.pattern in the gsubfn package. If the lines are in a file, replace text = Lines with a character string giving the file name. This approach can be fairly fragile, and you may need to adjust the regular expression once you have more data to try it on.

Lines <- "Company                          Address
 1. A                    1 NE 1 Street Miami,FL 33132
 2. B                     1 1st Street Miami,FL 33132
 3. C                      1 NE 1st St Miami,FL 33132
 4. D                     1 1st Street Miami,FL 33134
 5. E               100 Biscayne Blvd. Miami,FL 33132
 6. F               100 Biscayne Blvd Miami ,FL 33132
 7. G 100 Biscayne Boulevard Suite 604 Miami,FL 33132
 8. H     100 Biscayne Blvd. Suite 604 Miami,FL 33132
 9. I            100 N. Biscayne Blvd. Miami,FL 33132"

library(gsubfn)
DF <- read.pattern(text = Lines, 
  pattern = "\\S+ \\S+ *(\\d+) (.*) (\\S+) ?,(\\S+) (\\d+)$",
  skip = 1, 
  as.is = TRUE,
  col.names = c("No", "Street", "City", "State", "Zip"))

giving:

> DF
   No                       Street  City State   Zip
1   1                  NE 1 Street Miami    FL 33132
2   1                   1st Street Miami    FL 33132
3   1                    NE 1st St Miami    FL 33132
4   1                   1st Street Miami    FL 33134
5 100               Biscayne Blvd. Miami    FL 33132
6 100                Biscayne Blvd Miami    FL 33132
7 100 Biscayne Boulevard Suite 604 Miami    FL 33132
8 100     Biscayne Blvd. Suite 604 Miami    FL 33132
9 100            N. Biscayne Blvd. Miami    FL 33132

Here is the regular expression as the regex engine sees it (without the doubled backslashes needed inside an R string):

\S+ \S+ *(\d+) (.*) (\S+) ?,(\S+) (\d+)$
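
To see how the five capture groups in this pattern map onto the fields, here is a minimal base R sketch that applies the same regular expression with regexec and regmatches instead of read.pattern. It is an illustration under the assumption that every line matches, not a replacement for the code above. The names addr, fields and DF2 are made up for this example, and the three input lines are copied from the question.

```r
# Three raw lines from the question, including the row number and company
addr <- c(" 1. A                    1 NE 1 Street Miami,FL 33132",
          " 6. F               100 Biscayne Blvd Miami ,FL 33132",
          " 7. G 100 Biscayne Boulevard Suite 604 Miami,FL 33132")

# Same pattern as in the read.pattern call
pattern <- "\\S+ \\S+ *(\\d+) (.*) (\\S+) ?,(\\S+) (\\d+)$"

# Each list element holds the full match followed by the five groups
m <- regmatches(addr, regexec(pattern, addr, perl = TRUE))

# Stack into a matrix and drop the full-match column.
# Caution: lines that fail to match yield character(0) and are
# silently dropped by rbind, so check nrow() against length(addr).
fields <- do.call(rbind, m)[, -1, drop = FALSE]

DF2 <- as.data.frame(fields, stringsAsFactors = FALSE)
names(DF2) <- c("No", "Street", "City", "State", "Zip")
DF2
```

Because (.*) is greedy, everything between the street number and the last word before the comma ends up in Street, which is why the suite number stays attached to the street name in row 7 of the output above.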

Upvotes: 5
