Reputation: 85
I would like to split these addresses into their respective components (street number, street name, city, state and ZIP) so that I can ultimately check which ones are the same. Can anyone help with a basic idea of how to carry this out in R?
Company Address
1. A 1 NE 1 Street Miami,FL 33132
2. B 1 1st Street Miami,FL 33132
3. C 1 NE 1st St Miami,FL 33132
4. D 1 1st Street Miami,FL 33134
5. E 100 Biscayne Blvd. Miami,FL 33132
6. F 100 Biscayne Blvd Miami ,FL 33132
7. G 100 Biscayne Boulevard Suite 604 Miami,FL 33132
8. H 100 Biscayne Blvd. Suite 604 Miami,FL 33132
9. I 100 N. Biscayne Blvd. Miami,FL 33132
Upvotes: 2
Views: 792
Reputation: 15
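You can also use the "stringr" package for this. The function to use is "str_extract", which can pull out the city names by matching against a supplied list of city names.
To extract the street number, you could use "gsub" with a pattern such as "^[[:digit:]]+" (note the "+", so that the whole leading run of digits is matched rather than only the first digit).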
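As a rough illustration of the anchored-pattern idea, here is a minimal base R sketch. It assumes the company label has already been stripped from each line, and the variable names (addr, street_no, zip, state) are only illustrative. It uses regmatches() with regexpr() rather than gsub(), which is a swapped-in but closely related approach.

```r
# Two sample addresses from the question, with the company label removed
# (an assumption made to keep the example short).
addr <- c("1 NE 1 Street Miami,FL 33132",
          "100 Biscayne Blvd. Suite 604 Miami,FL 33132")

# Street number: the run of digits at the very start of the string.
street_no <- regmatches(addr, regexpr("^[[:digit:]]+", addr))

# ZIP code: the run of digits at the very end of the string.
zip <- regmatches(addr, regexpr("[[:digit:]]+$", addr))

# State: the two capital letters that follow the last comma.
state <- sub(".*,\\s*([A-Z]{2})\\s.*", "\\1", addr)
```

Note that regmatches() silently drops elements with no match, so on messier data the results may be shorter than the input.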
Upvotes: 0
Reputation: 269694
Try read.pattern in the gsubfn package. If the lines are in a file, replace text = Lines with a character string giving the file name. This can be fairly fragile, and you may need to adjust the regular expression once you have more data to try it out with.
Lines <- "Company Address
1. A 1 NE 1 Street Miami,FL 33132
2. B 1 1st Street Miami,FL 33132
3. C 1 NE 1st St Miami,FL 33132
4. D 1 1st Street Miami,FL 33134
5. E 100 Biscayne Blvd. Miami,FL 33132
6. F 100 Biscayne Blvd Miami ,FL 33132
7. G 100 Biscayne Boulevard Suite 604 Miami,FL 33132
8. H 100 Biscayne Blvd. Suite 604 Miami,FL 33132
9. I 100 N. Biscayne Blvd. Miami,FL 33132"
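library(gsubfn)
# Each parenthesized group in the pattern becomes one column of DF.
DF <- read.pattern(text = Lines,
                   pattern = "\\S+ \\S+ *(\\d+) (.*) (\\S+) ?,(\\S+) (\\d+)$",
                   skip = 1,      # skip the "Company Address" header line
                   as.is = TRUE,  # keep character columns as character
                   col.names = c("No", "Street", "City", "State", "Zip"))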
giving:
> DF
   No                       Street  City State   Zip
1   1                  NE 1 Street Miami    FL 33132
2   1                   1st Street Miami    FL 33132
3   1                    NE 1st St Miami    FL 33132
4   1                   1st Street Miami    FL 33134
5 100               Biscayne Blvd. Miami    FL 33132
6 100                Biscayne Blvd Miami    FL 33132
7 100 Biscayne Boulevard Suite 604 Miami    FL 33132
8 100     Biscayne Blvd. Suite 604 Miami    FL 33132
9 100            N. Biscayne Blvd. Miami    FL 33132
Here is the regular expression visualized:
\S+ \S+ *(\d+) (.*) (\S+) ?,(\S+) (\d+)$
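Once the street is in its own column, one possible next step toward checking which addresses are the same is to normalize the street strings before comparing them. The sketch below is only an assumption about how that might look: normalize_street is a hypothetical helper, and the two abbreviation rules it applies are examples, not a complete list.

```r
# Hypothetical helper: reduce a street string to a rough canonical form.
normalize_street <- function(x) {
  x <- toupper(x)                                        # ignore case
  x <- gsub(".", "", x, fixed = TRUE)                    # "BLVD." -> "BLVD"
  x <- gsub("\\bBOULEVARD\\b", "BLVD", x, perl = TRUE)   # example rule
  x <- gsub("\\bSTREET\\b", "ST", x, perl = TRUE)        # example rule
  x <- gsub("\\s+", " ", x, perl = TRUE)                 # collapse spaces
  trimws(x)
}

streets <- c("Biscayne Blvd.", "Biscayne Blvd",
             "Biscayne Boulevard Suite 604", "1st Street", "NE 1st St")
normalized <- normalize_street(streets)

# TRUE where a street repeats an earlier one after normalization.
is_dup <- duplicated(normalized)
```

With these rules, "Biscayne Blvd." and "Biscayne Blvd" collapse to the same string, while the suite number and the "NE" prefix still keep the other entries distinct; whether those should count as the same address is a decision the rules would have to encode.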
Upvotes: 5