Reputation: 151
I have a data frame as follows:
plan address preferred
S3440 5301 E Huron River Dr Rm 1538 Ann Arbor, MI 48106 1-734-712-2492, xxx Not applicable
S3440 2140 E Ellsworth Rd Ann Arbor, MI 48108 1-734-477-9006, xxx Not applicable
S3440 2215 Fuller Road Ann Arbor, MI 48105 1-734-761-7933, xxx Not applicable
and so on, for about 27,000 rows. The address column contains more text after the phone number; I omitted it for brevity.
I want to split the address up, basically removing the phone number and everything after it. I've been able to do that through a regular expression:
str_split(x,'( [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4})')
I want to apply this to every row, so I've written a ddply call:
ddply(final_data2, .(address), function(x){str_split(x,'( [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4})')})
However, this spits out the error:
Error: String must be an atomic vector
and I don't know why. Can someone help me fix this?
Thanks
Upvotes: 1
Views: 438
Reputation: 887048
Based on the pattern shown, you could try this (without using ddply):
library(stringr)
str_extract(final_data2$address, perl('.*(?= .-.*)'))
#[1] "5301 E Huron River Dr Rm 1538 Ann Arbor, MI 48106"
#[2] "2140 E Ellsworth Rd Ann Arbor, MI 48108"
#[3] "2215 Fuller Road Ann Arbor, MI 48105"
The pattern '.*(?= .-.*)' extracts everything before a space that is followed by one character and then a `-`.
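The same lookahead can also be tried without stringr. This is a minimal sketch in base R, assuming only the three sample rows from the question; the vector name `address` is made up for illustration.

```r
# Sample addresses copied from the question; `address` is a hypothetical name.
address <- c(
  "5301 E Huron River Dr Rm 1538 Ann Arbor, MI 48106 1-734-712-2492, xxx",
  "2140 E Ellsworth Rd Ann Arbor, MI 48108 1-734-477-9006, xxx",
  "2215 Fuller Road Ann Arbor, MI 48105 1-734-761-7933, xxx"
)

# perl = TRUE enables the (?= ...) lookahead in base R regular expressions.
m <- regexpr(".*(?= .-.*)", address, perl = TRUE)

# regmatches() returns the matched text for every element that matched.
street <- regmatches(address, m)
print(street)
```

Note that regmatches() silently drops elements with no match, so on the full data the result could be shorter than the input if some rows have no phone number.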
Using your code:
simplify2array(str_split(final_data2$address, '( [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4})'))[c(T,F)]
#[1] "5301 E Huron River Dr Rm 1538 Ann Arbor, MI 48106"
#[2] "2140 E Ellsworth Rd Ann Arbor, MI 48108"
#[3] "2215 Fuller Road Ann Arbor, MI 48105"
I don't understand why you want to use ddply with address as the grouping variable. The following seems to work, but it is not needed.
unlist(daply(final_data2, .(address), function(x){str_split(x$address,'( [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4})')}),use.names=F)[c(T,F)]
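The original error most likely comes from what ddply hands to the function: each piece is a data frame, not a character vector, which is why the call above uses x$address. Here is a small base R sketch of that difference, using split() as a stand-in for the grouping step; the data frame `df` is made up for illustration.

```r
# Two sample rows from the question; `df` is a hypothetical stand-in
# for final_data2.
df <- data.frame(
  plan = c("S3440", "S3440"),
  address = c(
    "2140 E Ellsworth Rd Ann Arbor, MI 48108 1-734-477-9006, xxx",
    "2215 Fuller Road Ann Arbor, MI 48105 1-734-761-7933, xxx"
  ),
  stringsAsFactors = FALSE
)

# Grouping by address yields one small data frame per distinct address.
pieces <- split(df, df$address)
piece <- pieces[[1]]

# A data frame is a list, so it is not an atomic vector ...
piece_is_atomic <- is.atomic(piece)
# ... but a single character column is.
column_is_atomic <- is.atomic(piece$address)

# Splitting the column (not the whole piece) works as intended.
pattern <- " [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4}"
first_part <- strsplit(piece$address, pattern)[[1]][1]
print(first_part)
```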
Upvotes: 1
Reputation: 3525
An apply works, provided the column is kept as a one-column data frame (drop = FALSE):
apply(final_data2[, 2, drop = FALSE], 1, function(x) str_split(x, '[0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4}')[[1]][1])
But a gsub is faster
gsub("[0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4}.*","",final_data2$address)
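Because the substitution is vectorized, the cleaned address can be stored directly as a new column. This is a sketch under stated assumptions: the data frame below is rebuilt from the sample rows in the question, the column name `street` is made up, and the pattern includes the leading space from the question's regex so that no trailing space is left behind.

```r
# Small reconstruction of the question's data; `street` is a hypothetical
# name for the new column.
final_data2 <- data.frame(
  plan = rep("S3440", 3),
  address = c(
    "5301 E Huron River Dr Rm 1538 Ann Arbor, MI 48106 1-734-712-2492, xxx",
    "2140 E Ellsworth Rd Ann Arbor, MI 48108 1-734-477-9006, xxx",
    "2215 Fuller Road Ann Arbor, MI 48105 1-734-761-7933, xxx"
  ),
  preferred = rep("Not applicable", 3),
  stringsAsFactors = FALSE
)

# Match the space before the phone number, the number itself,
# and everything after it.
phone_and_rest <- " [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4}.*"

# sub() is enough here because each address needs at most one replacement.
final_data2$street <- sub(phone_and_rest, "", final_data2$address)
print(final_data2$street)
```

Rows without a phone number are returned unchanged, so the new column always has the same length as the original.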
Upvotes: 0