user2395969

Reputation: 151

Why is ddply not working on this data frame?

I have a data frame as follows:

plan     address                                                                 preferred
S3440    5301 E Huron River Dr Rm 1538 Ann Arbor, MI 48106 1-734-712-2492, xxx   Not applicable
S3440    2140 E Ellsworth Rd Ann Arbor, MI 48108 1-734-477-9006, xxx             Not applicable
S3440    2215 Fuller Road Ann Arbor, MI 48105 1-734-761-7933, xxx                Not applicable

and so on, about 27,000 rows in total. The address column contains more text after the phone number; I just omitted it for brevity.

I want to split the address up, basically removing the phone number and everything after it. I've been able to do that through a regular expression:

 str_split(x,'( [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4})')

I want to apply this function to every single row, so I've written a ddply call:

ddply(final_data2, .(address), function(x){str_split(x,'( [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4})')})

However, this spits out the error:

Error: String must be an atomic vector

and I don't know why. Can someone help me fix this?

Thanks

Upvotes: 1

Views: 438

Answers (2)

akrun

Reputation: 887048

Based on the pattern shown, you could try this (without using ddply):

 library(stringr)
 # regex() replaces perl(), which is deprecated in stringr 1.0.0 and later
 str_extract(final_data2$address, regex('.*(?= .-.*)'))
 #[1] "5301 E Huron River Dr Rm 1538 Ann Arbor, MI 48106"
 #[2] "2140 E Ellsworth Rd Ann Arbor, MI 48108"          
 #[3] "2215 Fuller Road Ann Arbor, MI 48105"             

Explanation

 '.*(?= .-.*)' # extract everything before a space that is followed by any one character and then a '-'
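A minimal sketch of how that lookahead behaves, written in base R (`regexpr` with `perl = TRUE`) so it does not depend on the stringr version. The second input row is hypothetical and is there only to show that the loose pattern can overshoot when the text after the phone number contains another "space, character, hyphen" sequence; the stricter pattern anchors on the full phone number instead.

```r
x <- c(
  "2215 Fuller Road Ann Arbor, MI 48105 1-734-761-7933, xxx",
  "1 Main St Ann Arbor, MI 48105 1-734-000-0000, Suite B-2"  # hypothetical row
)

# Greedy match up to the LAST " <char>-" in the string.
loose <- regmatches(x, regexpr(".*(?= .-)", x, perl = TRUE))

# Lazy match up to the FIRST full phone number.
strict <- regmatches(
  x,
  regexpr(".*?(?= [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4})", x, perl = TRUE)
)

print(loose)
print(strict)
```

On the first row both patterns return the same address; on the hypothetical second row only `strict` stops at the phone number, which may matter since the real column has more text after it.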

Using your code:

 simplify2array(str_split(final_data2$address, '( [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4})'))[c(TRUE, FALSE)]

#[1] "5301 E Huron River Dr Rm 1538 Ann Arbor, MI 48106"
#[2] "2140 E Ellsworth Rd Ann Arbor, MI 48108"          
#[3] "2215 Fuller Road Ann Arbor, MI 48105"  

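I don't understand why you want to use ddply with address as the grouping variable. The error occurs because ddply passes each group to your function as a data frame rather than a character vector, so str_split has to be given x$address instead of x. The following seems to work, but ddply is not needed: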

unlist(daply(final_data2, .(address), function(x){str_split(x$address,'( [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4})')}), use.names=FALSE)[c(TRUE, FALSE)]
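To make the cause of "String must be an atomic vector" concrete, here is a small base-R sketch. The three-row `final_data2` is a stand-in built from the rows shown in the question, and `phone_pattern`, `street`, and `address_clean` are illustrative names, not part of the original code. It shows that a data frame is not an atomic vector while a single column is, and then cleans the column in one vectorised step without any grouping.

```r
# Stand-in for final_data2, using the three rows shown in the question.
final_data2 <- data.frame(
  plan = rep("S3440", 3),
  address = c(
    "5301 E Huron River Dr Rm 1538 Ann Arbor, MI 48106 1-734-712-2492, xxx",
    "2140 E Ellsworth Rd Ann Arbor, MI 48108 1-734-477-9006, xxx",
    "2215 Fuller Road Ann Arbor, MI 48105 1-734-761-7933, xxx"
  ),
  preferred = rep("Not applicable", 3),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so it is not atomic.
print(is.atomic(final_data2))
# A single column is an atomic character vector.
print(is.atomic(final_data2$address))

phone_pattern <- " [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4}"

# strsplit() is vectorised over the column; keep the first piece of each split.
street <- vapply(strsplit(final_data2$address, phone_pattern), `[`, character(1), 1)

# Illustrative new column holding the cleaned address.
final_data2$address_clean <- street
print(final_data2$address_clean)
```

Because the split is applied to the whole column at once, no row-wise or group-wise helper is required for this task.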

Upvotes: 1

JeremyS

Reputation: 3525

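An apply works:

# drop = FALSE keeps a one-column data frame; apply() needs a two-dimensional input
apply(final_data2[, 2, drop = FALSE], 1, function(x) str_split(x, ' [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4}')[[1]][1])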

But a gsub is faster:

# the leading space in the pattern avoids leaving a trailing space on the result
gsub(" [0-9]-[0-9]{3}-[0-9]{3}-[0-9]{4}.*", "", final_data2$address)

Upvotes: 0
