Pat Stroh

Reputation: 199

R -- Repeating a substring task across data frame columns

Very simple problem in SAS, not so clear to me in R (beginner).

ID <- c('001','002','003')
name1 <- c('ZZ: John','YY: Pete','UU: Judy')
name2 <- c('55: Smith','78: Philips','99: Cortes')
name3 <- c('BB: Jr.','CC: Mr.','56: Dr.')
customer.data <- data.frame(ID, name1, name2, name3)

I want to delete the first 4 characters (including the space) from each variable so that the output looks like this:

ID  name1 name2 name3
001 John Smith Jr.
002 Pete Philips Mr.
003 Judy Cortes Dr.

I need to do this over a long list of variables (not just 3, as in my example): apply the same substring function to each one, then write the results back to the data frame as shown.

I could accomplish this easily in SAS (my legacy tool, which I am trying to move away from):

ARRAY FIRSTSTUFF (3) name1 name2 name3;
ARRAY OUTPUTSTUFF (3) name1 name2 name3;
do i=1 to 3;
FORMAT OUTPUTSTUFF(i) $10.;
OUTPUTSTUFF(i)=substring(FIRSTSTUFF(i),5,10);
end;

I am baffled by the R approach to this. Any help is appreciated.

Upvotes: 0

Views: 562

Answers (1)

akrun

Reputation: 887981

We loop through the columns of 'customer.data' except the first one (customer.data[,-1]) using lapply, extract the substring from 5th character to the last character of the string using substr, and assign the output back to the corresponding columns of the dataset.

 customer.data[,-1] <- lapply(customer.data[,-1],
              function(x) substr(x,5,nchar(as.character(x))))

 customer.data
 #  ID name1   name2 name3
 #1 001  John   Smith   Jr.
 #2 002  Pete Philips   Mr.
 #3 003  Judy  Cortes   Dr.
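If the columns to trim are not simply "everything except the first", the same idea can be applied to columns picked by name. This is only a minimal sketch: it assumes the target columns all start with "name", and the `cols` variable and the `"^name"` pattern are illustrative rather than taken from the question.

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps the columns as character
customer.data <- data.frame(
  ID    = c('001', '002', '003'),
  name1 = c('ZZ: John', 'YY: Pete', 'UU: Judy'),
  name2 = c('55: Smith', '78: Philips', '99: Cortes'),
  name3 = c('BB: Jr.', 'CC: Mr.', '56: Dr.'),
  stringsAsFactors = FALSE
)

# Hypothetical selection rule: every column whose name starts with "name"
cols <- grep("^name", names(customer.data), value = TRUE)

# Drop the first four characters of each selected column
customer.data[cols] <- lapply(customer.data[cols], function(x) {
  x <- as.character(x)      # also covers factor columns
  substr(x, 5, nchar(x))
})
```

Selecting by name means new columns added in front of or behind the block do not silently change which variables get trimmed.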

Alternatively, if you use substring instead of substr, you don't have to specify the stop position, because it defaults to the end of the string (as @Richard Scriven showed in the comments):

 customer.data[,-1] <- lapply(customer.data[-1], substring, 5)

Or you could use gsub to match everything from the beginning of the string (.* matches zero or more characters) up to the : followed by one or more spaces ( +), and replace it with '' in each column looped over by lapply.

 customer.data[,-1] <- lapply(customer.data[,-1], function(x)
                                          gsub(".*: +", "", x))
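One thing to keep in mind with the pattern above: .* is greedy, so a value that contains ": " again later in the string would lose everything up to the last occurrence. The sketch below shows a stricter variant, assuming the prefix is always exactly two characters followed by a colon and spaces. The third test value is made up for illustration and does not come from the question.

```r
# Two values from the question plus one hypothetical value with a second ": "
x <- c('ZZ: John', '55: Smith', 'AB: Note: urgent')

# Greedy pattern: removes everything up to the LAST ": "
greedy <- gsub(".*: +", "", x)
greedy
# [1] "John"   "Smith"  "urgent"

# Anchored, fixed-width pattern: removes only the leading two-character prefix
anchored <- sub("^.{2}: +", "", x)
anchored
# [1] "John"         "Smith"        "Note: urgent"
```

For the data in the question both patterns give the same result. The anchored form only matters if the remaining text can itself contain a colon followed by a space.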

Upvotes: 2
