N.F

Reputation: 23

Count words of a string if they occur in a certain position within the string in R

I have a string variable tours in my data frame df that represents the different stops an individual made during a journey.

For example:
1. home_work_leisure_home
2. home_work_shopping_work_home
3. home_work_leisure_errand_home

In transport planning we group activities into primary activities (work and education) and secondary activities (everything else). For each tour, I want to count the number of activities before the first primary activity, in between two primary activities, and after the last primary activity.

This means I am looking for a function in R that:
a. identifies the first work in the string variable,
b. then counts the number of activities before this first work activity
c. then identifies the last work in the string if there is more than one
d. if there is then count the number of activities between the two work activities,
e. then count the number of activities after the last work activity

The result for the three example tours then would be:

  1. number of activities before first primary: 1 (home)
    number of activities between first and last primary: 0
    number of activities after last primary: 2 (leisure & home)
    number of primary activities: 1 (work)
  2. number of activities before first primary: 1 (home)
    number of activities between first and last primary: 1 (shopping)
    number of activities after last primary: 1 (home)
    number of primary activities: 2 (work)
  3. number of activities before first primary: 1 (home)
    number of activities between first and last primary: 0
    number of activities after last primary: 3 (leisure, errand & home)
    number of primary activities: 1 (work)

I would be super thankful if someone could give me a hand with this issue - even if it is a link to a similar question.

Thank you. Kind regards, Nathalie

Upvotes: 3

Views: 51

Answers (1)

Tony Hellmuth

Reputation: 290

This should get you started; you can replace "work" and "education" with anything you want:

> x
[1] "home_work_leisure_home"        "home_work_shopping_work_home"  "home_work_leisure_errand_home"
> x<-strsplit(x,"_")
> x
[[1]]
[1] "home"    "work"    "leisure" "home"   

[[2]]
[1] "home"     "work"     "shopping" "work"     "home"    

[[3]]
[1] "home"    "work"    "leisure" "errand"  "home"   

### x must be the list returned by strsplit(x,"_"), one character vector per tour.
af_last_p<-bet_f_l_p<-be_first_p<-prim_n<-numeric()

for(i in seq_along(x)){
  y<-sort(c(which(x[[i]]=="education"),which(x[[i]]=="work"))) ### Positions of the primary activities in tour i
  prim_n[i]<-length(y) ### Number of primary activities
  be_first_p[i]<-y[1]-1 ### Number of activities before the first primary
  bet_f_l_p[i]<-ifelse(length(y)>1,sum(diff(y))-length(y)+1,0) ### Secondary activities between the first and last primary
  af_last_p[i]<-length(x[[i]])-y[length(y)] ### Number of activities after the last primary
}

> z<-cbind(be_first_p,bet_f_l_p,af_last_p,prim_n)
> z
     be_first_p bet_f_l_p af_last_p prim_n
[1,]          1         0         2      1
[2,]          1         1         1      2
[3,]          1         0         3      1

Hopefully you wanted something simple like this? :) Please let me know if you want any clarifications!
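The same counting can also be wrapped in a function and applied to each tour. This is only a minimal sketch: the helper name count_tour and its primary argument are made up for illustration, and it assumes every tour contains at least one primary activity.

```r
# Hypothetical helper (not from the original answer): counts for one tour string.
# Assumes the tour contains at least one primary activity.
count_tour <- function(tour, primary = c("work", "education")) {
  acts  <- strsplit(tour, "_", fixed = TRUE)[[1]]  # split the tour into activities
  p     <- which(acts %in% primary)                # positions of primary activities
  first <- p[1]
  last  <- p[length(p)]
  c(before  = first - 1,                               # activities before the first primary
    between = sum(!(acts[first:last] %in% primary)),   # secondary activities between first and last primary
    after   = length(acts) - last,                     # activities after the last primary
    primary = length(p))                               # number of primary activities
}

tours <- c("home_work_leisure_home",
           "home_work_shopping_work_home",
           "home_work_leisure_errand_home")

# One row per tour, one column per count
res <- t(sapply(tours, count_tour, USE.NAMES = FALSE))
res
```

With a data frame, something like df$before <- res[, "before"] would attach a column, assuming the function was applied to df$tours.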

######## EDIT 1 ########

I tried it with a list of 10,000 examples and it took about 0.5 seconds, which seems okay. It is O(n) in the number of tours at worst. If a tour contains neither work nor education, you can add this as the second line of the loop, directly after y is computed:

if(length(y)==0){next}

This ensures the code still runs when no primary activity is recorded; the output for those cases will be NA. You can drop those NA rows with:

z<-na.omit(z)
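As an alternative to skipping with next, the NA rows can be filled in explicitly so that every tour keeps its own row. The following is a rough sketch, not the code I timed above: count_tour_safe is a made-up helper name, and the second example tour is invented to show the no-primary case.

```r
# Hypothetical helper: takes one already-split tour (a character vector).
# Returns NA counts (and prim_n = 0) when the tour has no primary activity.
count_tour_safe <- function(acts, primary = c("work", "education")) {
  p <- which(acts %in% primary)            # positions of primary activities
  if (length(p) == 0) {
    return(c(be_first_p = NA, bet_f_l_p = NA, af_last_p = NA, prim_n = 0))
  }
  c(be_first_p = p[1] - 1,                                # before the first primary
    bet_f_l_p  = p[length(p)] - p[1] + 1 - length(p),     # secondary between first and last primary
    af_last_p  = length(acts) - p[length(p)],             # after the last primary
    prim_n     = length(p))                               # number of primary activities
}

# Invented example data; the second tour has no primary activity.
x <- strsplit(c("home_work_leisure_home",
                "home_leisure_home",
                "home_education_shopping_work_home"), "_")

z <- do.call(rbind, lapply(x, count_tour_safe))  # one row per tour, in input order
z

# Keep only the tours that have at least one primary activity
z_complete <- z[complete.cases(z), , drop = FALSE]
```

Because each tour always produces exactly one row, row i of z still corresponds to tour i, which makes it easier to bind the result back onto the original data frame.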

Upvotes: 1
