I am a beginner data scientist working within an NCAA football program trying to learn R. We record repeated measures data during practice that needs to be organized then analyzed. I have this sample dataset that I am having trouble analyzing due to the variations/typos in the "Activity" column. I need a way to rename/organize the activities in the dataframe. Raw Dataset looks like this: Position Activity Max_Velocity (mph) WR Special Team 1 20 WR special team 1 19 WR Special team 1 18 WR special Team 2 13 WR ST 2 16 WR St 2 19 WR team 1 12 WR team 1 20 WR team 1 23 WR team 2 21 WR team 2 10 WR team 2 3 WR team 3 21 WR team 3 11 WR team 3 16 WR Indy 1 20 WR indy 2 21 WR INDY 3 22 I need it to look like this: Position Activity Max_Velocity (mph) Activity_Catagory WR Special Team 1 20 Special Team WR special team 1 19 Special Team WR Special team 1 18 Special Team WR special Team 2 13 Special Team WR ST 2 16 Special Team WR St 2 19 Special Team WR team 1 12 Team WR team 1 20 Team WR team 1 23 Team WR team 2 21 Team WR team 2 10 Team WR team 2 3 Team WR team 3 21 Team WR team 3 11 Team WR team 3 16 Team WR Indy 1 20 Indy WR indy 2 21 Indy WR INDY 3 22 Indy This way I can analyze the average of the dependent variable "Max_velocity" during each type of activity. I would rather not make individual lines of code correcting each typo in the "Activity" column. I think this can be done with loops but I am a beginner and don't know where to start. I would greatly appreciate any help to make this a simple fix.

Reputation: 45

R data frame organization

I am a beginner data scientist working within an NCAA football program trying to learn R. We record repeated measures data during practice that needs to be organized then analyzed. I have this sample dataset that I am having trouble analyzing due to the variations/typos in the "Activity" column. I need a way to rename/organize the activities in the dataframe.

Raw Dataset looks like this:

Position	Activity	Max_Velocity (mph)
WR	Special Team 1	20
WR	special team 1	19
WR	Special team 1	18
WR	special Team 2	13
WR	ST 2	16
WR	St 2	19
WR	team 1	12
WR	team 1	20
WR	team 1	23
WR	team 2	21
WR	team 2	10
WR	team 2	3
WR	team 3	21
WR	team 3	11
WR	team 3	16
WR	Indy 1	20
WR	indy 2	21
WR	INDY 3	22

I need it to look like this:

Position	Activity	Max_Velocity (mph)	Activity_Catagory
WR	Special Team 1	20	Special Team
WR	special team 1	19	Special Team
WR	Special team 1	18	Special Team
WR	special Team 2	13	Special Team
WR	ST 2	16	Special Team
WR	St 2	19	Special Team
WR	team 1	12	Team
WR	team 1	20	Team
WR	team 1	23	Team
WR	team 2	21	Team
WR	team 2	10	Team
WR	team 2	3	Team
WR	team 3	21	Team
WR	team 3	11	Team
WR	team 3	16	Team
WR	Indy 1	20	Indy
WR	indy 2	21	Indy
WR	INDY 3	22	Indy

This way I can analyze the average of the dependent variable "Max_velocity" during each type of activity. I would rather not make individual lines of code correcting each typo in the "Activity" column. I think this can be done with loops but I am a beginner and don't know where to start. I would greatly appreciate any help to make this a simple fix.

Upvotes: 0

Answers (2)

B. Christian Kamgang

Reputation: 6529

Here is another way to solve your problem

library(stringr)

df$Activity_Catagory <- str_to_title(df$Activity) |> 
                          str_remove(" \\d") |> 
                          str_replace("St", "Special Team")



#    Position       Activity Max_Velocity (mph) Activity_Catagory
# 1        WR Special Team 1                 20      Special Team
# 2        WR special team 1                 19      Special Team
# 3        WR Special team 1                 18      Special Team
# 4        WR special Team 2                 13      Special Team
# 5        WR           ST 2                 16      Special Team
# 6        WR           St 2                 19      Special Team
# 7        WR         team 1                 12              Team
# 8        WR         team 1                 20              Team
# 9        WR         team 1                 23              Team
# 10       WR         team 2                 21              Team
# 11       WR         team 2                 10              Team
# 12       WR         team 2                  3              Team
# 13       WR         team 3                 21              Team
# 14       WR         team 3                 11              Team
# 15       WR         team 3                 16              Team
# 16       WR         Indy 1                 20              Indy
# 17       WR         indy 2                 21              Indy
# 18       WR         INDY 3                 22              Indy

Upvotes: 2

Ronak Shah

Reputation: 389235

You may need to first define the rules based on your data. Which values would be called as 'Special Team', which ones would be called as 'Team' etc.

Based on the example shared we can do case-insensitive pattern matching with grepl and list the conditions using case_when -

library(dplyr)

df %>%
  mutate(Activity_Category = case_when(grepl('Special Team', Activity, ignore.case = TRUE) | 
                                       grepl('ST ', Activity, ignore.case = TRUE) ~ 'Special Team', 
                                       grepl('Team', Activity, ignore.case = TRUE) ~ 'Team', 
                                       grepl('Indy', Activity, ignore.case = TRUE) ~ 'Indy'))

#   Position       Activity Max_Velocity..mph. Activity_Category
#1        WR Special Team 1                 20      Special Team
#2        WR special team 1                 19      Special Team
#3        WR Special team 1                 18      Special Team
#4        WR special Team 2                 13      Special Team
#5        WR           ST 2                 16      Special Team
#6        WR           St 2                 19      Special Team
#7        WR         team 1                 12              Team
#8        WR         team 1                 20              Team
#9        WR         team 1                 23              Team
#10       WR         team 2                 21              Team
#11       WR         team 2                 10              Team
#12       WR         team 2                  3              Team
#13       WR         team 3                 21              Team
#14       WR         team 3                 11              Team
#15       WR         team 3                 16              Team
#16       WR         Indy 1                 20              Indy
#17       WR         indy 2                 21              Indy
#18       WR         INDY 3                 22              Indy

Upvotes: 0

R data frame organization

Answers (2)

Related Questions