Reputation: 45
I am a beginner data scientist working within an NCAA football program trying to learn R. We record repeated measures data during practice that needs to be organized then analyzed. I have this sample dataset that I am having trouble analyzing due to the variations/typos in the "Activity" column. I need a way to rename/organize the activities in the dataframe.
Raw Dataset looks like this:
Position | Activity | Max_Velocity (mph) |
---|---|---|
WR | Special Team 1 | 20 |
WR | special team 1 | 19 |
WR | Special team 1 | 18 |
WR | special Team 2 | 13 |
WR | ST 2 | 16 |
WR | St 2 | 19 |
WR | team 1 | 12 |
WR | team 1 | 20 |
WR | team 1 | 23 |
WR | team 2 | 21 |
WR | team 2 | 10 |
WR | team 2 | 3 |
WR | team 3 | 21 |
WR | team 3 | 11 |
WR | team 3 | 16 |
WR | Indy 1 | 20 |
WR | indy 2 | 21 |
WR | INDY 3 | 22 |
I need it to look like this:
Position | Activity | Max_Velocity (mph) | Activity_Catagory |
---|---|---|---|
WR | Special Team 1 | 20 | Special Team |
WR | special team 1 | 19 | Special Team |
WR | Special team 1 | 18 | Special Team |
WR | special Team 2 | 13 | Special Team |
WR | ST 2 | 16 | Special Team |
WR | St 2 | 19 | Special Team |
WR | team 1 | 12 | Team |
WR | team 1 | 20 | Team |
WR | team 1 | 23 | Team |
WR | team 2 | 21 | Team |
WR | team 2 | 10 | Team |
WR | team 2 | 3 | Team |
WR | team 3 | 21 | Team |
WR | team 3 | 11 | Team |
WR | team 3 | 16 | Team |
WR | Indy 1 | 20 | Indy |
WR | indy 2 | 21 | Indy |
WR | INDY 3 | 22 | Indy |
This way I can analyze the average of the dependent variable "Max_velocity" during each type of activity. I would rather not make individual lines of code correcting each typo in the "Activity" column. I think this can be done with loops but I am a beginner and don't know where to start. I would greatly appreciate any help to make this a simple fix.
Upvotes: 0
Views: 56
Reputation: 6489
Here is another way to solve your problem
library(stringr)
df$Activity_Catagory <- str_to_title(df$Activity) |>
str_remove(" \\d") |>
str_replace("St", "Special Team")
# Position Activity Max_Velocity (mph) Activity_Catagory
# 1 WR Special Team 1 20 Special Team
# 2 WR special team 1 19 Special Team
# 3 WR Special team 1 18 Special Team
# 4 WR special Team 2 13 Special Team
# 5 WR ST 2 16 Special Team
# 6 WR St 2 19 Special Team
# 7 WR team 1 12 Team
# 8 WR team 1 20 Team
# 9 WR team 1 23 Team
# 10 WR team 2 21 Team
# 11 WR team 2 10 Team
# 12 WR team 2 3 Team
# 13 WR team 3 21 Team
# 14 WR team 3 11 Team
# 15 WR team 3 16 Team
# 16 WR Indy 1 20 Indy
# 17 WR indy 2 21 Indy
# 18 WR INDY 3 22 Indy
Upvotes: 2
Reputation: 388797
You may need to first define the rules based on your data. Which values would be called as 'Special Team', which ones would be called as 'Team' etc.
Based on the example shared we can do case-insensitive pattern matching with grepl
and list the conditions using case_when
-
library(dplyr)
df %>%
mutate(Activity_Category = case_when(grepl('Special Team', Activity, ignore.case = TRUE) |
grepl('ST ', Activity, ignore.case = TRUE) ~ 'Special Team',
grepl('Team', Activity, ignore.case = TRUE) ~ 'Team',
grepl('Indy', Activity, ignore.case = TRUE) ~ 'Indy'))
# Position Activity Max_Velocity..mph. Activity_Category
#1 WR Special Team 1 20 Special Team
#2 WR special team 1 19 Special Team
#3 WR Special team 1 18 Special Team
#4 WR special Team 2 13 Special Team
#5 WR ST 2 16 Special Team
#6 WR St 2 19 Special Team
#7 WR team 1 12 Team
#8 WR team 1 20 Team
#9 WR team 1 23 Team
#10 WR team 2 21 Team
#11 WR team 2 10 Team
#12 WR team 2 3 Team
#13 WR team 3 21 Team
#14 WR team 3 11 Team
#15 WR team 3 16 Team
#16 WR Indy 1 20 Indy
#17 WR indy 2 21 Indy
#18 WR INDY 3 22 Indy
Upvotes: 0