Reputation: 25
I am unable to group the existing values of the variable "Text_General_Code" into larger "categories".
I tried processing "Text_General_Code" on its own, but that gave me more than eight categories in my report file.
library(ggplot2)
library(lubridate)
library(zoo)
library(dplyr)
library(knitr)
library(plotly)
# Read csv in R
##
pdx = read.csv("https://cyo.arringtonadventures.com/crime/crime.csv",header = T)
head(pdx)
# Create a variable count with value 1
pdx$Count <- 1
# Convert Date from factor to date
#pdx$Date <- mdy_hms(pdx$Dispatch_Date_Time)
# Extract year from Date
pdx$Year <- substring(pdx$Dispatch_Date,1,4)
# Rename District from Dc_Dist
colnames(pdx)[1] <- "District"
# Drop all variables we are not interested in
#select(pdx, -2,-3,-5,-7,-8,-9,-11,-12,-13,-14)
# Group Text_General_Code by categories
pdx$Category[pdx$Text_General_Code == "THEFT" | pdx$Text_General_Code == "MOTOR VEHICLE THEFT"] <- "Theft"
pdx$Category[pdx$Text_General_Code == "BATTERY"] <- "Battery"
pdx$Category[pdx$Text_General_Code == "CRIMINAL DAMAGE"] <- "Criminal damage"
pdx$Category[pdx$Text_General_Code == "NARCOTICS" | pdx$Text_General_Code == "OTHER NARCOTIC VIOLATION"] <- "Narcotics"
pdx$Category[pdx$Text_General_Code == "ASSAULT"] <- "Assault"
pdx$Category[pdx$Text_General_Code == "BURGLARY"] <- "Burglary"
pdx$Category[pdx$Text_General_Code == "ROBBERY"] <- "Robbery"
pdx$Category[pdx$Text_General_Code == "ARSON" | pdx$Text_General_Code == "CONCEALED CARRY LICENSE VIOLATION" |
pdx$Text_General_Code == "CRIMINAL TRESPASS" | pdx$Text_General_Code == "GAMBLINGS" |
pdx$Text_General_Code == "HUMAN TRAFFICKING" | pdx$Text_General_Code == "INTERFERENCE WITH PUBLIC OFFICER" |
pdx$Text_General_Code == "INTIMIDATION" | pdx$Text_General_Code == "KIDNAPPING" | pdx$Text_General_Code == "LIQUOR LAW VIOLATION" |
pdx$Text_General_Code == "NON-CRIMINAL" | pdx$Text_General_Code == "NON - CRIMINAL" |
pdx$Text_General_Code == "OBSCENITY" | pdx$Text_General_Code == "OFFENSE INVOLVING CHILDREN"|
pdx$Text_General_Code == "PROSTITUTION" | pdx$Text_General_Code == "PUBLIC INDECENCY"|
pdx$Text_General_Code == "PUBLIC PEACE VIOLATION" | pdx$Text_General_Code == "STALKING"|
pdx$Text_General_Code == "WEAPONS VIOLATION"| pdx$Text_General_Code == "HOMICIDE" |
pdx$Text_General_Code == "CRIM SEXUAL ASSAULT" | pdx$Text_General_Code == "SEX OFFENSE" |
pdx$Text_General_Code == "DECEPTIVE PRACTICE" | pdx$Text_General_Code == "OTHER OFFENSE"] <- "Others"
I expect every value of "Text_General_Code" to be mapped into the new variable "Category". I should only get 'Assault', 'Battery', 'Burglary', 'Criminal damage', 'Narcotics', 'Robbery' and 'Theft', and everything else should be grouped into 'Others'. Instead, I am getting NA in the 'Category' variable.
Note: the input dataset has 2.3M records, so it may take a few minutes to run.
Upvotes: 0
Views: 188
Reputation: 633
To start with, in the read.csv statement, add stringsAsFactors = FALSE so that the column doesn't have factor levels when you work with it (from R 4.0 onward this is already the default). It might also help to make sure the Text_General_Code field is all the same case. Note that str_to_sentence comes from the stringr package, so load that as well:
library(stringr)
pdx = read.csv("https://cyo.arringtonadventures.com/crime/crime.csv", header = TRUE, stringsAsFactors = FALSE) %>%
mutate(Text_General_Code = str_to_sentence(Text_General_Code))
Then do a count of the values in Text_General_Code, and maybe output it to an object you can inspect (assuming you're using RStudio):
tgc <- pdx %>%
count(Text_General_Code)
View(tgc)
You'll then see that part of the issue is that several of the values you test for in the section # Group Text_General_Code by categories don't appear in the data as written, so those comparisons never match. One of them, "BATTERY", doesn't appear in any form.
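To make that comparison explicit, here is a minimal sketch using base R's setdiff. The toy data frame and its values are made up for illustration and are not read from the real file; `wanted` just holds a few of the labels from the question.

```r
# Toy stand-in for pdx; these values are illustrative assumptions,
# not the actual contents of crime.csv
toy <- data.frame(
  Text_General_Code = c("Thefts", "Motor vehicle theft",
                        "Burglary residential", "Thefts"),
  stringsAsFactors = FALSE
)

# A few of the labels the grouping code in the question tests for
wanted <- c("THEFT", "MOTOR VEHICLE THEFT", "BATTERY")

# Labels that never occur in the column, so their assignments match no rows
missing_labels <- setdiff(wanted, unique(toy$Text_General_Code))
print(missing_labels)
```

Running the same setdiff against pdx$Text_General_Code should list every label that needs to be respelled or dropped.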
As a grouping strategy, you might want to try using a case_when statement in a dplyr chain:
# The values below are in sentence case to match the str_to_sentence() step
pdx <- pdx %>%
mutate(category = case_when(Text_General_Code == "Thefts" |
Text_General_Code == "Motor vehicle theft" |
Text_General_Code == "Theft from vehicle"
~ "Theft",
Text_General_Code == "Robbery firearm" |
Text_General_Code == "Robbery no firearm"
~ "Robbery"))
...etc, until you've grouped as you want to.
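One way to get the "everything else is Others" behaviour from the question is to end the case_when with a catch-all branch. This is a minimal sketch on a made-up data frame; the code vectors `theft_codes` and `robbery_codes` are hypothetical names, and their contents should be taken from your own count of Text_General_Code.

```r
library(dplyr)

# Toy stand-in for pdx; values are illustrative assumptions
toy <- data.frame(
  Text_General_Code = c("Thefts", "Motor vehicle theft",
                        "Robbery firearm", "Arson", NA),
  stringsAsFactors = FALSE
)

# Hypothetical lookup vectors, one per target category
theft_codes   <- c("Thefts", "Motor vehicle theft", "Theft from vehicle")
robbery_codes <- c("Robbery firearm", "Robbery no firearm")

toy <- toy %>%
  mutate(category = case_when(
    Text_General_Code %in% theft_codes   ~ "Theft",
    Text_General_Code %in% robbery_codes ~ "Robbery",
    # Catch-all: any row not matched above (including NA) becomes "Others"
    TRUE                                 ~ "Others"
  ))

print(toy)
```

Using %in% with a vector per category keeps each branch to one line, and the final TRUE branch means no row is left as NA.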
Then for QC, do a check:
pdx %>%
count(category, Text_General_Code)
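If you leave out a catch-all branch, a further check is to list the codes that ended up with no category. A small sketch, again on made-up data rather than the real file:

```r
library(dplyr)

# Toy result of a grouping step where "Arson" matched no branch;
# values are illustrative assumptions
toy <- data.frame(
  Text_General_Code = c("Thefts", "Arson", "Arson"),
  category = c("Theft", NA, NA),
  stringsAsFactors = FALSE
)

# Codes still uncategorised, most frequent first
unmatched <- toy %>%
  filter(is.na(category)) %>%
  count(Text_General_Code, sort = TRUE)

print(unmatched)
```

An empty result here means every code has been assigned to a category.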
Upvotes: 1