Reputation: 1
Last week I failed at an exam because of one task. I would be very glad if someone could identify the mistake in my code and help me make sense of what I could improve.
Specifically, I wonder if the to-be Nominal columns must be defined specifically by calling factor(). Same applies to the other columns - should I specify continuous to be "double", discrete to be "integer", ordinal to be "factor" etc. And if so, what would the date column be, as it is supposed to be discrete... would it be "date" or "int"?
The task was as follows:
clean_data
.Column Name | Criteria |
---|---|
house_id | Nominal. Unique identifier for houses. Missing values not possible. |
city | Nominal. The city in which the house is located. One of 'Silvertown', 'Riverford', 'Teasdale' and 'Poppleton' Replace missing values with "Unknown". |
sale_price | Discrete. The sale price of the house in whole dollars. Values can be any positive number greater than or equal to zero.Remove missing entries. |
sale_date | Discrete. The date of the last sale of the house. Replace missing values with 2023-01-01. |
months_listed | Continuous. The number of months the house was listed on the market prior to its last sale, rounded to one decimal place. Replace missing values with mean number of months listed, to one decimal place. |
bedrooms | Discrete. The number of bedrooms in the house. Any positive values greater than or equal to zero. Replace missing values with the mean number of bedrooms, rounded to the nearest integer. |
house_type | Ordinal. One of "Terraced", "Semi-detached", or "Detached". Replace missing values with the most common house type. |
area | Continuous. The area of the house in square meters, rounded to one decimal place. Replace missing values with the mean, to one decimal place. |
I am very much looking forward to your insights! Note. This is a piece of code I have drafted after the exam, it might even be correct, since I haven't submitted it in this exact form.
Background info:
> str(house_sales)
'data.frame': 1500 obs. of 8 variables:
$ house_id : int 1217792 1900913 1174927 1773666 1258487 1539789 1353069 1155864 1608726 1358764 ...
$ city : chr "Silvertown" "Silvertown" "Riverford" "Silvertown" ...
$ sale_price : int 55943 384677 281707 373251 328885 47143 429914 284440 87134 211203 ...
$ sale_date : chr "2021-09-12" "2021-01-17" "2021-11-10" "2020-04-13" ...
$ months_listed: num 5.4 6.3 6.9 6.1 8.7 5.1 NA 5.8 4.4 3 ...
$ bedrooms : int 2 5 6 6 5 2 6 4 2 4 ...
$ house_type : chr "Semi-detached" "Detached" "Detached" "Det." ...
$ area : chr "107.8 sq.m." "498.8 sq.m." "542.5 sq.m." "528.4 sq.m." ...
> str(clean_data)
'data.frame': 1500 obs. of 8 variables:
$ house_id : int 1217792 1900913 1174927 1773666 1258487 1539789 1353069 1155864 1608726 1358764 ...
$ city : chr "Silvertown" "Silvertown" "Riverford" "Silvertown" ...
$ sale_price : num 55943 384677 281707 373251 328885 ...
$ sale_date : chr "2021-09-12" "2021-01-17" "2021-11-10" "2020-04-13" ...
$ months_listed: num 5.4 6.3 6.9 6.1 8.7 5.1 5.9 5.8 4.4 3 ...
$ bedrooms : num 2 5 6 6 5 2 6 4 2 4 ...
$ house_type : Factor w/ 3 levels "Terraced","Semi-detached",..: 2 3 3 3 3 2 3 3 2 3 ...
$ area : num 108 499 542 528 477 ...
My code:
library(dplyr)
library(stringr)
house_sales <- read.csv("Downloads/house_sales.csv")
levels(as.factor(house_sales$house_type))
clean_data <- house_sales %>%
# house_id | Nominal. </br> Unique identifier for houses. </br>Missing values not possible.
# Remove duplicate rows based on house_id
distinct(house_id, .keep_all = TRUE) %>%
# city | Nominal. </br>The city in which the house is located. One of 'Silvertown', 'Riverford', 'Teasdale' and 'Poppleton' </br>Replace missing values with "Unknown".
# Replace missing values in city with "Unknown"
mutate(city = ifelse(city == "--", "Unknown", city)) %>%
# sale_price | Discrete. </br>The sale price of the house in whole dollars. Values can be any positive number greater than or equal to zero.</br>Remove missing entries.
# Replace missing values in sale_price with NA, remove rows with missing sale_price
mutate(sale_price = ifelse(sale_price >= 0, round(sale_price,0), NA)) %>%
filter(!is.na(sale_price)) %>%
# sale_date | Discrete. </br>The date of the last sale of the house. </br>Replace missing values with 2023-01-01.
# Replace missing values in sale_date with "2023-01-01"
mutate(sale_date = ifelse(is.na(sale_date), "2023-01-01", sale_date)) %>%
# months_listed | Continuous. </br>The number of months the house was listed on the market prior to its last sale, rounded to one decimal place. </br>Replace missing values with mean number of months listed, to one decimal place.
# Replace missing values in months_listed with the mean (rounded to one decimal place)
mutate(months_listed = round(ifelse(is.na(months_listed), mean(months_listed, na.rm = TRUE), months_listed),1)) %>%
# bedrooms | Discrete. </br>The number of bedrooms in the house. Any positive values greater than or equal to zero. </br>Replace missing values with the mean number of bedrooms, rounded to the nearest integer.
# Replace missing values in bedrooms with the mean (rounded to the nearest integer)
mutate(bedrooms = round(ifelse(is.na(bedrooms), mean(bedrooms, na.rm = TRUE), bedrooms))) %>%
# house_type | Ordinal. </br>One of "Terraced", "Semi-detached", or "Detached". </br>Replace missing values with the most common house type.
# Convert variant names to standard names
mutate(house_type = str_replace(house_type, "^Det\\.$", "Detached")) %>%
mutate(house_type = str_replace(house_type, "^Semi$", "Semi-detached")) %>%
mutate(house_type = str_replace(house_type, "^Terr\\.$", "Terraced")) %>%
# Convert to factor
mutate(house_type = factor(house_type, levels = c("Terraced", "Semi-detached", "Detached"))) %>%
# area | Continuous. </br>The area of the house in square meters, rounded to one decimal place. </br>Replace missing values with the mean, to one decimal place.
# Replace missing values in area with the mean (rounded to one decimal place)
mutate(area = round(as.numeric(gsub(" sq.m.", "", area)),1)) %>%
mutate(area = ifelse(is.na(area), round(mean(area, na.rm = TRUE),1), area))
Upvotes: 0
Views: 190
Reputation: 620
The most obvious issue at a glance is that your one-line code is unacceptablely long.
I don't know your marking criteria for the exam, so I can only comment with respect general coding conventions.
You should know that mutate()
can handle multiple columns, right? So, clearly you don't need 10 mutate()
. Also, for 'sale_date ', you can convert it to date using as.Date()
, and modify 'house_type' using case_when()
.
Upvotes: 0