Reputation: 11
ds <- read.csv(file="~/Desktop/abalone.csv)
Sex Length
M 12.1
F 11.2
12.1 F
I 11.2
12.1 11.2
11.2 I
The dataset contains two columns which are sex and Length. The sex column contains three values which are "F" (Female), "M" (Male) and "I" (infant). The dataset has been modified and has several issues which I need to solve. I need to solve the following problems
My idea to solve the misfielded values is to swap the data between two columns based on the value in Length. let say if the value in Length is "F", "M" and "I", then swap the data with Sex. I really have no idea how to swap the data between these two column based on the value in length.
swap_if <-function (ds$Sex, ds$Length, missing = NA){
a <- ds$Length == "M"
b <- ds$Length == "F"
c <- ds$Length == "I"}
swap_if(ds$Sex, ds$Length)
but this cannot work. can someone help me with this. thank you so much
After that, I need to remove the invalid values. the fifth row first column has an invalid values as the value of Sex must be 'F', 'M' and 'I'.
#remove wrong format /invalid data in sex column
ds <- ds[!str_detect(ds$Sex, '([0-9])'), ]
ds$Sex <- as.factor(ds$Sex)
^ this code can remove the row with invalid / wrong format values.
Upvotes: 1
Views: 233
Reputation: 160407
Despite what I recommended (fixing the source of the problem), here's a method for trying to recover.
column <- grepl("^[.0-9]+$", dat[,1])
column
# [1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
dat2 <- data.frame(Sex = dat[cbind(seq_len(nrow(dat)),1+column)], Length = dat[cbind(seq_len(nrow(dat)),2-column)])
dat2$Length <- as.numeric(dat2$Length)
dat2
# Sex Length
# 1 M 12.2
# 2 F 14.1
# 3 M 14.6
# 4 F 9.0
# 5 L 12.1
# 6 M 10.1
# 7 F 11.0
# 8 M 11.9
A tidyverse option:
library(dplyr)
dat %>%
mutate(
swap = grepl("[^.[:digit:]]", Length),
Length2 = if_else(swap, Sex, Length),
Sex2 = if_else(swap, Length, Sex)
) %>%
transmute(
Sex = Sex2,
Length = as.numeric(Length2)
)
# Sex Length
# 1 M 12.2
# 2 F 14.1
# 3 M 14.6
# 4 F 9.0
# 5 L 12.1
# 6 M 10.1
# 7 F 11.0
# 8 M 11.9
Data:
dat <- structure(list(Sex = c("M", "F", "M", "9", "L", "10.1", "11",
"M"), Length = c("12.2", "14.1", "14.6", "F", "12.1", "M", "F",
"11.9")), class = "data.frame", row.names = c(NA, -8L))
Upvotes: 1
Reputation: 4524
Here is another solution which works with tidyverse
. It creates a little helper column which will be deleted at the end.
library(tidyverse)
# create the inverse of `%in%`
`%!in%` = Negate(`%in%`)
df %>%
# create a helper column, will be deleted
mutate(help = Sex) %>%
mutate(Sex = case_when(
Sex %!in% c('L', 'M', 'F') ~ Length,
TRUE ~ Sex
)) %>%
mutate(Length = case_when(
Length %in% c('L', 'M', 'F') ~ help,
TRUE ~ Length
)) %>%
# delete helper column
select(-help)
#> Sex Length
#> 1 M 12.2
#> 2 F 14.1
#> 3 M 14.6
#> 4 F 9
#> 5 L 12.1
#> 6 M 10.1
#> 7 F 11
#> 8 M 11.9
Upvotes: 1
Reputation: 983
You can set up a for loop with an if-then to check if the character variable can safely be converted to a numeric. If so, it belongs in length
.
sex = c('M','F', 'M', 9, 'L', 10.1, 11, 'M')
length = c(12.2, 14.1, 14.6, 'F', 12.1, 'M', 'F', 11.9)
df = data.frame(sex, length)
str(df)
df$sex.c = 0
df$length.c = 0
for (i in 1:length(df$sex)){
if (!is.na(as.numeric(df$sex[i]))){
df$sex.c[i] = df$length[i]
df$length.c[i] = df$sex[i]
} else {
df$sex.c[i] = df$sex[i]
df$length.c[i] = df$length[i]
}
}
df$sex = df$sex.c
df$length = df$length.c
df = df[,1:2]
Note that this creates a NAs introduced by coercion
warning. If that bothers you then you can suppress them by altering the for loop as follows:
for (i in 1:length(df$sex)){
if (suppressWarnings(!is.na(as.numeric(df$sex[i])))){
df$sex.c[i] = df$length[i]
df$length.c[i] = df$sex[i]
} else {
df$sex.c[i] = df$sex[i]
df$length.c[i] = df$length[i]
}
}
Upvotes: 0