Reputation: 1453
I have a data set with two animal populations: population "1" with 100 observations and population "2" with 10,000 observations.
Geographic coordinates (x and y values) are available for every observation.
In each population, 70% of the observations are classified and 30% are unclassified.
I now want to use the 70% of the classified observations to train a model that estimates the class (species) of the unclassified 30%.
I submit this request to answer the following questions:
You can see my code for the creation of the example dataset and the visualization below. The black points represent the unclassified observations.
library(tidyverse)
set.seed(123)
near <- 10 #Proximity of the two populations
rate <- 0.7 #Proportion of the classified data set
df <- bind_rows(
  tibble(
    x = rnorm(n = 100, mean = 100, sd = 2),
    y = rnorm(n = 100, mean = 100, sd = 2),
    species = "species_1",
    classification = c(rep("classified", rate*100), rep("unclassified", 100-(rate*100)))
  ),
  tibble(
    x = rnorm(n = 10000, mean = 100, sd = 2) + near,
    y = rnorm(n = 10000, mean = 100, sd = 2) - near,
    species = "species_2",
    classification = c(rep("classified", rate*10000), rep("unclassified", 10000-(rate*10000)))
  )
)
df %>%
  ggplot() +
  geom_point(data = df %>% filter(classification == "classified"), aes(x = x, y = y, color = species)) +
  geom_point(data = df %>% filter(classification == "unclassified"), aes(x = x, y = y), color = "black")
Upvotes: 0
Views: 55
Reputation: 79338
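Split the data into training and test sets. In this example the "unclassified" rows still carry their true species, so the predictions can be checked against it.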
test <- subset(df, classification == 'unclassified')
train <- subset(df, classification == 'classified')
# Naive Bayes (factor() on the response so that predict() returns labelled classes)
mod1 <- e1071::naiveBayes(factor(species)~x+y, train)
table(train$species, predict(mod1, train))
table(test$species, predict(mod1, test))
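# Logistic regression: fitted values are P(species_2), the second factor level,
# so TRUE in the tables below means "predicted species_2"
mod2 <- glm(factor(species)~x+y, binomial(), train)
table(train$species, mod2$fitted.values>0.5)
table(test$species, predict(mod2, test, type = 'response')>0.5)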
# Support vector machine
mod3 <- e1071::svm(factor(species)~x+y, train)
table(train$species, predict(mod3, train))
table(test$species, predict(mod3, test))
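# Linear discriminant analysis (fit on train only, not the full df, so the test rows stay unseen)
mod4 <- MASS::lda(species~x+y, train)
table(train$species, predict(mod4, train)$class)
table(test$species, predict(mod4, test)$class)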
# Random forest (fit on train only, not the full df, so the test rows stay unseen)
mod5 <- randomForest::randomForest(factor(species)~x+y, train)
table(train$species, predict(mod5, train))
table(test$species, predict(mod5, test))
# k-nearest neighbours with k = 5 (no separate model object; knn() predicts directly)
table(test$species, class::knn(train[c('x','y')], test[c('x','y')], train$species, k = 5))
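Because species_1 has only 100 rows against 10,000 for species_2, overall accuracy alone can look good even if the small class is mostly missed. The sketch below shows one way to summarise a confusion table into overall accuracy and per-class recall. It is not part of the original code: `class_metrics` is a hypothetical helper, the data are rebuilt with base R as a stand-in for the question's `df`, and only `glm` is used so that no extra packages are needed.

```r
# Hypothetical helper (an assumption, not from the original answer):
# overall accuracy and per-class recall from true and predicted labels.
class_metrics <- function(truth, pred) {
  lv <- sort(unique(as.character(truth)))
  tab <- table(truth = factor(truth, levels = lv),
               pred  = factor(pred,  levels = lv))
  list(
    table    = tab,
    accuracy = sum(diag(tab)) / sum(tab),
    recall   = diag(tab) / rowSums(tab)  # share of each true class recovered
  )
}

# Stand-in for the question's data, built with base R only.
set.seed(123)
n1 <- 100; n2 <- 10000; near <- 10
df <- data.frame(
  x = c(rnorm(n1, 100, 2), rnorm(n2, 100, 2) + near),
  y = c(rnorm(n1, 100, 2), rnorm(n2, 100, 2) - near),
  species = factor(rep(c("species_1", "species_2"), c(n1, n2)))
)

# First 70% of each population is "classified" (train), the rest is test.
is_train <- c(seq_len(n1) <= round(0.7 * n1), seq_len(n2) <= round(0.7 * n2))
train <- df[is_train, ]
test  <- df[!is_train, ]

# Logistic regression models P(second factor level), here species_2.
# The two groups are almost perfectly separable, so glm() may warn about
# fitted probabilities of 0 or 1; predictions are still usable.
mod  <- suppressWarnings(glm(species ~ x + y, family = binomial(), data = train))
prob <- predict(mod, test, type = "response")
pred <- ifelse(prob > 0.5, levels(train$species)[2], levels(train$species)[1])

res <- class_metrics(test$species, pred)
res$table
res$accuracy
res$recall
```

The same helper can be applied to the predictions of any of the models above, for example `class_metrics(test$species, predict(mod3, test))`, which makes the models easier to compare on the rare class.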
Other methods:
Upvotes: 1