TobKel

Reputation: 1453

Classify population based on geographic distribution with machine learning

I have a dataset of an animal population "1" with 100 observations and an animal population "2" with 10,000 observations. For every observation the geographic coordinates are available, in my case the x and y values. 70% of each dataset is classified and 30% is unclassified. I now want to use the 70% classified observations to train a model that predicts the class (species) of the unclassified 30%.
I am posting this to ask the following questions:

  1. I would use supervised machine-learning classification, perhaps support vector machines (SVM) or random forest (RF). Or does anyone suggest a completely different method, maybe clustering?
  2. Another important consideration: what else should I keep in mind in the analysis? What could be problematic, for example the very different sizes of the two datasets?

You can see my code for the creation of the example dataset and the visualization below. The black points represent the unclassified observations.

library(tidyverse)

set.seed(123)

near <- 10   #Proximity of the two populations
rate <- 0.7  #Proportion of the classified data set

df <- bind_rows(
  tibble(
    x = rnorm(n = 100, mean = 100, sd = 2),
    y = rnorm(n = 100, mean = 100, sd = 2),
    species = "species_1",
    classification = c(rep("classified", rate*100), rep("unclassified", 100-(rate*100)))
  ),
  tibble(
    x = rnorm(n = 10000, mean = 100, sd = 2) + near,
    y = rnorm(n = 10000, mean = 100, sd = 2) - near,
    species = "species_2",
    classification = c(rep("classified", rate*10000), rep("unclassified", 10000-(rate*10000)))
  )
)

ggplot() +
  geom_point(data = filter(df, classification == "classified"),
             aes(x = x, y = y, color = species)) +
  geom_point(data = filter(df, classification == "unclassified"),
             aes(x = x, y = y), color = "black")

[Scatter plot of the two simulated populations; the black points are the unclassified observations.]

Upvotes: 0

Views: 55

Answers (1)

Onyambu

Reputation: 79338

Set up the train and test datasets. (Only because the data are simulated do the "unclassified" rows still carry their true species; that is what makes the test-set confusion tables below possible.)

test <- subset(df, classification == 'unclassified')
train <- subset(df, classification == 'classified')

1. Naive Bayes

mod1 <- e1071::naiveBayes(species ~ x + y, train)
table(train$species, predict(mod1, train))
table(test$species, predict(mod1, test))

2. Logistic Regression

mod2 <- glm(factor(species) ~ x + y, binomial(), train)
# glm models P(species_2), so TRUE in the tables below corresponds to species_2
table(train$species, mod2$fitted.values > 0.5)
table(test$species, predict(mod2, test, type = 'response') > 0.5)

3. SVM (support vector machine)

mod3 <- e1071::svm(factor(species)~x+y, train)
table(train$species, predict(mod3, train))
table(test$species, predict(mod3, test))

4. LDA (Linear Discriminant Analysis)

mod4 <- MASS::lda(species ~ x + y, train)  # fit on train only, not on df, to keep the test rows unseen
table(train$species, predict(mod4, train)$class)
table(test$species, predict(mod4, test)$class)

5. Random Forests

mod5 <- randomForest::randomForest(factor(species) ~ x + y, train)  # fit on train only, not on df
table(train$species, predict(mod5, train))
table(test$species, predict(mod5, test))
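The size imbalance raised in the question (100 vs 10,000 observations) can be handled in random forests by stratified downsampling of the majority class for each tree. A sketch using `randomForest`'s `sampsize` argument; the counts here are illustrative (the minority class has only 70 classified rows):

```r
# Draw a balanced sample per tree: 70 cases from each species
mod5b <- randomForest::randomForest(
  factor(species) ~ x + y, train,
  sampsize = c(species_1 = 70, species_2 = 70)
)
table(test$species, predict(mod5b, test))
```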

6. KNN (K-Nearest Neighbours)

table(test$species, class::knn(train[c('x', 'y')], test[c('x', 'y')], train$species, k = 5))
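Since the simulated test rows keep their true species, the methods above can also be compared by test-set accuracy instead of eyeballing confusion tables; a minimal sketch, assuming `mod1`, `mod3`, and `mod5` from above are fitted:

```r
# Fraction of test rows predicted correctly
accuracy <- function(pred) mean(pred == test$species)
accuracy(predict(mod1, test))  # naive Bayes
accuracy(predict(mod3, test))  # SVM
accuracy(predict(mod5, test))  # random forest
```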

Other methods:

  • Decision Trees
  • Adaboost
  • XGboost
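A decision tree, the first of the methods listed above, can be fitted in the same pattern; a minimal sketch using the `rpart` package (not shown in the answer itself):

```r
# Classification tree on the two coordinates
mod6 <- rpart::rpart(factor(species) ~ x + y, train, method = "class")
table(test$species, predict(mod6, test, type = "class"))
```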

Upvotes: 1
