TobKel

Reputation: 1453

Classify population based on geographic distribution with machine learning

I have a dataset of an animal population "1" with 100 observations and an animal population "2" with 10,000 observations. For every observation the geographic coordinates are available, in my case the x and y values. 70% of each dataset is classified and 30% is unclassified. I now want to use the 70% classified observations to train a model that predicts the class (species) of the unclassified 30%.
I am posting this to ask the following questions:

  1. I would use supervised machine-learning classification, perhaps support vector machines (SVM) or random forest (RF). Or does anyone suggest a completely different method, maybe clustering?
  2. Another important consideration: what else should I keep in mind in the analysis? What could be problematic, for example the very different sizes of the two datasets?

You can see my code for the creation of the example dataset and the visualization below. The black points represent the unclassified observations.

library(tidyverse)

set.seed(123)

near <- 10   #Proximity of the two populations
rate <- 0.7  #Proportion of the classified data set

df <- bind_rows(
  tibble(
    x = rnorm(n = 100, mean = 100, sd = 2),
    y = rnorm(n = 100, mean = 100, sd = 2),
    species = "species_1",
    classification = c(rep("classified", rate*100), rep("unclassified", 100-(rate*100)))
  ),
  tibble(
    x = rnorm(n = 10000, mean = 100, sd = 2) + near,
    y = rnorm(n = 10000, mean = 100, sd = 2) - near,
    species = "species_2",
    classification = c(rep("classified", rate*10000), rep("unclassified", 10000-(rate*10000)))
  )
)

ggplot() +
  geom_point(data = filter(df, classification == "classified"),
             aes(x = x, y = y, color = species)) +
  geom_point(data = filter(df, classification == "unclassified"),
             aes(x = x, y = y), color = "black")

[Scatter plot of the two simulated populations; the black points are the unclassified observations.]

Upvotes: 0

Views: 55

Answers (1)

Onyambu

Reputation: 79338

Set up the train and test datasets. (Only because the data are simulated do the "unclassified" rows still carry their true species; that is what makes the test-set confusion tables below possible.)

test <- subset(df, classification == 'unclassified')
train <- subset(df, classification == 'classified')

1. Naive Bayes

mod1 <- e1071::naiveBayes(species ~ x + y, train)
table(train$species, predict(mod1, train))
table(test$species, predict(mod1, test))

2. Logistic Regression

mod2 <- glm(factor(species) ~ x + y, binomial(), train)
# glm models P(species_2), so TRUE in the tables below corresponds to species_2
table(train$species, mod2$fitted.values > 0.5)
table(test$species, predict(mod2, test, type = 'response') > 0.5)

3. SVM (support vector machine)

mod3 <- e1071::svm(factor(species)~x+y, train)
table(train$species, predict(mod3, train))
table(test$species, predict(mod3, test))

4. LDA (Linear Discriminant Analysis)

mod4 <- MASS::lda(species ~ x + y, train)  # fit on train only, not on df, to keep the test rows unseen
table(train$species, predict(mod4, train)$class)
table(test$species, predict(mod4, test)$class)

5. Random Forests

mod5 <- randomForest::randomForest(factor(species) ~ x + y, train)  # fit on train only, not on df
table(train$species, predict(mod5, train))
table(test$species, predict(mod5, test))
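The size imbalance raised in the question (100 vs 10,000 observations) can be handled in random forests by stratified downsampling of the majority class for each tree. A sketch using `randomForest`'s `sampsize` argument; the counts here are illustrative (the minority class has only 70 classified rows):

```r
# Draw a balanced sample per tree: 70 cases from each species
mod5b <- randomForest::randomForest(
  factor(species) ~ x + y, train,
  sampsize = c(species_1 = 70, species_2 = 70)
)
table(test$species, predict(mod5b, test))
```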

6. KNN (K-Nearest Neighbours)

table(test$species, class::knn(train[c('x', 'y')], test[c('x', 'y')], train$species, k = 5))
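Since the simulated test rows keep their true species, the methods above can also be compared by test-set accuracy instead of eyeballing confusion tables; a minimal sketch, assuming `mod1`, `mod3`, and `mod5` from above are fitted:

```r
# Fraction of test rows predicted correctly
accuracy <- function(pred) mean(pred == test$species)
accuracy(predict(mod1, test))  # naive Bayes
accuracy(predict(mod3, test))  # SVM
accuracy(predict(mod5, test))  # random forest
```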

Other methods:

  • Decision Trees
  • Adaboost
  • XGboost
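A decision tree, the first of the methods listed above, can be fitted in the same pattern; a minimal sketch using the `rpart` package (not shown in the answer itself):

```r
# Classification tree on the two coordinates
mod6 <- rpart::rpart(factor(species) ~ x + y, train, method = "class")
table(test$species, predict(mod6, test, type = "class"))
```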

Upvotes: 1
