Reputation: 11935
I'm reading a tutorial about SVMs.
The author writes:
The Support Vector Machine, in general, handles pointless data better than the K Nearest Neighbors algorithm
What does he mean by "pointless data"?
Upvotes: 0
Views: 112
Reputation: 3527
In this context, it describes data that no classification decision should be based on. In this particular case, the author refers to an ID column which contains a row identifier. They deem this data irrelevant to the decision task and therefore call it "meaningless" and even "misleading".
It's easier to understand with more context from the article (emphasis mine):
Note that if we comment out the drop id column part, accuracy goes back down into the 60s. The Support Vector Machine, in general, handles pointless data better than the K Nearest Neighbors algorithm, and definitely will handle outliers better, but, in this example, the meaningless data is still very misleading for us.
This is further corroborated in an earlier part of the series (emphasis mine):
The result should be about 95%, and that's out of the box without any tweaking. Very cool! Just for show, let's show what happens when we do indeed include truly meaningless and misleading data by commenting out the dropping of the id column:
Whether or not that assessment is correct depends on the actual dataset. If the remaining data is enough to get satisfying results, then it's probably a good idea to remove such a column. On the other hand, it's possible to imagine a hypothetical example where the ID column is generated along with the data and contains an auto-incremented integer. In that case it holds information about the order of the entries. If the dataset happens to contain no other sequence information (e.g. timestamps), then the ID column may not be meaningless.
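To illustrate why such a column can mislead a distance-based model like K Nearest Neighbors, here is a minimal sketch in plain Python. The rows, the ids, and the helper `loo_accuracy_1nn` are made up for illustration and are not from the tutorial. The ids alternate between the two classes, so they say nothing about the label, yet their scale dwarfs the one real feature.

```python
import math

# Hypothetical toy rows: (id, feature, label).
# The ids alternate between classes, so they carry no class information,
# but their spacing (10) is much larger than the spread of the feature.
rows = [
    (1000, 1.0, 0),
    (1010, 5.0, 1),
    (1020, 1.2, 0),
    (1030, 5.1, 1),
    (1040, 0.9, 0),
    (1050, 4.8, 1),
]


def loo_accuracy_1nn(points, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier (Euclidean distance)."""
    correct = 0
    for i, p in enumerate(points):
        # Index of the closest other point
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(points)


labels = [label for _, _, label in rows]
with_id = [(row_id, x) for row_id, x, _ in rows]   # id column kept
without_id = [(x,) for _, x, _ in rows]            # id column dropped

acc_with_id = loo_accuracy_1nn(with_id, labels)
acc_without_id = loo_accuracy_1nn(without_id, labels)
print(acc_with_id, acc_without_id)
```

On this toy data the 1-nearest-neighbor accuracy is 0.0 with the id column and 1.0 without it, because the Euclidean distance is dominated by the id. It is a contrived case meant to show the mechanism, not a reproduction of the tutorial's numbers.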
Upvotes: 2
Reputation: 2523
The sentence refers to the one just before it:
Note that if we comment out the drop id column part, accuracy goes back down into the 60s.
and to the K Nearest Neighbors tutorial, which investigates how model performance changes when 'useless' data (aka noise), such as the indices of the data points, is fed to the model as input:
[...] let's show what happens when we do indeed include truly meaningless and misleading data by commenting out the dropping of the id column
The conclusion here is that SVMs handle meaningless features, noise, or 'pointless data' in the input better than KNN does.
Upvotes: 2