Alon

Reputation: 11935

What is pointless data?

I'm reading a tutorial about SVMs.

The author writes:

The Support Vector Machine, in general, handles pointless data better than the K Nearest Neighbors algorithm

What does he mean by "pointless data"?

Upvotes: 0

Views: 112

Answers (2)

snwflk

Reputation: 3527

In this context, it describes data that a classification decision should not be based on. Here, the author is referring to an ID column that contains a row identifier. They consider this data irrelevant to the classification task and therefore call it "meaningless" and even "misleading".

It's easier to understand with more context from the article (emphasis mine):

Note that if we comment out the drop id column part, accuracy goes back down into the 60s. The Support Vector Machine, in general, handles pointless data better than the K Nearest Neighbors algorithm, and definitely will handle outliers better, but, in this example, the meaningless data is still very misleading for us.

This is further corroborated in an earlier part of the series (emphasis mine):

The result should be about 95%, and that's out of the box without any tweaking. Very cool! Just for show, let's show what happens when we do indeed include truly meaningless and misleading data by commenting out the dropping of the id column:

Discussion

Whether that assessment is correct depends on the actual dataset. If enough data has been collected to get satisfying results without it, then it's probably a good idea to remove such a column. On the other hand, imagine a hypothetical case where the ID column is generated along with the data and contains an auto-incremented integer. It then holds information about the order in which the entries were created. If the dataset happens to contain no other sequence information (e.g. timestamps), then the ID column may not be meaningless.

Upvotes: 2

Tinu

Reputation: 2523

The sentence refers to the one right before it:

Note that if we comment out the drop id column part, accuracy goes back down into the 60s.

and to the KNearestNeighbors tutorial, which investigates how model performance changes when 'useless' data (i.e. noise), such as the indices of the data points, is fed to the model as input:

[...] let's show what happens when we do indeed include truly meaningless and misleading data by commenting out the dropping of the id column

The tutorial's conclusion is that SVMs generally handle meaningless features, noise, or 'pointless data' in the input better than KNN does.
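
To make the KNN side of this concrete, here is a minimal sketch in plain Python. It is not the tutorial's code: the rows, the ids, and the `nearest_label` helper are made up for illustration. KNN classifies by distance, so an unscaled id column with values in the thousands can swamp the real features:

```python
import math

# Hypothetical toy data: (id, feature_1, feature_2, label).
# The ids are arbitrary row identifiers and carry no class information.
rows = [
    (1000, 1.0, 1.2, "benign"),
    (5000, 1.1, 0.9, "benign"),
    (1040, 8.0, 9.0, "malignant"),
    (5020, 9.0, 8.5, "malignant"),
]

def nearest_label(query, rows, use_id):
    """Return the label of the single nearest row (1-NN, Euclidean distance)."""
    def features(r):
        # Either (id, feature_1, feature_2) or just (feature_1, feature_2).
        return r[:3] if use_id else r[1:3]
    best = min(rows, key=lambda r: math.dist(features(query), features(r)))
    return best[3]

# The query's real features sit right next to the "benign" rows,
# but its id happens to be numerically close to a "malignant" row.
query = (1035, 1.05, 1.0)

with_id = nearest_label(query, rows, use_id=True)
without_id = nearest_label(query, rows, use_id=False)
print("with id:", with_id)
print("without id:", without_id)
```

With the id included, the query is matched to the row whose id happens to be close (1040), even though its real features are far away; with the id dropped, it is matched to a row with similar features. This only illustrates the distance effect in KNN and says nothing by itself about how an SVM would behave on the same data.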

Upvotes: 2
