gal007

Reputation: 7182

Labels of datasets imported with sklearn.datasets.load_files

I'm wondering how to match the labels produced by an SVM classifier with the ones in my dataset. Then I realized that the problem starts at the beginning: when I load the dataset, I get an object which in my case has the following properties:

.data = the news text
.target_names = the labels used in the dataset, e.g. ["positive", "negative"]
.target = an array with one number per news item, encoding its label.

But I'm wondering whether the order of target_names can differ across datasets (with the same tags but different news), and whether the order of the .data elements influences it.

Is there an easy way to find out which label a number in the .target array stands for? (I mean, what does 0 or 1 represent in that array?)

Best,

Upvotes: 5

Views: 1304

Answers (1)

rvf

Reputation: 1449

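The label for a numeric value v in .target is available as .target_names[v], so the label of the i-th document is .target_names[.target[i]]. With the list in your example, .target_names[1] is "negative". (Note that folders named exactly "positive" and "negative" would be loaded in sorted order, i.e. as ["negative", "positive"].)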
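A minimal sketch of that lookup, assuming scikit-learn is installed. The folder names "sports" and "business" and the sample texts are made up for illustration; shuffle=False is passed only so the document order is easy to follow.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical layout: one sub-folder per label, one text file per news item.
samples = {
    "sports": ["The home team won the final."],
    "business": ["Shares fell sharply today.", "The merger was approved."],
}

with tempfile.TemporaryDirectory() as root:
    for label, texts in samples.items():
        folder = Path(root) / label
        folder.mkdir()
        for n, text in enumerate(texts):
            (folder / f"{n}.txt").write_text(text, encoding="utf-8")

    # The file contents are read here, so the temporary directory
    # is no longer needed once load_files() returns.
    dataset = load_files(root, encoding="utf-8", shuffle=False)

# .target holds integer codes; .target_names maps each code to a folder name.
labels = [dataset.target_names[code] for code in dataset.target]

for text, code, label in zip(dataset.data, dataset.target, labels):
    print(code, label, text)
```

A classifier trained on dataset.target should return the same integer codes from predict(), so the same lookup can be applied to its predictions.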

The order of the target names will be the same across different datasets, as long as the tags are exactly the same. This is because sklearn.datasets.load_files() creates the tags from the sorted folder names, as we can see in the source code (v0.20.x):

[...]
folders = [f for f in sorted(listdir(container_path))
           if isdir(join(container_path, f))]

if categories is not None:
    folders = [f for f in folders if f in categories]

for label, folder in enumerate(folders):
    target_names.append(folder)
[...]
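The sorting behaviour in the excerpt can be tried out with the standard library alone. The following is a sketch that mirrors the excerpt, not scikit-learn's actual code path. The function name folder_label_order and the folder names are hypothetical.

```python
import tempfile
from os import listdir, mkdir
from os.path import isdir, join


def folder_label_order(container_path, categories=None):
    """Return folder names in the order the excerpt assigns integer labels."""
    folders = [f for f in sorted(listdir(container_path))
               if isdir(join(container_path, f))]
    if categories is not None:
        folders = [f for f in folders if f in categories]
    return folders


with tempfile.TemporaryDirectory() as root:
    # Folders are created out of alphabetical order on purpose.
    for name in ["sports", "business", "politics"]:
        mkdir(join(root, name))
    # A stray file next to the folders is skipped by the isdir() check.
    open(join(root, "README.txt"), "w").close()

    all_names = folder_label_order(root)
    subset = folder_label_order(root, categories=["sports", "business"])

print(all_names)  # ['business', 'politics', 'sports']
print(subset)     # ['business', 'sports']
```

Position in the returned list is the integer label, so the mapping depends only on the folder names, not on creation order or on the documents inside. Restricting categories does change the numbering: "sports" is 2 in the full list but 1 in the subset.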

I'd still suggest always retrieving the label from target_names of the current dataset to be on the safe side (implementations may change over time, etc.).

Upvotes: 6
