prfarlow

Reputation: 4361

load_files in scikit-learn not loading all files in directory

I have a folder called 'emails' with two subfolders, each named after the class label of the files it contains (spam or notspam; all are .txt files). There are 3000 files across the two subfolders. Using load_files:

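from sklearn.datasets import load_files

# shuffle takes a boolean; the string 'False' is truthy and would still shuffle
data = load_files('emails', shuffle=False)
print(len(data))
print(len(data.target))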

This prints '5' and then '3000'. How can the length of data only be 5 if it found 3000 classification labels?

Upvotes: 1

Views: 2344

Answers (1)

Bidhan Bhattarai

Reputation: 1060

The file contents are stored in data.data and the labels in data.target. Try print(len(data.data)) instead.

load_files() simply returns a sklearn.datasets.base.Bunch (exposed as sklearn.utils.Bunch in newer releases), which is a dict subclass whose keys can also be read as attributes. So, data is in this format:

{
 'DESCR': None,
 'data': [],
 'filenames': array(),
 'target': array(),
 'target_names': []
}

This is why len(data) returns 5: it counts the five keys of the container, not the files that were loaded.
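To see the effect without scikit-learn or the emails folder, here is a minimal sketch. MiniBunch is a hypothetical stand-in that I am assuming behaves like Bunch for this purpose (a dict with attribute access); it is not the library's class, and the contents are made-up toy values.

```python
class MiniBunch(dict):
    """Hypothetical stand-in for Bunch: a dict whose keys are also attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


# Toy contents shaped like the result of load_files (all values are made up).
data = MiniBunch(
    DESCR=None,
    data=[b"buy now", b"meeting at 3", b"free prize"],
    filenames=["emails/spam/1.txt", "emails/notspam/2.txt", "emails/spam/3.txt"],
    target=[1, 0, 1],
    target_names=["notspam", "spam"],
)

print(len(data))         # number of keys in the container: 5
print(len(data.data))    # number of documents: 3
print(len(data.target))  # number of labels: 3
print(sorted(data))      # the names of the five keys
```

With the real result of load_files, len(data.data) and len(data.target) should both match the number of files that were read.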

Hope this helps!

Upvotes: 3
