Reputation: 473
I have 2.2 million data samples to classify into more than 7,500 categories. I am using pandas and scikit-learn in Python to do so.
Below is a sample of my dataset:
itemid description category
11802974 SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters Architectural Diffusers
10688548 ANTIQUE BRONZE FINISH PUSHBUTTON switch Door Bell Pushbuttons
9836436 Descente pour Cable tray fitting and accessories Tray Cable Drop Outs
Below are the steps I have followed:
Training
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=False)

# keep letters only, drop digits, lowercase
dataset['description'] = dataset['description'].str.replace(r'[^a-zA-Z]', ' ')
dataset['description'] = dataset['description'].str.replace(r'[\d]', ' ')
dataset['description'] = dataset['description'].str.lower()

# remove English stop words and collapse repeated whitespace
stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
dataset['description'] = dataset['description'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['description'] = dataset['description'].str.replace(r'\s\s+', ' ')

# tokenize, then lemmatize the unique tokens once per POS tag
dataset['description'] = dataset['description'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['description'] = dataset['description'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['description'] = dataset['description'].apply(lambda x: " ".join(x))

# build the sparse document-term matrix and the label list
countvec = CountVectorizer(min_df=0.0005)
documenttermmatrix = countvec.fit_transform(dataset['description'])
column = countvec.get_feature_names()
y_train = dataset['category'].tolist()

# free objects that are no longer needed
del dataset
del stop
del tag
The documenttermmatrix generated is a SciPy CSR sparse matrix with about 12k features and 2.2 million samples.
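For reference, the in-memory size of a CSR matrix can be estimated by summing its internal arrays (a minimal sketch using standard SciPy CSR attributes; the printed numbers are only illustrative):

# rough in-memory footprint of the CSR matrix, in megabytes
mem_bytes = (documenttermmatrix.data.nbytes
             + documenttermmatrix.indices.nbytes
             + documenttermmatrix.indptr.nbytes)
print(documenttermmatrix.shape, mem_bytes / 1024 ** 2, "MB")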
For training I tried XGBoost via its scikit-learn wrapper (XGBClassifier):
from xgboost import XGBClassifier

model = XGBClassifier(silent=False, n_estimators=500, objective='multi:softmax', subsample=0.8)
model.fit(documenttermmatrix, y_train, verbose=True)
After 2-3 minutes of running the above code, I got this error:
OSError: [WinError 541541187] Windows Error 0x20474343
I also tried scikit-learn's Naive Bayes, for which I got a memory error.
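A Naive Bayes attempt on the same matrix would look roughly like this (a sketch only; the exact variant used is not stated above, MultinomialNB is assumed here because it accepts the sparse count matrix directly):

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(documenttermmatrix, y_train)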
Question
I have used a SciPy sparse matrix, which consumes very little memory, and I am also deleting all unused objects before running XGBoost or Naive Bayes. I am using a system with 128 GB of RAM, but I still get memory issues while training.
I am new to Python. Is there anything wrong in my code? Can anyone tell me how I can use memory efficiently and proceed further?
Upvotes: 4
Views: 784
Reputation:
I think I can explain the problem in your code. The OS error appears to be:
"
ERROR_DS_RIDMGR_DISABLED
8263 (0x2047)
The directory service detected the subsystem that allocates relative identifiers is disabled. This can occur as a protective mechanism when the system determines a significant portion of relative identifiers (RIDs) have been exhausted.
" via https://msdn.microsoft.com/en-us/library/windows/desktop/ms681390
I think you exhausted a significant portion of the RIDs at this step in your code:
dataset['description'] = dataset['description'].apply(
    lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
You're passing a lemmatizer in your lambda, but lambdas are anonymous, so it looks like you might be making 2.2 million copies of that lemmatizer at runtime.
You should try setting the low_memory flag to True whenever you have a memory issue.
Response to comment:
I checked the pandas documentation, and you can define a function outside of dataset['description'].apply() and then reference that function in the call to dataset['description'].apply(). Here is how I would write said function:
def lemmatize_descriptions(x):
    return list(set([lemmatizer.lemmatize(item, tag) for item in x]))
Then, the call to apply() would be:
dataset['description'] = dataset['description'].apply(lemmatize_descriptions)
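If you also want to avoid having the function depend on lemmatizer and tag as globals, one common pattern (not part of the original answer, just a sketch) is to bind them explicitly with functools.partial:

from functools import partial

def lemmatize_descriptions(x, lemmatizer, tag):
    return list(set([lemmatizer.lemmatize(item, tag) for item in x]))

for tag in POS_LIST:
    dataset['description'] = dataset['description'].apply(
        partial(lemmatize_descriptions, lemmatizer=lemmatizer, tag=tag))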
Upvotes: 6