Ranjana Girish

Reputation: 473

Text classification of a large dataset in Python

I have 2.2 million data samples to classify into more than 7,500 categories. I am using pandas and scikit-learn in Python to do so.

Below is a sample of my dataset:

itemid      description                                         category
11802974    SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters      Architectural Diffusers
10688548    ANTIQUE BRONZE FINISH PUSHBUTTON switch             Door Bell Pushbuttons
9836436     Descente pour Cable tray fitting and accessories    Tray Cable Drop Outs

Below are the steps I have followed:

  1. Pre-processing
  2. Vector representation
  3. Training

     import pandas as pd
     from nltk.corpus import stopwords
     from nltk.stem import WordNetLemmatizer
     from nltk.tokenize import word_tokenize
     from sklearn.feature_extraction.text import CountVectorizer

     dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=False)

     # Keep letters only (this also removes digits), then lowercase
     dataset['description'] = dataset['description'].str.replace(r'[^a-zA-Z]', ' ', regex=True)
     dataset['description'] = dataset['description'].str.lower()

     stop = stopwords.words('english')
     lemmatizer = WordNetLemmatizer()

     # Strip stopwords and collapse repeated whitespace
     dataset['description'] = dataset['description'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ', regex=True)
     dataset['description'] = dataset['description'].str.replace(r'\s\s+', ' ', regex=True)
     dataset['description'] = dataset['description'].apply(word_tokenize)

     # Lemmatize each token once per part-of-speech tag; note that set()
     # deduplicates tokens but loses their original order
     ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
     POS_LIST = [NOUN, VERB, ADJ, ADV]
     for tag in POS_LIST:
         dataset['description'] = dataset['description'].apply(
             lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
     dataset['description'] = dataset['description'].apply(lambda x: " ".join(x))

     countvec = CountVectorizer(min_df=0.0005)
     documenttermmatrix = countvec.fit_transform(dataset['description'])
     column = countvec.get_feature_names()

     y_train = dataset['category'].tolist()

     # Free objects that are no longer needed
     del dataset
     del stop
     del tag


The documenttermmatrix generated is a SciPy CSR sparse matrix with about 12,000 features and 2.2 million samples.
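
As a quick sanity check, the sparse matrix's shape and footprint can be inspected directly. This is a minimal sketch using the variables from the code above; the expected shape is only an assumption based on the figures quoted:

print(type(documenttermmatrix))  # scipy CSR sparse matrix
print(documenttermmatrix.shape)  # roughly (2200000, 12000), per the figures above

# Memory held by the CSR matrix itself: its data, indices, and indptr arrays
mem_bytes = (documenttermmatrix.data.nbytes
             + documenttermmatrix.indices.nbytes
             + documenttermmatrix.indptr.nbytes)
print("%.2f GB" % (mem_bytes / 1e9))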

For training, I tried XGBoost through its scikit-learn-compatible API:

model = XGBClassifier(silent=False, n_estimators=500, objective='multi:softmax', subsample=0.8)
model.fit(documenttermmatrix, y_train, verbose=True)

After 2-3 minutes of execution, the above code raised this error:

OSError: [WinError 541541187] Windows Error 0x20474343

I also tried scikit-learn's Naive Bayes, which raised a MemoryError.
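
For completeness, a minimal sketch of the kind of call meant here, assuming MultinomialNB (the question does not say which Naive Bayes variant was used):

from sklearn.naive_bayes import MultinomialNB

# Assumed reconstruction of the attempt; the MemoryError would be
# raised inside fit() on data of this size
nbmodel = MultinomialNB()
nbmodel.fit(documenttermmatrix, y_train)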

Question

I have used a SciPy sparse matrix, which consumes very little memory, and I am deleting all unused objects before running XGBoost or Naive Bayes. I am using a system with 128 GB of RAM but still get memory issues while training.

I am new to Python. Is there anything wrong in my code? Can anyone tell me how I can use memory efficiently and proceed further?

Upvotes: 4

Views: 784

Answers (1)

user9048861

Reputation:

I think I can explain the problem in your code. The OS error appears to be:

"

ERROR_DS_RIDMGR_DISABLED
8263 (0x2047)

The directory service detected the subsystem that allocates relative identifiers is disabled. This can occur as a protective mechanism when the system determines a significant portion of relative identifiers (RIDs) have been exhausted.

" via https://msdn.microsoft.com/en-us/library/windows/desktop/ms681390

I think you exhausted a significant portion of the RIDs at this step in your code:

dataset['description'] = dataset['description'].apply(lambda x:
    list(set([lemmatizer.lemmatize(item, tag) for item in x])))

You're passing a lemmatizer in your lambda, but lambdas are anonymous, so it looks like you might be making 2.2 million copies of that lemmatizer at runtime.

You should try changing the low_memory flag in read_csv to True whenever you have a memory issue.
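
A minimal sketch of that change (low_memory=True lets pandas parse the CSV in internal chunks, which can lower peak memory during the read itself):

dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=True)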

Response to comment:

I checked the pandas documentation, and you can define a function outside of dataset['description'].apply(), then reference that function in the call to dataset['description'].apply(). Here is how I would write that function:

def lemmatize_descriptions(x):
    return list(set([lemmatizer.lemmatize(item, tag) for item in x]))

Then, the call to apply() would be:

dataset['description'] = dataset['description'].apply(lemmatize_descriptions)

Here is the documentation.

Upvotes: 6
