Rahul Bansal

Reputation: 304

How do I approach this machine learning/NLP context aware text classification project? See the description below

I am a newbie in machine learning and NLP, and I need help with my college project. It's actually a subtask of a larger project. The description is as follows:

It is a classification problem. I will be given an item and I have to predict the shop type from where the item can be bought.
Examples:
Item Class-label (shop-type)
Pencil -> book store
Beer -> bar
Cash -> ATM
Tube Light -> electronics store
Medicine -> pharmacy
I am given many different class labels, such as stationary_shop, book_store, bakery, pharmacy, etc. (around 50 labels).

Problems faced-
1. I don't have a large dataset. I built a small one myself.
2. I don't know much about machine learning and NLP techniques, so I am not sure how to approach this problem.
3. How do I make correct predictions? For example, if (pencil,book_store) is in my dataset and the input is sharpener, the model should predict book_store, since sharpener is closely related to pencil.

My approaches:
I started with a small dataset and then used the Datamuse API to extend it by finding related words for each word. For example, I extracted all words related to pencil from the API and tagged them with the same label, book_store. Then I used fastText to train a prediction model, but I am not getting the expected results.

Small Dataset
Dataset format : (example,class_label)

soap,department_store
pencil,book_store
pen,book_store
tea,department_store
coffee,department_store
bulb,electronics_store
battery,electronics_store
tubelight,electronics_store
medicine,pharmacy
book,book_store
money,bank
cash,atm
flowers,florist
fruits,grocery_or_supermarket
cake,bakery
clothes,clothing_store
paper,book_store
jewellery,jewelry_store
shampoo,department_store
oil,department_store
sugar,department_store
beer,bar
whisky,bar
alcohol,bar
haircut,beauty_salon
coffee,cafe
sandwich,cafe
pastry,bakery
suit,clothing_store
shoes,shoe_store
sofa,furniture_store
chair,furniture_store
bed,furniture_store
petrol,gas_station
diesel,gas_station
tools,hardware_store
pipe,hardware_store
tank,hardware_store
washing,laundry
drycleaning,laundry
necklace,jewelry_store
ring,jewelry_store
ornament,jewelry_store
dinner,restaurant
lunch,restaurant
pet,veterinary_care
chips,department_store

Upvotes: 1

Views: 298

Answers (1)

outlier

Reputation: 351

Since your problem is a text classification task, first split your data into a training set (70%) and a testing set (30%). Then work through your (example,class_label) dataset format as follows.
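A minimal sketch of a 70/30 split, using only the standard library. The helper name `split_rows`, the seed, and the choice of rows (a handful copied from the question's dataset) are assumptions made for illustration.

```python
import random

# A few (example, class_label) rows taken from the question's dataset.
rows = [
    ("soap", "department_store"),
    ("pencil", "book_store"),
    ("pen", "book_store"),
    ("tea", "department_store"),
    ("bulb", "electronics_store"),
    ("battery", "electronics_store"),
    ("medicine", "pharmacy"),
    ("book", "book_store"),
    ("beer", "bar"),
    ("whisky", "bar"),
]


def split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 7 3
```

With only one or two rows per label, a random split can leave some labels without any training example, so a larger dataset is likely needed before the split is meaningful.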

Step (1): The class labels are also text, so map them to numbers, for example department_store = 1, book_store = 2, electronics_store = 3, shoe_store = 4, and so on for every label.
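The label mapping in step (1) could look roughly like this. It is a plain-Python sketch; the variable names are illustrative, and scikit-learn's `LabelEncoder` does a similar job (it numbers classes from 0 in sorted order rather than from 1 in order of appearance).

```python
# Example labels in the order they might appear in the dataset.
labels = [
    "department_store",
    "book_store",
    "book_store",
    "electronics_store",
    "shoe_store",
    "department_store",
]

# Assign 1, 2, 3, ... to each label the first time it is seen.
label_to_id = {}
for label in labels:
    if label not in label_to_id:
        label_to_id[label] = len(label_to_id) + 1

# Numeric targets for training, plus the reverse map to decode predictions.
y = [label_to_id[label] for label in labels]
id_to_label = {idx: label for label, idx in label_to_id.items()}

print(label_to_id)
# {'department_store': 1, 'book_store': 2, 'electronics_store': 3, 'shoe_store': 4}
print(y)
# [1, 2, 2, 3, 4, 1]
```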

Step (2): The examples in your (example,class_label) dataset are text as well, so they also have to be converted to numbers (machine learning algorithms work on numeric data, so all text has to be turned into a numeric representation). For this, use CountVectorizer(); see its documentation for how to perform feature extraction.
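A small sketch of step (2), assuming scikit-learn's `CountVectorizer` with its default settings. The item list is illustrative. It also shows what happens to a word that was not in the training data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Item names to turn into bag-of-words count vectors.
items = ["pencil", "pen", "tube light", "medicine"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(items)  # sparse matrix, one row per item

# The learned vocabulary: one column per distinct word.
print(sorted(vectorizer.vocabulary_))
# ['light', 'medicine', 'pen', 'pencil', 'tube']
print(X.shape)
# (4, 5)

# A word that never appeared during fitting has no column of its own,
# so its vector is all zeros.
unseen = vectorizer.transform(["sharpener"])
print(unseen.toarray())
# [[0 0 0 0 0]]
```

Because an unseen word such as sharpener maps to an all-zero vector, word counts alone carry no information linking it to pencil.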

After extracting features from the text, use any classification algorithm. Remember that this is a multi-class problem, because your dataset has many classes. Several algorithms, SVM among them, are binary by nature, so the multi-class case is handled with a one-vs-one or one-vs-rest strategy; see the documentation on multi-class classification.

I would prefer a support vector machine (SVM) here, since your dataset is small. Train it on the training portion (70% of your total data). For testing, apply step (2) to the remaining 30% of your data, using the vectorizer fitted on the training data.
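Putting the steps together, a rough end-to-end sketch might look like the following. It assumes scikit-learn; the choice of `LinearSVC` (which uses a one-vs-rest scheme for multiple classes), the `random_state`, and the subset of rows are assumptions rather than the only way to do it. scikit-learn classifiers accept string labels directly, so the numeric mapping is left out here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# A subset of the (example, class_label) rows from the question.
data = [
    ("soap", "department_store"), ("tea", "department_store"),
    ("shampoo", "department_store"), ("sugar", "department_store"),
    ("pencil", "book_store"), ("pen", "book_store"),
    ("book", "book_store"), ("paper", "book_store"),
    ("bulb", "electronics_store"), ("battery", "electronics_store"),
    ("tubelight", "electronics_store"), ("beer", "bar"),
    ("whisky", "bar"), ("alcohol", "bar"),
    ("sofa", "furniture_store"), ("chair", "furniture_store"),
    ("bed", "furniture_store"), ("necklace", "jewelry_store"),
    ("ring", "jewelry_store"), ("ornament", "jewelry_store"),
]
items = [item for item, _ in data]
labels = [label for _, label in data]

# 70% of the rows for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    items, labels, test_size=0.3, random_state=0
)

# The pipeline fits the vectorizer on the training text only and reuses
# that same vocabulary when transforming the test text.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(list(zip(X_test, predictions)))
print("accuracy:", accuracy)
```

With single-word items, every held-out word is new to the vectorizer, so accuracy on the test portion is expected to be poor. The sketch shows the mechanics of the workflow rather than a working solution to the sharpener example.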

Upvotes: 1
