Reputation: 415
I am doing my first project in the NLP domain: sentiment analysis of a dataset with ~250 tagged English data points/sentences. The dataset consists of reviews of a pharmaceutical product with positive, negative, or neutral tags. I have worked with numeric data in supervised learning for 3 years, but NLP is uncharted territory for me. So I want to know the pre-processing techniques and steps that are best suited to my problem. A guideline from an NLP expert would be much appreciated!
Upvotes: 0
Views: 543
Reputation: 10826
Based on your comment on mohammad karami's answer, the part you haven't understood is the paragraph or sentence representation (you said "converting to numeric is the real question"). With numerical data, you might have a table with 2 feature columns and a label, say "work experience", "age", and the label "salary" (to predict a salary from age and work experience). In NLP, the features are usually at the word level (sometimes at the character or subword level); these features are called tokens, and the table columns are replaced with these tokens. The simplest way to build a paragraph representation is a bag of words: after preprocessing, every unique word is mapped to a column. So suppose our training data has 2 rows (two short sentences).
The unique words become the columns, so the table might look like:
I | help | you | and | should | me
Each sample then becomes an array of counts, one per column.
Notice that the first element of the array is 1 for both samples, because both contain the word "I" exactly once. Now look at the second element: it is 2 on the first row and 0 on the second row, because the word "help" occurs twice in the first row and never in the second. The logic the model learns would be something like "if word A, word B, ... exist and word H, word I, ... don't exist, then the label is positive".
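Below is a minimal sketch of this bag-of-words construction using scikit-learn's CountVectorizer (one common way to do it; the answer doesn't prescribe a library). The two sentences are invented for illustration and only match the counts described above for "I" and "help"; also note that scikit-learn sorts the vocabulary alphabetically, so the column order will differ from the table above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example sentences: "help" occurs twice in the first and never in
# the second, and "I" occurs once in both, matching the description above.
docs = [
    "I should help you and you should help me",
    "you and me should wait, so should I",
]

# Relax the token pattern so one-letter words such as "I" are kept;
# scikit-learn's default pattern drops single-character tokens.
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the unique words = the columns
print(X.toarray())                         # one count vector per sentence
```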
Bag of words works well most of the time, but it has problems: dimensionality (imagine four billion unique words; that is far too many features), it doesn't take word order into account, every occurrence of a word is represented the same way regardless of context, and more besides. The current state of the art for NLP is BERT; learn that if you want to use what's best.
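If you do go the BERT route, here is a quick, hedged sketch using the Hugging Face transformers library (one common way to run BERT-style models; it is an assumption that this fits your setup). The default model behind this pipeline is a binary English sentiment model, so for your three classes (positive/negative/neutral) you would normally fine-tune a pretrained checkpoint on your own labels instead.

```python
from transformers import pipeline

# Downloads a pretrained binary (positive/negative) English sentiment model;
# a 3-class task would need fine-tuning on your own labelled reviews.
classifier = pipeline("sentiment-analysis")
print(classifier("This medicine worked well, but the side effects were rough."))
```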
Upvotes: 1
Reputation: 101
First of all, you have to decide what features you want, and then do the pre-processing accordingly. Typical steps are:

1. Remove HTML tags
2. Remove extra whitespace
3. Convert accented characters to ASCII characters
4. Expand contractions
5. Remove special characters
6. Lowercase all text
7. Convert number words to numeric form
8. Remove numbers
9. Remove stopwords
10. Lemmatization

Choose the steps that fit your own data. I suggest looking at the NLTK package for NLP; it also has sentiment-analysis functionality (e.g. VADER) that may help with your work. Then extract your features with TF-IDF or any other feature extraction or feature selection algorithm, scale them if your algorithm needs it, and feed them to a machine-learning model; a rough sketch follows below.
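Here is a rough sketch of a few of these steps plus TF-IDF and a classifier, using NLTK and scikit-learn; the names clean_text, reviews, and labels are placeholders, not anything from the original post.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = text.lower()                    # lowercase all text
    text = re.sub(r"[^a-z\s]", " ", text)  # drop numbers / special characters
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in stop_words]
    return " ".join(tokens)

# Placeholder data: replace with your ~250 tagged review sentences.
reviews = ["The tablets helped a lot!", "No effect at all, very disappointed."]
labels = ["positive", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit([clean_text(r) for r in reviews], labels)
print(model.predict([clean_text("Great results and no side effects")]))
```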
Upvotes: 0