Math

Reputation: 251

Cleaning text using nltk

I would like to clean text column in a good and efficient way. The dataset is

pos_tweets = [('I loved that car!!', 'positive'),
    ('This view is amazing...', 'positive'),
    ('I feel very, very, great this morning :)', 'positive'),
    ('I am so excited about the concerts', 'positive'),
    ('He is my best friend', 'positive')]

df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
df["tweet"] = df["tweet"].str.lower().str.split()

I am trying to remove stopwords from the tweet column and apply stemming.

I tried as follows:

from nltk.corpus import stopwords
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stop = stopwords.words('english')
df.replace(to_replace='I', value="",regex=True) # what if I had more text columns?
df['cleaned'] = df['tweet'].str.replace('[^\w\s]','')
df['cleaned'] = df['cleaned'].str.replace('\d+', '')

# Use English stemmer
stemmer = SnowballStemmer("english")

df['all_cleaned'] = df['cleaned'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.

However I am getting an error on the line df['cleaned'] = df['cleaned'].str.replace('\d+', ''):

AttributeError: Can only use .str accessor with string values!
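(A minimal sketch reproducing the situation behind this error, assuming a recent pandas: after .str.split() each cell holds a list, so later .str string methods no longer operate on strings; joining the tokens back restores a string column.)

```python
import pandas as pd

# After .str.split(), each cell is a Python list, not a string --
# this is what trips up later .str string operations.
s = pd.Series(["I loved that car!!"]).str.lower().str.split()
print(type(s.iloc[0]))  # list, not str

# Joining the tokens back into one string per row restores a string
# column, so .str methods (and regex replacement) work again.
cleaned = s.str.join(" ").str.replace(r"[^\w\s]", "", regex=True)
print(cleaned.iloc[0])  # "i loved that car"
```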

Expected output would be

                          tweet     class
0                      love car  positive
1                  view amazing  positive
2  feel very very great morning  positive
3       be excite about concert  positive
4                   best friend  positive

Upvotes: 0

Views: 5485

Answers (2)

Rivers

Reputation: 1923

To strictly answer your question about why you get this error:

You have to add .astype(str), and write your patterns as raw strings (r'[^\w\s]').

Working code:

import pandas as pd
pos_tweets = [('I loved that car!!', 'positive'),
    ('This view is amazing...', 'positive'),
    ('I feel very, very, great this morning :)', 'positive'),
    ('I am so excited about the concerts', 'positive'),
    ('He is my best friend', 'positive')]

df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
df["tweet"] = df["tweet"].str.lower().str.split()

df.replace(to_replace='I', value="",regex=True) # what if I had more text columns?
df['cleaned'] = df['tweet'].astype(str).str.replace(r'[^\w\s]', '', regex=True)
df['cleaned'] = df['cleaned'].astype(str).str.replace(r'\d+', '', regex=True)

But it will not replace anything, because there are other problems in your code:

  1. df["tweet"] = df["tweet"].str.lower().str.split() creates lists of strings, not strings, so string replacement will not work on it.
  2. You have to pass regex=True and inplace=True in the other calls to replace.
  3. Some of your patterns do not match any existing substring. For example, you are trying to match "I", but there is no "I", only "i", because you called .str.lower().

So it should be:

(I changed the regex patterns so that you can see them working; just swap in the patterns you actually want.)

import pandas as pd
pos_tweets = [('I loved that car!!', 'positive'),
    ('This view is amazing...', 'positive'),
    ('I feel very, very, great this morning :)', 'positive'),
    ('I am so excited about the concerts', 'positive'),
    ('He is my best friend', 'positive')]

df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
df["tweet"] = df["tweet"].str.lower()

df.replace('i', "0",regex=True, inplace=True)
df['cleaned'] = df['tweet'].astype(str).str.replace(r'0','1')
df['cleaned'].replace(r'\d+', '2', regex=True, inplace=True)
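As a sketch of the cleaning the question is actually after (assuming a pandas version where str.replace needs regex=True for pattern matching): lower-case and regex-clean while the column still holds strings, and only split into tokens afterwards if you need them.

```python
import pandas as pd

df = pd.DataFrame({"tweet": ["I loved that car!!",
                             "I feel very, very, great this morning :)"]})

# Clean while the column is still strings: lower-case, drop punctuation,
# drop digits. regex=True is required on pandas 2.x for pattern matching.
df["cleaned"] = (df["tweet"].str.lower()
                 .str.replace(r"[^\w\s]", "", regex=True)
                 .str.replace(r"\d+", "", regex=True))
print(df["cleaned"].iloc[0])  # "i loved that car"
```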

And for the other questions about stopwords, etc. everything is fine because @Sandeep Panchal provided a complete working code :-). Happy coding!

Upvotes: 2

Sandeep Panchal

Reputation: 385

If you want to remove even NLTK-defined stopwords such as i, this, is, etc., you can use NLTK's stopword list. Refer to the code below and see whether it satisfies your requirements.

import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem.snowball import SnowballStemmer
st = SnowballStemmer('english')

# your define dataframe
pos_tweets = [('I loved that car!!', 'positive'),
('This view is amazing...', 'positive'),
('I feel very, very, great this morning :)', 'positive'),
('I am so excited about the concerts', 'positive'),
('He is my best friend', 'positive')]

df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]

# function to clean data
def clean_data(df, col, clean_col):

    # change to lower and remove spaces on either side
    df[clean_col] = df[col].apply(lambda x: x.lower().strip())

    # remove extra spaces in between
    df[clean_col] = df[clean_col].apply(lambda x: re.sub(' +', ' ', x))

    # remove punctuation
    df[clean_col] = df[clean_col].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))

    # remove stopwords and get the stem
    df[clean_col] = df[clean_col].apply(lambda x: ' '.join(st.stem(text) for text in x.split() if text not in stop_words))

    return df

# calling function
dfr = clean_data(df, 'tweet', 'clean_tweet')

(The original answer shows the resulting dataframe, with the new clean_tweet column, as an image.)
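Since the question also asks "what if I had more text columns?": because clean_data takes the column name as a parameter, it can simply be called once per text column. A self-contained sketch of the same idea (a tiny hand-rolled stopword set stands in for NLTK's, so it runs without downloading corpora; the note column is hypothetical):

```python
import pandas as pd

# Tiny stand-in stopword set; in practice use stopwords.words('english')
STOP = {"i", "he", "is", "my", "that", "this", "am", "so", "the", "about", "very"}

def clean_column(df, col, clean_col):
    # Lower-case, replace everything but letters/spaces, then drop stopwords
    df[clean_col] = (df[col].str.lower()
                     .str.replace(r"[^a-z\s]", " ", regex=True)
                     .apply(lambda x: " ".join(w for w in x.split()
                                               if w not in STOP)))
    return df

df = pd.DataFrame({"tweet": ["I loved that car!!"],
                   "note":  ["He is my best friend"]})

# One call per text column
for col in ["tweet", "note"]:
    df = clean_column(df, col, col + "_clean")
```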

Upvotes: 3
