Math

Reputation: 251

Cleaning text using nltk

I would like to clean text column in a good and efficient way. The dataset is

pos_tweets = [('I loved that car!!', 'positive'),
    ('This view is amazing...', 'positive'),
    ('I feel very, very, great this morning :)', 'positive'),
    ('I am so excited about the concerts', 'positive'),
    ('He is my best friend', 'positive')]

df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
df["tweet"] = df["tweet"].str.lower().str.split()

I am trying to remove stopwords from the tweet column and apply stemming.

I tried as follows:

from nltk.corpus import stopwords
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stop = stopwords.words('english')
df.replace(to_replace='I', value="",regex=True) # what if I had more text columns?
df['cleaned'] = df['tweet'].str.replace('[^\w\s]','')
df['cleaned'] = df['cleaned'].str.replace('\d+', '')

# Use English stemmer
stemmer = SnowballStemmer("english")

df['all_cleaned'] = df['cleaned'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.

However I am getting an error on the line df['cleaned'] = df['cleaned'].str.replace('\d+', ''):

AttributeError: Can only use .str accessor with string values!
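(A minimal sketch reproducing the situation behind this error, assuming a recent pandas: after .str.split() each cell holds a list, so later .str string methods no longer operate on strings; joining the tokens back restores a string column.)

```python
import pandas as pd

# After .str.split(), each cell is a Python list, not a string --
# this is what trips up later .str string operations.
s = pd.Series(["I loved that car!!"]).str.lower().str.split()
print(type(s.iloc[0]))  # list, not str

# Joining the tokens back into one string per row restores a string
# column, so .str methods (and regex replacement) work again.
cleaned = s.str.join(" ").str.replace(r"[^\w\s]", "", regex=True)
print(cleaned.iloc[0])  # "i loved that car"
```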

Expected output would be

                          tweet     class
0                      love car  positive
1                  view amazing  positive
2  feel very very great morning  positive
3       be excite about concert  positive
4                   best friend  positive

Upvotes: 0

Views: 5485

Answers (2)

Rivers

Reputation: 1923

To strictly answer your question about why you get this error:

You have to add .astype(str), and write your patterns as raw strings (r'[^\w\s]').

Working code:

import pandas as pd
pos_tweets = [('I loved that car!!', 'positive'),
    ('This view is amazing...', 'positive'),
    ('I feel very, very, great this morning :)', 'positive'),
    ('I am so excited about the concerts', 'positive'),
    ('He is my best friend', 'positive')]

df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
df["tweet"] = df["tweet"].str.lower().str.split()

df.replace(to_replace='I', value="",regex=True) # what if I had more text columns?
df['cleaned'] = df['tweet'].astype(str).str.replace(r'[^\w\s]', '', regex=True)
df['cleaned'] = df['cleaned'].astype(str).str.replace(r'\d+', '', regex=True)

But it will not replace anything, because there are other problems in your code:

  1. df["tweet"] = df["tweet"].str.lower().str.split() creates lists of strings, not strings, so string replacement will not work on it.
  2. You have to pass regex=True and inplace=True in the other calls to replace.
  3. Some of your patterns do not match any existing substring. For example, you are trying to match "I", but there is no "I", only "i", because you called .str.lower().

So it should be:

(I changed the regex patterns so that you can see them working; just swap in the patterns you actually want.)

import pandas as pd
pos_tweets = [('I loved that car!!', 'positive'),
    ('This view is amazing...', 'positive'),
    ('I feel very, very, great this morning :)', 'positive'),
    ('I am so excited about the concerts', 'positive'),
    ('He is my best friend', 'positive')]

df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
df["tweet"] = df["tweet"].str.lower()

df.replace('i', "0",regex=True, inplace=True)
df['cleaned'] = df['tweet'].astype(str).str.replace(r'0','1')
df['cleaned'].replace(r'\d+', '2', regex=True, inplace=True)
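As a sketch of the cleaning the question is actually after (assuming a pandas version where str.replace needs regex=True for pattern matching): lower-case and regex-clean while the column still holds strings, and only split into tokens afterwards if you need them.

```python
import pandas as pd

df = pd.DataFrame({"tweet": ["I loved that car!!",
                             "I feel very, very, great this morning :)"]})

# Clean while the column is still strings: lower-case, drop punctuation,
# drop digits. regex=True is required on pandas 2.x for pattern matching.
df["cleaned"] = (df["tweet"].str.lower()
                 .str.replace(r"[^\w\s]", "", regex=True)
                 .str.replace(r"\d+", "", regex=True))
print(df["cleaned"].iloc[0])  # "i loved that car"
```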

And for the other questions about stopwords, etc. everything is fine because @Sandeep Panchal provided a complete working code :-). Happy coding!

Upvotes: 2

Sandeep Panchal

Reputation: 385

If you want to remove even NLTK-defined stopwords such as i, this, is, etc., you can use NLTK's stopword list. Refer to the code below and see whether it satisfies your requirements.

import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem.snowball import SnowballStemmer
st = SnowballStemmer('english')

# your define dataframe
pos_tweets = [('I loved that car!!', 'positive'),
('This view is amazing...', 'positive'),
('I feel very, very, great this morning :)', 'positive'),
('I am so excited about the concerts', 'positive'),
('He is my best friend', 'positive')]

df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]

# function to clean data
def clean_data(df, col, clean_col):

    # change to lower and remove spaces on either side
    df[clean_col] = df[col].apply(lambda x: x.lower().strip())

    # remove extra spaces in between
    df[clean_col] = df[clean_col].apply(lambda x: re.sub(' +', ' ', x))

    # remove punctuation
    df[clean_col] = df[clean_col].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))

    # remove stopwords and get the stem
    df[clean_col] = df[clean_col].apply(lambda x: ' '.join(st.stem(text) for text in x.split() if text not in stop_words))

    return df

# calling function
dfr = clean_data(df, 'tweet', 'clean_tweet')

(The original answer shows the resulting dataframe, with the new clean_tweet column, as an image.)
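Since the question also asks "what if I had more text columns?": because clean_data takes the column name as a parameter, it can simply be called once per text column. A self-contained sketch of the same idea (a tiny hand-rolled stopword set stands in for NLTK's, so it runs without downloading corpora; the note column is hypothetical):

```python
import pandas as pd

# Tiny stand-in stopword set; in practice use stopwords.words('english')
STOP = {"i", "he", "is", "my", "that", "this", "am", "so", "the", "about", "very"}

def clean_column(df, col, clean_col):
    # Lower-case, replace everything but letters/spaces, then drop stopwords
    df[clean_col] = (df[col].str.lower()
                     .str.replace(r"[^a-z\s]", " ", regex=True)
                     .apply(lambda x: " ".join(w for w in x.split()
                                               if w not in STOP)))
    return df

df = pd.DataFrame({"tweet": ["I loved that car!!"],
                   "note":  ["He is my best friend"]})

# One call per text column
for col in ["tweet", "note"]:
    df = clean_column(df, col, col + "_clean")
```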

Upvotes: 3
