Reputation: 251
I would like to clean a text column in a good and efficient way. The dataset is:
pos_tweets = [('I loved that car!!', 'positive'),
('This view is amazing...', 'positive'),
('I feel very, very, great this morning :)', 'positive'),
('I am so excited about the concerts', 'positive'),
('He is my best friend', 'positive')]
df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
df["tweet"] = df["tweet"].str.lower().str.split()
I am trying to remove stopwords from the tweet column and apply stemming.
I tried as follows:
from nltk.corpus import stopwords
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
stop = stopwords.words('English')
df.replace(to_replace='I', value="",regex=True) # what if I had more text columns?
df['cleaned'] = df['tweet'].str.replace('[^\w\s]','')
df['cleaned'] = df['cleaned'].str.replace('\d+', '')
# Use English stemmer
stemmer = SnowballStemmer("English")
df['all_cleaned'] = df['cleaned'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.
However I am getting an error:
---> 21 df['cleaned'] = df['cleaned'].str.replace('\d+', '')
AttributeError: Can only use .str accessor with string values!
Expected output would be
tweet class
0 love car positive
1 view amazing positive
2 feel very very great morning positive
3 be excite about concert positive
4 best friend positive
Upvotes: 0
Views: 5485
Reputation: 1923
To strictly answer your question about why you get this error:
You have to add .astype(str), and write your patterns as raw strings (e.g. r'[^\w\s]').
Working code:
import pandas as pd
pos_tweets = [('I loved that car!!', 'positive'),
('This view is amazing...', 'positive'),
('I feel very, very, great this morning :)', 'positive'),
('I am so excited about the concerts', 'positive'),
('He is my best friend', 'positive')]
df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
df["tweet"] = df["tweet"].str.lower().str.split()
df.replace(to_replace='I', value="",regex=True) # what if I had more text columns?
df['cleaned'] = df['tweet'].astype(str).str.replace(r'[^\w\s]', '', regex=True)
df['cleaned'] = df['cleaned'].astype(str).str.replace(r'\d+', '', regex=True)
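To see why astype(str) alone does not get you back to clean text once the column holds lists, here is a small illustrative check (my own sketch, not from the original answer): astype(str) stores each list's repr, it does not rejoin the words.

```python
import pandas as pd

# After .str.split() each cell holds a Python list; astype(str) then
# stores the list's repr, not the original sentence.
s = pd.Series(["i loved that car"]).str.split()
print(s.astype(str).iloc[0])  # "['i', 'loved', 'that', 'car']"
```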
But it will not actually replace anything, because there are other problems in your code:
df["tweet"] = df["tweet"].str.lower().str.split()
creates lists of strings, not strings, so replace will not work; keep .str.lower() and drop .str.split(). You also need regex=True and inplace=True in the other calls to replace.
So it should be:
(I changed the regex patterns so that you can see them working; just swap them for the ones you actually want.)
import pandas as pd
pos_tweets = [('I loved that car!!', 'positive'),
('This view is amazing...', 'positive'),
('I feel very, very, great this morning :)', 'positive'),
('I am so excited about the concerts', 'positive'),
('He is my best friend', 'positive')]
df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
df["tweet"] = df["tweet"].str.lower()
df.replace('i', "0",regex=True, inplace=True)
df['cleaned'] = df['tweet'].astype(str).str.replace(r'0', '1', regex=True)
df['cleaned'].replace(r'\d+', '2', regex=True, inplace=True)
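A quick comparison (my own illustrative sketch, not part of the answer) of what regex=True changes in Series.replace: without it, replace only matches when the whole cell value equals the pattern.

```python
import pandas as pd

s = pd.Series(["i loved that car", "this view is amazing"])

# Without regex=True, Series.replace matches whole cell values only,
# so nothing is replaced here.
print(s.replace("i", "X").tolist())

# With regex=True the pattern is applied inside each string.
print(s.replace("i", "X", regex=True).tolist())
```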
As for the other questions about stopwords etc., everything is covered, because @Sandeep Panchal provided complete working code :-). Happy coding!
Upvotes: 2
Reputation: 385
If you want to remove the NLTK-defined stopwords such as 'i', 'this', 'is', etc., you can use NLTK's stopwords list. Refer to the code below and see whether it satisfies your requirements.
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem.snowball import SnowballStemmer
st = SnowballStemmer('english')
# your defined dataframe
pos_tweets = [('I loved that car!!', 'positive'),
('This view is amazing...', 'positive'),
('I feel very, very, great this morning :)', 'positive'),
('I am so excited about the concerts', 'positive'),
('He is my best friend', 'positive')]
df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]
# function to clean data
def clean_data(df, col, clean_col):
    # change to lower and remove spaces on either side
    df[clean_col] = df[col].apply(lambda x: x.lower().strip())
    # remove extra spaces in between
    df[clean_col] = df[clean_col].apply(lambda x: re.sub(' +', ' ', x))
    # remove punctuation
    df[clean_col] = df[clean_col].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))
    # remove stopwords and get the stem
    df[clean_col] = df[clean_col].apply(lambda x: ' '.join(st.stem(text) for text in x.split() if text not in stop_words))
    return df
# calling function
dfr = clean_data(df, 'tweet', 'clean_tweet')
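To trace what the function does without needing the NLTK downloads, here is the same sequence of steps on a single string (my own sketch): the stopword set below is a tiny hand-written stand-in for NLTK's English list, and the stemming step is omitted.

```python
import re

# Tiny stand-in stopword set (assumption; NLTK's english list is much larger).
stop_words = {"i", "this", "is", "that", "am", "so", "the", "he", "my"}

text = "I loved that car!!"
text = text.lower().strip()             # lowercase, trim outer spaces
text = re.sub(" +", " ", text)          # collapse repeated spaces
text = re.sub("[^a-zA-Z]", " ", text)   # replace punctuation/digits with spaces
tokens = [t for t in text.split() if t not in stop_words]
print(tokens)  # ['loved', 'car']
```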
Upvotes: 3