Reputation: 21
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import datetime
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
# Load the Pandas libraries with alias 'pd'
import pandas as pd
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("march20_21.csv")
# Preview the first 5 lines of the loaded data
#drop NA rows
data.dropna()
#drop all columns not needed
droppeddata = data.drop(columns=['created_at'])
#drop NA rows
alldata = droppeddata.dropna()
ukdata = alldata[alldata.place.str.contains('England')]
ukdata.drop(columns=['place'])
ukdata['text'].apply(word_tokenize)
eng_stopwords = stopwords.words('english')
I know there are a lot of redundant variables, but I'm still working on getting it working before going back to refine it.
I am unsure how to remove the stopwords, stored in the eng_stopwords variable, from the tokenised column. Any help is appreciated; I am brand new to Python! Thanks.
Upvotes: 1
Views: 5158
Reputation: 30629
After applying a function to a column you need to assign the result back to the column; it's not an in-place operation. After tokenization, ukdata['text'] holds a list of words, so you can use a list comprehension inside apply to remove the stop words.
ukdata['text'] = ukdata['text'].apply(word_tokenize)
eng_stopwords = stopwords.words('english')
ukdata['text'] = ukdata['text'].apply(lambda words: [word for word in words if word not in eng_stopwords])
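Full runnable example: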
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english')
ukdata = pd.DataFrame({'text': ["This is a sentence."]})
ukdata['text'] = ukdata['text'].apply(word_tokenize)
ukdata['text'] = ukdata['text'].apply(lambda words: [word for word in words if word not in eng_stopwords])
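To check the result you can print the processed column. With the sample sentence above, only 'is' and 'a' should be removed, since the NLTK stopword list is lowercase and the comparison is case-sensitive:

print(ukdata['text'].iloc[0])
# expected: ['This', 'sentence', '.']  ('This' survives because the stopwords are lowercase)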
Upvotes: 2