Reputation: 3
I'm applying the following function to a dataset containing ~23,000 rows, and it's running very slowly. I imagine this is because of the for loops nested within the function that I've used to strip the punctuation and stopwords. I've been running the line that applies text_process to my dataframe for nearly 15 minutes now, and I'm wondering if there is a smarter way to do this processing.
Open to all suggestions!
Here's my code:
import string
from nltk.corpus import stopwords  # assuming NLTK's stopwords corpus

def text_process(text):
    """
    Takes in string of text, and does following operations:
    1. Removes punctuation + unicode quotations.
    2. Removes stopwords.
    3. Returns a list of cleaned "tokenized" text.
    """
    punctuation = [c for c in string.punctuation] + [u'\u201c',u'\u201d',u'\u2018',u'\u2019']
    nopunc = [char for char in text if char not in punctuation]
    nopunc = ''.join(nopunc)
    return [word.lower() for word in nopunc.split() if word not in
            stopwords.words('english')]

pitchfork['content_clean'] = pitchfork['content'].apply(text_process)
Upvotes: 0
Views: 90
Reputation: 16700
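To stay close to your code: moving some operations outside the function, which is called ~23,000 times, will speed things up. Building the punctuation list and fetching the English stopwords do not need to happen on every call. In your version, stopwords.words('english') sits in the comprehension's condition, so it is evaluated once per word; converting it to a set once also turns each membership test into a hash lookup instead of a list scan: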
punctuation = [c for c in string.punctuation] + [u'\u201c', u'\u201d', u'\u2018', u'\u2019']
stopwords2 = set(stopwords.words('english'))

def text_process(text):
    """
    Takes in string of text, and does following operations:
    1. Removes punctuation + unicode quotations.
    2. Removes stopwords.
    3. Returns a list of cleaned "tokenized" text.
    """
    nopunc = (char for char in text if char not in punctuation)  # changed to a generator
    nopunc2 = ''.join(nopunc)
    return [word.lower() for word in nopunc2.split() if word not in stopwords2]

pitchfork['content_clean'] = pitchfork['content'].apply(text_process)
Further improvements are possible with re.sub (regex replace), though...
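As a rough sketch of what the re.sub route could look like: the pattern is compiled once at module level, and the stopword set is built once. STOPWORDS here is a small hard-coded stand-in so the snippet runs without NLTK data (an assumption; in practice it would be set(stopwords.words('english'))), and PUNCT_RE and text_process_re are names made up for this example. Note that it lowercases before the stopword check, so capitalised stopwords such as "The" are dropped too, which differs slightly from the code above.

```python
import re
import string

# Stand-in stopword set (assumption) so the sketch runs without NLTK data;
# in the real code this would be set(stopwords.words('english')).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "to", "in"}

# Compiled once: a character class of ASCII punctuation plus the four curly quotes.
PUNCT_RE = re.compile(
    "[" + re.escape(string.punctuation) + "\u201c\u201d\u2018\u2019" + "]"
)

def text_process_re(text):
    """Lowercase, strip punctuation with one regex pass, drop stopwords."""
    nopunc = PUNCT_RE.sub("", text.lower())
    return [word for word in nopunc.split() if word not in STOPWORDS]
```

It would be applied the same way as before, e.g. pitchfork['content'].apply(text_process_re).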
Upvotes: 1
Reputation: 1064
I would try using regex to remove the punctuation. If you just want letters, digits, and spaces, for example, you could do something like this:
import re
#...
def clean_text(content):
    """Converts string to lowercase and removes
    any characters that aren't a-z, 0-9 or whitespace"""
    return re.sub(r"[^a-z0-9\s]", "", content.lower())

pitchfork['content_clean'] = pitchfork['content'].apply(clean_text)
Also, you're wasting time by recreating the punctuation list every time the function runs.
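One thing to keep in mind is that clean_text returns a single string and leaves stopwords in, whereas the original text_process returns a list of tokens. A minimal sketch of how the two could be combined follows; STOPWORDS is again a hard-coded stand-in (an assumption) for set(stopwords.words('english')), and NON_ALNUM_RE and clean_tokens are hypothetical names. Be aware that the [^a-z0-9\s] pattern also strips accented letters, which may matter for artist names.

```python
import re

# Stand-in stopword set (assumption); swap in set(stopwords.words('english')).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "to", "in"}

# Compiled once: anything that is not a-z, 0-9 or whitespace.
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")

def clean_tokens(content):
    """Lowercase, keep only a-z, 0-9 and whitespace, then drop stopwords."""
    cleaned = NON_ALNUM_RE.sub("", content.lower())
    return [word for word in cleaned.split() if word not in STOPWORDS]
```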
Upvotes: 1