Isaac Nikolai Fox

Reputation: 3

Speeding up pd.DataFrame.apply() on column of large dataset

I'm applying the following function to a dataset containing ~23,000 rows, and it's running very slowly. I imagine this is because there are loops nested within the function that I've used to strip the punctuation and stopwords. The line that applies text_process to my dataframe has been running for nearly 15 minutes now, and I'm wondering if there is a smarter way for me to do this processing.

Open to all suggestions!

Here's my code:

import string
from nltk.corpus import stopwords


def text_process(text):
    """
    Takes in string of text, and does following operations: 
    1. Removes punctuation + unicode quotations. 
    2. Removes stopwords. 
    3. Returns a list of cleaned "tokenized" text.
    """

    punctuation = [c for c in string.punctuation] + [u'\u201c',u'\u201d',u'\u2018',u'\u2019']

    nopunc = [char for char in text if char not in punctuation]

    nopunc = ''.join(nopunc)

    return [word.lower() for word in nopunc.split() if word not in 
           stopwords.words('english')]

pitchfork['content_clean'] = pitchfork['content'].apply(text_process)

Upvotes: 0

Views: 90

Answers (2)

sophros

Reputation: 16700

Just to stay close to your code: moving some operations outside the function that is called 23,000 times will speed things up, since building the punctuation collection and fetching the English stopwords on every call is unnecessary:

punctuation = set(string.punctuation) | {u'\u201c', u'\u201d', u'\u2018', u'\u2019'}  # a set gives O(1) membership tests

stopwords2 = set(stopwords.words('english'))


def text_process(text):
    """
    Takes in string of text, and does following operations: 
    1. Removes punctuation + unicode quotations. 
    2. Removes stopwords. 
    3. Returns a list of cleaned "tokenized" text.
    """
    nopunc = (char for char in text if char not in punctuation)  # changed to a generator

    nopunc2 = ''.join(nopunc)

    return [word.lower() for word in nopunc2.split() if word not in stopwords2]


pitchfork['content_clean'] = pitchfork['content'].apply(text_process)

There are further improvements possible using re.sub (regex replace), though...
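The re.sub idea hinted at above might look like the following sketch. It keeps the same inputs and outputs as text_process, but replaces the per-character loop with a single compiled pattern; the small hardcoded stopword set is a stand-in assumption so the example is self-contained (in the real code you would keep stopwords2 from NLTK):

```python
import re
import string

# Stand-in for the precomputed NLTK set above, so this sketch runs on its own.
stopwords2 = {"the", "a", "an", "and", "of", "is"}

# One compiled pattern strips ASCII punctuation plus the four curly-quote
# characters in a single pass over the string.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "\u201c\u201d\u2018\u2019]")


def text_process(text):
    nopunc = PUNCT_RE.sub("", text)
    # Same filter-then-lowercase order as the original code.
    return [word.lower() for word in nopunc.split() if word not in stopwords2]
```

Compiling the pattern once at module level matters for the same reason as moving the stopwords out: the work happens once rather than 23,000 times.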

Upvotes: 1

A Poor

Reputation: 1064

I would try using regex to remove the punctuation. If you just want letters, digits, and spaces, for example, you could do something like this:

import re

#...

def clean_text(content):
    """Converts string to lowercase and removes 
    any characters that aren't a-z, 0-9 or whitespace"""
    return re.sub(r"[^a-z0-9\s]", "", content.lower())


pitchfork['content_clean'] = pitchfork['content'].apply(clean_text)

Also, you're wasting time by recreating the punctuation list every time the function runs.
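If you only need the cleaned string (not a token list), the same regex can also go through pandas' vectorized string methods, avoiding the Python-level apply entirely. A sketch, with a tiny hypothetical frame standing in for pitchfork:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real `pitchfork` data.
pitchfork = pd.DataFrame({"content": ["Great Record!", "A so-so EP..."]})

# Lowercase, then strip anything that isn't a-z, 0-9, or whitespace,
# all inside pandas rather than in a per-row Python function.
pitchfork["content_clean"] = (
    pitchfork["content"]
    .str.lower()
    .str.replace(r"[^a-z0-9\s]", "", regex=True)
)
```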

Upvotes: 1
