Isaac Nikolai Fox

Reputation: 3

Speeding up pd.DataFrame.apply() on column of large dataset

I'm applying the following function to a dataset containing ~23,000 rows, and it's running very slowly. I imagine this is because there are loops nested within the function that I've used to strip the punctuation and stopwords. The line that applies text_process to my dataframe has been running for nearly 15 minutes now, and I'm wondering if there is a smarter way for me to do this processing.

Open to all suggestions!

Here's my code:

import string
from nltk.corpus import stopwords


def text_process(text):
    """
    Takes in string of text, and does following operations: 
    1. Removes punctuation + unicode quotations. 
    2. Removes stopwords. 
    3. Returns a list of cleaned "tokenized" text.
    """

    punctuation = [c for c in string.punctuation] + [u'\u201c',u'\u201d',u'\u2018',u'\u2019']

    nopunc = [char for char in text if char not in punctuation]

    nopunc = ''.join(nopunc)

    return [word.lower() for word in nopunc.split() if word not in 
           stopwords.words('english')]

pitchfork['content_clean'] = pitchfork['content'].apply(text_process)

Upvotes: 0

Views: 90

Answers (2)

sophros

Reputation: 16700

Just to stay close to your code: moving some operations outside the function that is called 23,000 times will speed things up, since building the punctuation collection and fetching the English stopwords on every call is unnecessary:

punctuation = set(string.punctuation) | {u'\u201c', u'\u201d', u'\u2018', u'\u2019'}  # a set gives O(1) membership tests

stopwords2 = set(stopwords.words('english'))


def text_process(text):
    """
    Takes in string of text, and does following operations: 
    1. Removes punctuation + unicode quotations. 
    2. Removes stopwords. 
    3. Returns a list of cleaned "tokenized" text.
    """
    nopunc = (char for char in text if char not in punctuation)  # changed to a generator

    nopunc2 = ''.join(nopunc)

    return [word.lower() for word in nopunc2.split() if word not in stopwords2]


pitchfork['content_clean'] = pitchfork['content'].apply(text_process)

There are further improvements possible using re.sub (regex replace), though...
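The re.sub idea hinted at above might look like the following sketch. It keeps the same inputs and outputs as text_process, but replaces the per-character loop with a single compiled pattern; the small hardcoded stopword set is a stand-in assumption so the example is self-contained (in the real code you would keep stopwords2 from NLTK):

```python
import re
import string

# Stand-in for the precomputed NLTK set above, so this sketch runs on its own.
stopwords2 = {"the", "a", "an", "and", "of", "is"}

# One compiled pattern strips ASCII punctuation plus the four curly-quote
# characters in a single pass over the string.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "\u201c\u201d\u2018\u2019]")


def text_process(text):
    nopunc = PUNCT_RE.sub("", text)
    # Same filter-then-lowercase order as the original code.
    return [word.lower() for word in nopunc.split() if word not in stopwords2]
```

Compiling the pattern once at module level matters for the same reason as moving the stopwords out: the work happens once rather than 23,000 times.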

Upvotes: 1

A Poor

Reputation: 1064

I would try using regex to remove the punctuation. If you just want letters, digits, and spaces, for example, you could do something like this:

import re

#...

def clean_text(content):
    """Converts string to lowercase and removes 
    any characters that aren't a-z, 0-9 or whitespace"""
    return re.sub(r"[^a-z0-9\s]", "", content.lower())


pitchfork['content_clean'] = pitchfork['content'].apply(clean_text)

Also, you're wasting time by recreating the punctuation list every time the function runs.
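If you only need the cleaned string (not a token list), the same regex can also go through pandas' vectorized string methods, avoiding the Python-level apply entirely. A sketch, with a tiny hypothetical frame standing in for pitchfork:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real `pitchfork` data.
pitchfork = pd.DataFrame({"content": ["Great Record!", "A so-so EP..."]})

# Lowercase, then strip anything that isn't a-z, 0-9, or whitespace,
# all inside pandas rather than in a per-row Python function.
pitchfork["content_clean"] = (
    pitchfork["content"]
    .str.lower()
    .str.replace(r"[^a-z0-9\s]", "", regex=True)
)
```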

Upvotes: 1
