dd.

Reputation: 801

Pandas string tokenization too slow

I have a column in a Pandas DataFrame where each row contains a job description string like 'senior data consultant', and there are approximately 1,000,000 rows. I want to shorten each string to just its first word (which in that example would give 'senior'). The code below does this without error.

def proc_Profession(df):
    # Walk every row and replace the full description with its first word
    for row in range(df['Profession'].size):
        try:
            df['Profession'].iloc[row] = df['Profession'].iloc[row].split(' ')[0]
        except AttributeError:
            # Non-string values (e.g. NaN) have no .split(), so mark them
            df['Profession'].iloc[row] = 'unknown'
    return df

The problem is that this is too slow (it takes several hours). Is there a faster way of doing it?

Upvotes: 2

Views: 288

Answers (1)

dd.

Reputation: 801

As per Henry Yik's suggestion, the following is significantly faster:

def proc_Profession(df):
    # Vectorized: split every string on whitespace and keep the first token
    df['Profession'] = df['Profession'].str.split().str[0]
    return df
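
One difference from the original loop is worth noting: the .str accessor returns NaN for non-string entries instead of raising AttributeError, so the rows the loop marked as 'unknown' come out as NaN here. A minimal sketch of how you might restore that behavior, assuming the non-string values are missing values like NaN:

def proc_Profession(df):
    # Split on whitespace and keep the first token; non-string rows become NaN
    first_word = df['Profession'].str.split().str[0]
    # Fill NaN (the rows the loop caught via AttributeError) with 'unknown'
    df['Profession'] = first_word.fillna('unknown')
    return df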

Upvotes: 1
