Craig Bing

Reputation: 319

apply function to each word of every row in pandas dataframe column

I have a sample dataframe as follows:

df = pd.DataFrame({
'notes': pd.Series(['This speling is incorrect', 'Corrector should correct korrecter']), 
'name': pd.Series(['Walter White', 'Walter White']), 
})

  name                notes
0  Walter White     This speling is incorrect
1  Walter White     Corrector should correct korrecter

I want to adapt the spell checker by Peter Norvig, available here, and then apply it to every row by going over each word in the row. How can this be done in pandas?

I would like the output as:

    name                notes
0  Walter White     This spelling is incorrect
1  Walter White     Corrector should correct corrector 

Appreciate any inputs. Thanks!

Upvotes: 1

Views: 1948

Answers (2)

jezrael

Reputation: 862791

You can try this solution with str.split, but I think performance on a large DataFrame could be a problem:

import pandas as pd
import numpy as np

df = pd.DataFrame({
'notes': pd.Series(['This speling is incorrect', 'Corrector should correct korrecter one']), 
'name': pd.Series(['Walter White', 'Walter White']), 
})
print(df)
           name                                   notes
0  Walter White               This speling is incorrect
1  Walter White  Corrector should correct korrecter one    

# simulate the correct function
def correct(x):
    return x + '888'

# split the notes column into one word per column, then apply correct to each column
df1 = df.notes.str.split(expand=True).apply(correct)
print(df1)
              0           1           2             3       4
0       This888  speling888       is888  incorrect888     NaN
1  Corrector888   should888  correct888  korrecter888  one888

# replace NaN with empty strings and concatenate all words back together
df['notes'] = df1.fillna('').apply(lambda row: ' '.join(row), axis=1)
print(df)
           name                                              notes
0  Walter White             This888 speling888 is888 incorrect888 
1  Walter White  Corrector888 should888 correct888 korrecter888...
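
If the real correct function works on a single word rather than on a whole column, a possible alternative is to split and rejoin inside one apply call, which avoids the wide intermediate DataFrame and the NaN padding. This is only a sketch: correct is the same placeholder as above, and correct_sentence is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Walter White', 'Walter White'],
    'notes': ['This speling is incorrect',
              'Corrector should correct korrecter one'],
})

# Placeholder word-level function (assumption: swap in the real corrector here).
def correct(word):
    return word + '888'

# Hypothetical helper: split a sentence on whitespace, fix each word, rejoin.
def correct_sentence(sentence):
    return ' '.join(correct(word) for word in sentence.split())

# Series.apply calls correct_sentence once per row of the notes column.
df['notes'] = df['notes'].apply(correct_sentence)
print(df['notes'].tolist())
```

Because the words are rejoined per row, shorter rows do not pick up a trailing space.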

Upvotes: 1

jhoepken

Reputation: 1858

I used the code from the link you posted to make it work. Treat this as a starting point.

import re, collections
import pandas as pd

# This code comes from the link you have posted
def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

# This is your code
df = pd.DataFrame({
'notes': pd.Series(['speling', 'korrecter']), 
'name': pd.Series(['Walter White', 'Walter White']), 
})

# Spellchecking can be optimized, of course, and should not be hardcoded
for i, row in df.iterrows():
    df.at[i, 'notes'] = correct(row['notes'])
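
The loop above corrects one word per row. For notes that hold whole sentences, one possible approach is to run the corrector over every alphabetic word with re.sub. The sketch below is not the full Norvig code: correct is a tiny stub backed by a hypothetical _FIXES dictionary so the example runs without big.txt, and correct_text is an illustrative helper name.

```python
import re

# Hypothetical stand-in for Norvig's correct(); the real one needs big.txt.
_FIXES = {'speling': 'spelling', 'korrecter': 'corrector'}

def correct(word):
    return _FIXES.get(word, word)

def correct_text(text, correct_word=correct):
    """Correct every alphabetic word in text, keeping spacing and punctuation."""
    def fix(match):
        word = match.group(0)
        # The corrector works on lowercase words.
        fixed = correct_word(word.lower())
        # Restore an initial capital if the original word had one.
        if word[0].isupper():
            fixed = fixed[0].upper() + fixed[1:]
        return fixed
    return re.sub(r'[A-Za-z]+', fix, text)

notes = ['This speling is incorrect', 'Corrector should correct korrecter']
corrected = [correct_text(n) for n in notes]
print(corrected)
```

With a DataFrame, the same helper could be used as df['notes'] = df['notes'].apply(correct_text). Note that this sketch only restores the first capital letter, so all-caps words would come back with just the initial in upper case.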

Upvotes: 0
