Reputation: 2439
I have a data frame containing one column with different names. I extract features from these names and store them into a dictionary. Then I want to create a column for each feature and store values for each name. I'm struggling to get my loop right.
My code:
import pandas as pd

data = pd.DataFrame(['Mike', 'Ester', 'Sarah'])
data.columns = ['name']

def get_features(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    return features

for name in data['name']:
    features = get_features(name)
    print(features)
    for f, v in features.items():
        data[f] = v

data.head()
I get:
    name  lastletter  firstletter
0   Mike           h            s
1  Ester           h            s
2  Sarah           h            s
I need:
    name  lastletter  firstletter
0   Mike           e            m
1  Ester           r            e
2  Sarah           h            s
I understand why every row ends up with the values from the last name, but I cannot figure out how to fix it. I could probably create the columns for all features first and then update my data frame, but I hope there is a smarter way. I'd appreciate your help!
EDIT: My feature function is much more complicated than just first/last letter. It contains around 20 different features so I really need to build a dictionary...
def get_features(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features["hythen"] = ("-" in name.lower())
    features["suffix"] = name[-2:].lower()
    features["prefix"] = name[0:2].lower()
    features["length"] = len(name)
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
Upvotes: 2
Views: 1741
Reputation: 294298
New Answer
Change your function to return a pd.Series and call lower only once.
def get_features(name):
    features = {}
    name = name.lower()
    features["firstletter"] = name[0]
    features["lastletter"] = name[-1]
    features["hythen"] = ("-" in name)
    features["suffix"] = name[-2:]
    features["prefix"] = name[0:2]
    features["length"] = len(name)
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.count(letter)
        features["has(%s)" % letter] = (letter in name)
    return pd.Series(features)
Then use apply
data.join(data.name.apply(get_features))
    name  count(a)  count(b)  count(c)  count(d)  count(e)  count(f)  count(g)  count(h)  count(i)  ...  has(v)  has(w)  has(x)  has(y)  has(z)  hythen  lastletter  length  prefix  suffix
0   Mike         0         0         0         0         1         0         0         0         1  ...   False   False   False   False   False   False           e       4      mi      ke
1  Ester         0         0         0         0         2         0         0         0         0  ...   False   False   False   False   False   False           r       5      es      er
2  Sarah         2         0         0         0         0         0         0         1         0  ...   False   False   False   False   False   False           h       5      sa      ah
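To check the apply approach end to end, here is a minimal sketch with a shortened get_features (keeping only three of the features is my simplification, not part of the answer). It relies on the fact that applying a Series-returning function to a column yields a data frame whose columns are the Series keys.

```python
import pandas as pd

data = pd.DataFrame({'name': ['Mike', 'Ester', 'Sarah']})

def get_features(name):
    # Shortened version of the feature function, for illustration only.
    name = name.lower()
    return pd.Series({
        'firstletter': name[0],
        'lastletter': name[-1],
        'length': len(name),
    })

# Applying a Series-returning function to a column gives a DataFrame:
# one row per name, one column per feature key.
features = data['name'].apply(get_features)

# join() aligns on the index, so each row keeps its own feature values.
result = data.join(features)
print(result)
```

Note that join returns a new frame rather than modifying data in place, so the result has to be assigned if you want to keep it.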
Old Answer
data.assign(
    **data.name.str.lower().str.extract(
        '^(?P<firstletter>.).*(?P<lastletter>.)$', expand=True
    )
)
    name  firstletter  lastletter
0   Mike            m           e
1  Ester            e           r
2  Sarah            s           h
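One caveat with the regex, sketched below: the pattern needs at least two characters to match, so a one-letter name comes back as NaN in both columns. The names 'Al' and 'J' are made up for the illustration, and the .str[0] / .str[-1] variant is an alternative I am suggesting, not something from the original answer.

```python
import pandas as pd

# 'Al' and 'J' are hypothetical names added to exercise short inputs.
names = pd.Series(['Mike', 'Ester', 'Al', 'J'], name='name')
lowered = names.str.lower()

# Named groups become the column names of the extracted frame.
by_regex = lowered.str.extract(
    r'^(?P<firstletter>.).*(?P<lastletter>.)$', expand=True
)

# Positional indexing through .str works for one-letter names as well.
by_index = pd.DataFrame({
    'firstletter': lowered.str[0],
    'lastletter': lowered.str[-1],
})

print(by_regex)
print(by_index)
```

If one-letter names cannot occur in your data, the two variants should give the same columns.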
Upvotes: 1
Reputation: 210852
I'd do it this way:
In [107]: data[['first_letter','last_letter']] = \
              data.name.str.lower().str.extract(r'^(.).*(.)$', expand=True)
In [108]: data
Out[108]:
    name  first_letter  last_letter
0   Mike             m            e
1  Ester             e            r
2  Sarah             s            h
UPDATE (df here is the same data frame as data in the question):
In [127]: df.join(pd.DataFrame.from_records(df.apply(lambda x: get_features(x['name']),
                                                     axis=1).values,
                                            index=df.index))
Out[127]:
    name  count(a)  count(b)  count(c)  count(d)  count(e)  count(f)  \
0   Mike         0         0         0         0         1         0
1  Ester         0         0         0         0         2         0
2  Sarah         2         0         0         0         0         0

   count(g)  count(h)  count(i)  ...  has(v)  has(w)  has(x)  has(y)  \
0         0         0         1  ...   False   False   False   False
1         0         0         0  ...   False   False   False   False
2         0         1         0  ...   False   False   False   False

   has(z)  hythen  lastletter  length  prefix  suffix
0   False   False           e       4      mi      ke
1   False   False           r       5      es      er
2   False   False           h       5      sa      ah

[3 rows x 59 columns]
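If get_features keeps returning a plain dict, a possibly simpler variant of the same idea is to build the feature frame from a list of dicts. This is a sketch under my own assumptions (a shortened get_features with three features, and a list comprehension instead of apply), not the code from the answer above.

```python
import pandas as pd

data = pd.DataFrame({'name': ['Mike', 'Ester', 'Sarah']})

def get_features(name):
    # Shortened stand-in for the question's function; returns a plain dict.
    name = name.lower()
    return {
        'firstletter': name[0],
        'lastletter': name[-1],
        'suffix': name[-2:],
    }

# One dict per row; the DataFrame constructor turns the keys into columns.
# Reusing data.index keeps the rows aligned for the join.
feature_frame = pd.DataFrame(
    [get_features(n) for n in data['name']],
    index=data.index,
)

result = data.join(feature_frame)
print(result)
```

Passing index=data.index matters when the original frame does not have a default 0..n-1 index; without it the join would align on the wrong labels.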
Upvotes: 2