Reputation: 7255
Here's my data:
No  Body
1   DaTa Analytics 2
2   StackOver 67
Here's my expected output:
No  Body              Uppercase  Lowercase
1   DaTa Analytics 2  3          10
2   StackOver 67      2          7
Upvotes: 4
Views: 4312
Reputation: 862921
Use str.findall to extract the uppercase and lowercase letters, then str.len to count them:
df['Uppercase'] = df['Body'].str.findall(r'[A-Z]').str.len()
df['Lowercase'] = df['Body'].str.findall(r'[a-z]').str.len()
Another solution, using str.count to count the regex matches directly:
df['Uppercase'] = df['Body'].str.count(r'[A-Z]')
df['Lowercase'] = df['Body'].str.count(r'[a-z]')
print(df)
   No            Body  Uppercase  Lowercase
0   1  DaTa Analytics          3         10
1   2       StackOver          2          7
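For anyone who wants to run both approaches side by side, here is a minimal self-contained sketch. The frame is rebuilt by hand from the output above, and the names `upper_findall`, `upper_count`, and `lower_count` are only illustrative.

```python
import pandas as pd

# Sample frame rebuilt by hand to match the output shown above
df = pd.DataFrame({"No": [1, 2], "Body": ["DaTa Analytics", "StackOver"]})

# Approach 1: collect every match into a list, then take the list length
upper_findall = df["Body"].str.findall(r"[A-Z]").str.len()

# Approach 2: count the matches directly, with no intermediate lists
upper_count = df["Body"].str.count(r"[A-Z]")
lower_count = df["Body"].str.count(r"[a-z]")

df["Uppercase"] = upper_count
df["Lowercase"] = lower_count
print(df)
```

Note that the character classes [A-Z] and [a-z] match ASCII letters only, so accented or non-Latin letters are not counted by either approach.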
Upvotes: 6
Reputation: 402673
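Here's a solution that compares raw character codes in NumPy and runs roughly ten times faster in the timings below. It counts correctly only for ASCII text, because each character is split into separate bytes:
import numpy as np
# fixed-width Unicode array, padded to the longest string
v = df.Body.values.astype(str)
# reinterpret as bytes (4 per character), one row per string
v = v.view(np.uint8).reshape(len(df), -1)
df['Uppercase'] = ((v >= 65) & (v <= 90)).sum(1)   # 'A'..'Z'
df['Lowercase'] = ((v >= 97) & (v <= 122)).sum(1)  # 'a'..'z'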
df
   No            Body  Uppercase  Lowercase
0   1  DaTa Analytics          3         10
1   2       StackOver          2          7
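One caveat with the byte view in this answer: NumPy stores each character of a fixed-width string as a 4-byte code point, so viewing it as uint8 can make the low byte of a non-ASCII character look like an ASCII letter. The sketch below is one possible variant, not part of the original answer: it views the same data as uint32 (one integer per character) and compares the two on a non-ASCII string. The helper name `count_ascii_case` and the third sample row are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

def count_ascii_case(bodies, unit):
    """Count A-Z and a-z per string, viewing the text as `unit`-sized integers."""
    # Fixed-width Unicode array: NumPy pads each string to the longest one
    # and stores every character as a 4-byte code point.
    chars = np.array(list(bodies), dtype=str)
    codes = chars.view(unit).reshape(len(chars), -1)
    upper = ((codes >= 65) & (codes <= 90)).sum(axis=1)    # 'A'..'Z'
    lower = ((codes >= 97) & (codes <= 122)).sum(axis=1)   # 'a'..'z'
    return upper, lower

# The third row is the Polish city name "Lodz" written with its diacritics:
# U+0141, U+00F3, 'd', U+017A (hypothetical test string, not from the question).
df = pd.DataFrame({"Body": ["DaTa Analytics", "StackOver", "\u0141\u00f3d\u017a"]})

# One uint32 per character: only genuine ASCII letters fall in the ranges.
df["Uppercase"], df["Lowercase"] = count_ascii_case(df["Body"], np.uint32)

# One uint8 per byte: the low bytes of U+0141 and U+017A are 0x41 and 0x7A,
# which are indistinguishable from 'A' and 'z'.
upper8, lower8 = count_ascii_case(df["Body"], np.uint8)

print(df)
print(upper8, lower8)
```

For pure ASCII input both views give the same counts, so the choice only matters when the column may contain other characters. Either way, only ASCII letters are counted, as with the regex solutions.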
Timings
df = pd.concat([df] * 100000, ignore_index=True)
# @jezrael1 (str.findall + str.len)
%%timeit
df['Uppercase'] = df['Body'].str.findall(r'[A-Z]').str.len()
df['Lowercase'] = df['Body'].str.findall(r'[a-z]').str.len()
979 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @jezrael2 (list comprehension with str.isupper / str.islower)
%%timeit
df['Uppercase'] = [sum(1 for c in x if c.isupper()) for x in df['Body']]
df['Lowercase'] = [sum(1 for c in x if c.islower()) for x in df['Body']]
1.11 s ± 130 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# this answer (NumPy byte view)
%%timeit
v = df.Body.values.astype(str)
v = v.view(np.uint8).reshape(len(df), -1)
df['Uppercase'] = ((v >= 65) & (v <= 90)).sum(1)
df['Lowercase'] = ((v >= 97) & (v <= 122)).sum(1)
91.8 ms ± 315 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
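%%timeit only works inside IPython or Jupyter. To repeat the comparison from a plain script, something like the following sketch with the standard timeit module should do. The function names are made up for this example, the frame is smaller than the one used above, and only the uppercase count is timed, so the numbers will not match the figures quoted in this answer.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"No": [1, 2], "Body": ["DaTa Analytics", "StackOver"]})
# 20,000 rows here; the timings above used [df] * 100000
big = pd.concat([base] * 10_000, ignore_index=True)

def regex_count(frame):
    # pandas string method with an ASCII character class
    return frame["Body"].str.count(r"[A-Z]").to_numpy()

def python_loop(frame):
    # pure-Python loop over characters
    return np.array([sum(1 for c in s if c.isupper()) for s in frame["Body"]])

def numpy_view(frame):
    # one uint32 code point per character, one row per string
    chars = np.array(frame["Body"].tolist(), dtype=str)
    codes = chars.view(np.uint32).reshape(len(frame), -1)
    return ((codes >= 65) & (codes <= 90)).sum(axis=1)

for fn in (regex_count, python_loop, numpy_view):
    # best of three single runs, reported in milliseconds
    best = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1e3:.1f} ms")
```

Results depend on the machine, the pandas and NumPy versions, and the string lengths, so treat any single run as a rough indication only.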
Upvotes: 2