Reputation: 17631
I have the following pandas DataFrame in Python3.x, with several numeric columns and two columns with strings:
import numpy as np
import pandas as pd
dict = {"numericvals": np.repeat(25, 8),
"numeric":np.repeat(42, 8),
"first":["beneficiary, duke", "compose", "herd primary", "stall", "deep", "regular summary classify", "timber", "property”],
"second": ["abcde”, "abcde”, "abcde”, "abcde”, "abcde”, "abcde”, "abcde”, "abcde”]}
df = pd.DataFrame(dict1)
df = df[['numeric', 'numericvals', 'first', 'second']]
print(df)
numeric numericvals first second
0 42 25 beneficiary, duke abcde
1 42 25 compose abcde
2 42 25 herd primary abcde
3 42 25 stall abcde
4 42 25 deep abcde
5 42 25 regular summary classify abcde
6 42 25 timber abcde
7 42 25 property abcde
The column first
contains one or more strings. If there are more that one, these are separated by either a space or comma.
My goal is to create a column recording the length of strings in first
which are longer or shorter in length than the strings in second
. If these are the same size, this case should be ignored.
My idea would be to create two lists:
longer = []
shorter = []
If the string in first
is longer, append the string length via len()
in longer
. If the string is shorter, record the string length in short
via len()
.
Here is how the analysis should look like (in pandas DataFrame format):
numericvals numeric first second longer shorter
0 25 42 beneficiary, duke abcde 11 4
1 25 42 compose abcde 7 0
2 25 42 herd primary abcde 7 4
3 25 42 stall abcde 0 0
4 25 42 deep abcde 0 4
5 25 42 regular summary classify abcde 7, 7, 8 0
6 25 42 timber abcde 6 0
7 25 42 property abcde 8 0
I don't know how to deal with multiple strings in first
, especially if there are 3. How should one do this comparison in pandas?
Upvotes: 2
Views: 1127
Reputation: 305
You can use pandas.DataFrame.apply
(source):
import operator
def transform(df, op):
lengths = [len(s) for s in df['first'].replace(',', ' ').split()]
return [f for f in lengths if op(f, len(df.second))] or [0]
df['longer'] = df.apply(transform, axis=1, args=[operator.gt])
df['shorter'] = df.apply(transform, axis=1, args=[operator.lt])
This should work for any amount of strings, assuming any space or comma indicates a new string.
Here is the output:
numeric numericvals first second longer shorter
0 42 25 beneficiary, duke abcde [11] [4]
1 42 25 compose abcde [7] [0]
2 42 25 herd primary abcde [7] [4]
3 42 25 stall abcde [0] [0]
4 42 25 deep abcde [0] [4]
5 42 25 regular summary classify abcde [7, 7, 8] [0]
6 42 25 timber abcde [6] [0]
7 42 25 property abcde [8] [0]
I tried my best. Hope this helps!
Upvotes: 1