ShanZhengYang
ShanZhengYang

Reputation: 17631

Comparing multiple strings in a pandas DataFrame column

I have the following pandas DataFrame in Python3.x, with several numeric columns and two columns with strings:

import numpy as np
import pandas as pd

dict = {"numericvals": np.repeat(25, 8), 
    "numeric":np.repeat(42, 8), 
    "first":["beneficiary, duke", "compose", "herd primary", "stall", "deep", "regular summary classify", "timber", "property”], 
    "second": ["abcde”, "abcde”, "abcde”, "abcde”, "abcde”, "abcde”, "abcde”, "abcde”]}

df = pd.DataFrame(dict1)

df = df[['numeric', 'numericvals', 'first', 'second']]

print(df)
   numeric  numericvals                     first second
0       42           25         beneficiary, duke  abcde
1       42           25                   compose  abcde
2       42           25              herd primary  abcde
3       42           25                     stall  abcde
4       42           25                      deep  abcde
5       42           25  regular summary classify  abcde
6       42           25                    timber  abcde
7       42           25                  property  abcde

The column first contains one or more strings. If there are more that one, these are separated by either a space or comma.

My goal is to create a column recording the length of strings in first which are longer or shorter in length than the strings in second. If these are the same size, this case should be ignored.

My idea would be to create two lists:

longer = []
shorter = []

If the string in first is longer, append the string length via len() in longer. If the string is shorter, record the string length in short via len().

Here is how the analysis should look like (in pandas DataFrame format):

   numericvals numeric                   first second  longer  shorter
0          25      42        beneficiary, duke abcde  11       4
1          25      42                  compose abcde  7        0
2          25      42             herd primary abcde  7        4
3          25      42                    stall abcde  0        0
4          25      42                     deep abcde  0        4
5          25      42 regular summary classify abcde  7, 7, 8  0
6          25      42                   timber abcde  6        0
7          25      42                 property abcde  8        0

I don't know how to deal with multiple strings in first, especially if there are 3. How should one do this comparison in pandas?

Upvotes: 2

Views: 1127

Answers (1)

JayFresco
JayFresco

Reputation: 305

You can use pandas.DataFrame.apply (source):

import operator

def transform(df, op):
    lengths = [len(s) for s in df['first'].replace(',', ' ').split()]
    return [f for f in lengths if op(f, len(df.second))] or [0]

df['longer']  = df.apply(transform, axis=1, args=[operator.gt])
df['shorter'] = df.apply(transform, axis=1, args=[operator.lt])

This should work for any amount of strings, assuming any space or comma indicates a new string.

Here is the output:

   numeric  numericvals                     first second     longer shorter
0       42           25         beneficiary, duke  abcde       [11]     [4]
1       42           25                   compose  abcde        [7]     [0]
2       42           25              herd primary  abcde        [7]     [4]
3       42           25                     stall  abcde        [0]     [0]
4       42           25                      deep  abcde        [0]     [4]
5       42           25  regular summary classify  abcde  [7, 7, 8]     [0]
6       42           25                    timber  abcde        [6]     [0]
7       42           25                  property  abcde        [8]     [0]

I tried my best. Hope this helps!

Upvotes: 1

Related Questions