Matt M
Matt M

Reputation: 309

Conditional Regex Function on Pandas Dataframe

I have the following df and function (see below). I might be over complicating this. A new set of fresh eyes would be deeply appreciated.

df:

Site Name   Plan Unique ID  Atlas Placement ID
Affectv     we11080301      11087207850894
Mashable    we14880202      11087208009031
Alphr       uk10790301      11087208005229
Alphr       uk19350201      11087208005228

The goal is to:

  1. Iter first through df['Plan Unique ID'], search for a specific value (we_match or uk_match), if there is a match

  2. Check that the string value is bigger than a certain value in that group (we12720203 or uk11350200)

  3. If the value is greater than add that we or uk value to a new column df['Consolidated ID'].

  4. If the value is lower or there is no match, then search df['Atlas Placement ID'] with new_id_search

  5. If there is a match, then add that to df['Consolidated ID']

  6. If not, return 0 to df['Consolidated ID]

The current problem is that it returns an empty column.

 def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):

        if type(df['Plan Unique ID']) is str:
            we_match = re.search(we_search, df['Plan Unique ID'])
            if we_match:
                if we_match > "we12720203":
                    return we_match.group(0)
                else:
                    uk_match =  re.search(uk_search, df['Plan Unique ID'])
                    if uk_match:
                        if uk_match > "uk11350200":
                            return uk_match.group(0)
                        else:
                            match_new =  re.search(new_id_search, df['Atlas Placement ID'])
                            if match_new:
                                return match_new.group(0)

                            return 0


    mediaplan_df['Consolidated ID'] = mediaplan_df.apply(placement_extract, axis=1)

Edit: Cleaned the formula

I modified gzl's function in the following way (see below): First see if in df1 there is 14 numbers. If so, add that.

The next step, ideally would be to grab a column MediaPlanUnique from df2 and turn it into a series filtered_placements:

we11080301  
we12880304  
we14880202  
uk19350201  
uk11560205  
uk11560305  

And see if any of the values in filtered_placements are present in df['Plan Unique ID]. If there is a match, then add df['Plan Unique ID] to our end column = df[ConsolidatedID]

The current problem is that it results in all 0. I think it's because the comparison is been done as 1 to 1 (first result of new_match vs first result of filtered_placements) rather than 1 to many (first result of new_match vs all results of filtered_placements)

Any ideas?

def placement_extract(df="mediaplan_df", new_id_search="[a-zA-Z]{2}\d{8}", old_id_search= "(\d{14})"):

    if type(df['PlacementID']) is str:

        old_match =  re.search(old_id_search, df['PlacementID'])
        if old_match:
            return old_match.group(0)

        else:

            if type(df['Plan Unique ID']) is str:
                if type(filtered_placements) is str:


                    new_match = re.search(new_id_search, df['Plan Unique ID'])
                    if new_match:
                        if filtered_placements.str.contains(new_match.group(0)):
                            return new_match.group(0)          


        return 0

mediaplan_df['ConsolidatedID'] = mediaplan_df.apply(placement_extract, axis=1)

Upvotes: 2

Views: 3154

Answers (2)

Phil Sheard
Phil Sheard

Reputation: 2162

I've reorganised the logic and also simpified the regex operations to show another way to approach it. The reorganisation wasn't strictly necessary for the answer but as you asked for another opinion / way of approaching it I thought this might help you in future:

# Inline comments to explain the main changes.
def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
    # Extracted to shorter temp variable
    plan_id = row["Plan Unique ID"]
    # Using parenthesis to get two separate groups - code and numeric
    # Means you can do the match just once
    result = re.match("(we|uk)(.+)",plan_id)
    if result:
        code, numeric = result.groups()
        # We can get away with these simple tests as the earlier regex guarantees
        # that the string starts with either "we" or "uk"
        if code == "we" and plan_id > we_search:
            return_val = plan_id
        elif code == "uk" and plan_id > uk_search:
            return_val = plan_id
        else:
            # It looked like this column was used whatever happened at the
            # end, so there's no need to check against a regex
            #
            # The Atlas Placement is the default option if it either fails
            # the prefix check OR the "greater than" test
            return_val = row["Atlas Placement ID"]
    # A single return statement is often easier to debug
    return return_val

Then using in an apply statement (also look into assign):

$ mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
$ mediaplan_df
>   
Site Name Plan Unique ID Atlas Placement ID Consolidated ID
0   Affectv     we11080301     11087207850894  11087207850894
1  Mashable     we14880202     11087208009031      we14880202
2     Alphr     uk10790301     11087208005229  11087208005229
3     Alphr     uk19350201     11087208005228      uk19350201

Upvotes: 1

gzc
gzc

Reputation: 8629

I would recommend that don't use such complicate nested if statements. As Phil pointed out, each check is mutually-exclusive. Thus you can check 'we' and 'uk' in same indented if statement, then fall back to default process.

def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):

    if type(df['Plan Unique ID']) is str:
        we_match = re.search(we_search, df['Plan Unique ID'])
        if we_match:
            if we_match.group(0) > "we12720203":
                return we_match.group(0)

        uk_match =  re.search(uk_search, df['Plan Unique ID'])
        if uk_match:
            if uk_match.group(0) > "uk11350200":
                return uk_match.group(0)


        match_new =  re.search(new_id_search, df['Atlas Placement ID'])

        if match_new:
            return match_new.group(0)

        return 0

Test:

In [37]: df.apply(placement_extract, axis=1)
Out[37]:
0    11087207850894
1        we14880202
2    11087208005229
3        uk19350201
dtype: object

Upvotes: 1

Related Questions