lida

Reputation: 31

Python. Split string using any word from a list of words

I have a list of words.

trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail")

I need to split another string based on any of these words.
So, say, if the names to check are:

I want to modify them to look like this:

That is, split before the first word from the trails list and keep only the part before it.

Thanks!

I should add, my code starts with:

for f in arcpy.da.SearchCursor("firetrail_O_noD_Layer", "FireTrailName", None, None):
    if any(var in str(f[0]) for var in trails):
        new_field = ...  # the part of str(f[0]) before the fire trail word, without anything after it

str(f[0]) refers to the names from the first list. new_field refers to the names in my second list, which I need to create.

Upvotes: 2

Views: 11279

Answers (5)

Jan Vlcinsky

Reputation: 44092

Since the requirements and the solution will probably have to be clarified and tested iteratively, I provide a proposed solution together with a test suite to be run with pytest.

First, create the file test_trails.py:

import pytest


def fix_trails(trails):
    """Sort the list of trails so that the longest phrases are processed
    with the highest priority (come first in the list).

    This is needed if some trail phrases contain other ones.
    """
    trails.sort(key=len, reverse=True)
    return trails


@pytest.fixture
def trails():
    phrases = ["Fire trail", "Firetrail", "Fire Trail",
               "FT", "firetrail", "Trail", "Fire Trails"]
    return fix_trails(phrases)


def remove_trails(line, trails):
    for trail in trails:
        if trail in line:
            res = line.replace(trail, "").strip()
            return res.replace("  ", " ")
    return line


scenarios = [
    ["Poverty Point FT", "Poverty Point"],
    ["Cedar Party Fire Trails", "Cedar Party Fire"],
    ["Mailbox Trail", "Mailbox"],
    ["Carpet Snake Creek Firetrail", "Carpet Snake Creek"],
    ["Pretty Gully firetrail - Roayl NP", "Pretty Gully - Roayl NP"],
]


@pytest.mark.parametrize("scenario", scenarios, ids=lambda itm: itm[0])
def test(scenario, trails):
    line, expected = scenario
    result = remove_trails(line, trails)
    assert result == expected

The file defines the function remove_trails, which removes the unwanted text from the processed lines, and also contains the test case test.

To test it, install pytest:

$ pip install pytest

Then run the test:

$ py.test -sv test_trails.py
========================================= test session starts =========================================
platform linux2 -- Python 2.7.9, pytest-2.8.7, py-1.4.31, pluggy-0.3.1 -- /home/javl/.virtualenvs/stack/bin/python2
cachedir: .cache
rootdir: /home/javl/sandbox/stack, inifile:
collected 5 items

test_trails.py::test[Poverty Point FT] PASSED
test_trails.py::test[Cedar Party Fire Trails] FAILED
test_trails.py::test[Mailbox Trail] PASSED
test_trails.py::test[Carpet Snake Creek Firetrail] PASSED
test_trails.py::test[Pretty Gully firetrail - Roayl NP] PASSED

================ FAILURES ==================
______ test[Cedar Party Fire Trails] _______

scenario = ['Cedar Party Fire Trails', 'Cedar Party Fire']
trails = ['Fire Trails', 'Fire trail', 'Fire Trail', 'Firetrail', 'firetrail', 'Trail', ...]

    @pytest.mark.parametrize("scenario", scenarios, ids=lambda itm: itm[0])
    def test(scenario, trails):
        line, expected = scenario
        result = remove_trails(line, trails)
>       assert result == expected
E       assert 'Cedar Party' == 'Cedar Party Fire'
E         - Cedar Party
E         + Cedar Party Fire
E         ?            +++++

test_trails.py:42: AssertionError
======== 1 failed, 4 passed in 0.01 seconds ============

The py.test command discovers the test case in the file and works out its input arguments: the value of trails is injected from the fixture, and the parametrization of the test case provides the scenario parameter.

You may then fine-tune the remove_trails function and the list of trails until all tests pass.

When you are finished, you may move the remove_trails function (and probably the trails list) to wherever you need it.

You may use this approach to test any of the solutions proposed for your question.
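
Note that remove_trails deletes the phrase but keeps whatever follows it, while the question asks for only the part before the phrase. A minimal sketch of that variant, without regular expressions, could look like the following. The name cut_before_trail is hypothetical, and the substring search is an assumption on my side: a short phrase such as "FT" would also match inside a longer word.

```python
def cut_before_trail(name, trails):
    """Return the part of `name` before the earliest trail phrase.

    Hypothetical helper, not part of the answer above.
    """
    # Collect the position of every phrase that occurs in the name
    # (str.find returns -1 when the phrase is absent).
    hits = [pos for pos in (name.find(t) for t in trails) if pos != -1]
    if not hits:
        return name  # no trail phrase found, leave the name untouched
    # Cut at the earliest hit and drop the space left in front of it.
    return name[:min(hits)].rstrip()


trails = ["Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail"]
for name in ["Poverty Point FT",
             "Pretty Gully firetrail - Roayl NP",
             "Mailbox Trail"]:
    print(cut_before_trail(name, trails))
```

Because the cut happens at the earliest position, the order of the phrases in the list does not matter here. The function can be dropped into test_trails.py and checked with the same kind of scenarios.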

Upvotes: 1

Bharel

Reputation: 26901

I believe this is what you're looking for. If you want it to be case insensitive, you may also pass the re.IGNORECASE flag, like so: res = re.split(regex, s, flags=re.IGNORECASE). It has to be passed by keyword, because the third positional argument of re.split is maxsplit. See re.split() for further documentation.

import re
trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail")

# \b means word boundaries.
regex = r"\b(?:{})\b".format("|".join(trails))

s = """Poverty Point FT
Cedar Party Fire Trails
Mailbox Trail
Carpet Snake Creek Firetrail
Pretty Gully firetrail - Roayl NP"""

res = re.split(regex, s)
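
As a small illustration of the case-insensitive call, here is a sketch on a single name. The sample string with the upper-case "FIRETRAIL" is made up for the demonstration, and wrapping each phrase in re.escape is my addition (it is not needed for these particular phrases, but protects against phrases containing regex metacharacters).

```python
import re

trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail")

# re.escape guards against phrases that contain regex metacharacters.
pattern = r"\b(?:{})\b".format("|".join(re.escape(t) for t in trails))

# flags is passed by keyword: the third positional argument is maxsplit.
parts = re.split(pattern, "Pretty Gully FIRETRAIL - Roayl NP", flags=re.IGNORECASE)
print(parts)  # ['Pretty Gully ', ' - Roayl NP']
```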

UPDATE:

In case you go line by line and don't want the part after the trail word, you can do this:

import re
trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail", "Trail", "Trails")

# \b means word boundaries.
regex = r"\b(?:{}).*".format("|".join(trails))

s = """Poverty Point FT
Cedar Party Fire Trails
Mailbox Trail
Carpet Snake Creek Firetrail
Pretty Gully firetrail - Roayl NP"""

res = [r.strip() for r in re.split(regex, s)]
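
If the names arrive one at a time (for example from a cursor) rather than as one multi-line string, the same pattern can be applied per name. This is only a sketch under that assumption: the names list stands in for whatever source you iterate over, and re.escape is again my addition.

```python
import re

trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail", "Trail", "Trails")
regex = re.compile(r"\b(?:{})".format("|".join(re.escape(t) for t in trails)))

names = [
    "Poverty Point FT",
    "Cedar Party Fire Trails",
    "Mailbox Trail",
    "Carpet Snake Creek Firetrail",
    "Pretty Gully firetrail - Roayl NP",
]

# maxsplit=1 stops at the first trail phrase; [0] keeps only the text before it.
cleaned = [regex.split(name, maxsplit=1)[0].strip() for name in names]
print(cleaned)
# ['Poverty Point', 'Cedar Party', 'Mailbox', 'Carpet Snake Creek', 'Pretty Gully']
```

Splitting each name separately avoids the empty string that splitting the whole block leaves at the end of the result.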

Upvotes: 3

Saleem

Reputation: 8978

Well, here is a more dynamic way to perform the task:

import re

courses = r"""
Poverty Point FT
Cedar Party Fire Trails
Mailbox Trail
Carpet Snake Creek Firetrail
Pretty Gully firetrail - Roayl NP
"""

trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail")

rx_str = '|'.join(trails)
rx_str = r"^.+?(?=(?:{0}|$))".format(rx_str)

rx = re.compile(rx_str, re.IGNORECASE | re.MULTILINE)

for course in rx.finditer(courses):
    print(course.group())

As you can see, I'm converting the list into a regex dynamically, without hardcoding it. The script produces the following result:

Poverty Point 
Cedar Party 
Mailbox Trail
Carpet Snake Creek 
Pretty Gully 
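
The matches above still end with the space that separated the name from the trail word. One possible way to avoid that, sketched here as an assumption rather than as part of the original script, is to allow optional whitespace inside the lookahead, so the lazy match stops one character earlier. The shortened courses string and the use of re.escape are mine.

```python
import re

courses = """
Poverty Point FT
Mailbox Trail
Pretty Gully firetrail - Roayl NP
"""

trails = ("Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail")

# \s* inside the lookahead lets the lazy match stop before the separating space.
alternatives = "|".join(re.escape(t) for t in trails)
rx = re.compile(r"^.+?(?=\s*(?:{0}|$))".format(alternatives),
                re.IGNORECASE | re.MULTILINE)

names = [m.group() for m in rx.finditer(courses)]
print(names)  # ['Poverty Point', 'Mailbox Trail', 'Pretty Gully']
```

Calling .strip() on each match would give the same result for this input with a simpler pattern.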

Upvotes: 1

midori

Reputation: 4837

You can use re.split here:

import re

_string = "Carpet Snake Creek Firetrail"  # example input, replace with your own name
_list = re.split(r'Fire trail|Firetrail|Fire Trail|FT|firetrail', _string)

Upvotes: 1

donkopotamus

Reputation: 23176

You could do this using a regular expression, for example:

import re

def make_matcher(trails):
    # Build one alternation pattern from all trail phrases.
    rgx = re.compile("|".join(trails))
    return lambda txt: rgx.split(txt)[0]

>>> m = make_matcher(["Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail"])
>>> examples = ["Poverty Point FT", "Cedar Party Fire Trails", "Mailbox Trail", "Carpet Snake Creek Firetrail", "Pretty Gully firetrail - Roayl NP"]
>>> for x in examples:
...     print(m(x))
Poverty Point 
Cedar Party 
Mailbox Trail
Carpet Snake Creek 
Pretty Gully 

Note that in this example the trailing space before the occurrence of, e.g., Firetrail is kept. That might not be what you want.
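
If the trailing space is unwanted, one option is to strip it inside the matcher. The sketch below uses a hypothetical name, make_clean_matcher, and makes two further assumptions of mine: the phrases are escaped with re.escape, and they are sorted longest first so that a phrase is never shadowed by a shorter one that is its prefix.

```python
import re


def make_clean_matcher(trails):
    """Hypothetical variant of make_matcher that drops the trailing space."""
    # Longest phrases first, so no alternative is shadowed by a shorter prefix.
    ordered = sorted(trails, key=len, reverse=True)
    rgx = re.compile("|".join(re.escape(t) for t in ordered))
    # maxsplit=1: only the text before the first match is needed.
    return lambda txt: rgx.split(txt, maxsplit=1)[0].rstrip()


m = make_clean_matcher(["Fire trail", "Firetrail", "Fire Trail", "FT", "firetrail"])
print(repr(m("Carpet Snake Creek Firetrail")))  # 'Carpet Snake Creek'
print(repr(m("Mailbox Trail")))                 # 'Mailbox Trail'
```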

Upvotes: 0
