Reputation: 309

Extract specific letters from text using regex and compare with dictionary

I have a list of texts, about 90% of which are in the format AABBB-CCCDDD001. A few texts in the list may instead look like

AABBBICS-CCCDDD001 or 
AABBBIGW-CCCDDD001 or 
AABBBRTL-CCCDDD001 or 
AABBBTDZ-CCCDDD001

These are device names, where

AA - country code
BBB - site code
CCC - Function code
DDD - Sub Function code.

It could be for example: USNYCRTL-LANDCE001

If one of the codes ICS, IGW, RTL or TDZ appears in the text, I want to output its respective number, for which I have created a dictionary:

ENVIRONMENTCODE = {
    'ICS': '1',
    'IGW': '2',
    'RTL': '3',
    'TDZ': '4'
}

NULLCODE = {
    'NULL': '9'
}

So, if the text is:

AABBBICS-CCCDDD001 it should print '1' or 
AABBBIGW-CCCDDD001 it should print '2' or 
AABBBRTL-CCCDDD001 it should print '3' or 
AABBBTDZ-CCCDDD001 it should print '4'

The example above, USNYCRTL-LANDCE001, should print '3', since RTL corresponds to the number '3' in the dictionary.

The 90% of texts that are in the format AABBB-CCCDDD001 should print '9', as they should pair with the key 'NULL'. There may also be a few texts such as AABBBXYZ-CCCDDD001. XYZ should be ignored because it is not in the dictionary, and that text should be marked as '9' as well.

I know regex can be used here, but I'm in the early stages of learning Python and regex seems to be out of reach for me right now. This is what I have tried so far:

def environmentcode(self):
    idx = self.name.find('-')
    if idx > -1:
        if self.name in ENVIRONMENTCODE:
            return ENVIRONMENTCODE
        else:
            return NULLCODE
    else:
        return "Not Found"

It returns the NULLCODE dictionary regardless of whether any of the keys are in the text. Can anyone please help me with this?

Upvotes: 2

Answers (3)

kantal

Reputation: 2407

My proposal:

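def environmentcode(s):
    if "-" not in s:  #(**)
        return None   #(**)
    # Split at the first dash only, so extra dashes cannot break the unpacking.
    h, t = s.split("-", 1)
    # Everything after the 5-letter country + site prefix is the candidate code.
    code = h.strip()[5:]
    # Missing or unknown codes fall back to '9' (a string, like the other values).
    return ENVIRONMENTCODE.get(code, '9')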

data="AABBBICS-CCCDDD001 AABBBIGW-CCCDDD001 AABBBRTL-CCCDDD001 AABBBTDZ-CCCDDD001 USNYCRTL-LANDCE001 AABBB-CCCDDD001 something"

for s in data.split():
    print(s,"-->",environmentcode(s))

Output:
AABBBICS-CCCDDD001 --> 1
AABBBIGW-CCCDDD001 --> 2
AABBBRTL-CCCDDD001 --> 3
AABBBTDZ-CCCDDD001 --> 4
USNYCRTL-LANDCE001 --> 3
AABBB-CCCDDD001 --> 9
something --> None

#---------------------------------------------------------
# Filtering text with regex. In this case, (**) not needed.
text="""AABBBICS-CCCDDD001 Alice was beginning to get very tired of sitting by her sister on the bank... AABBBIGW-CCCDDD001 AABBBRTL-CCCDDD001 AABBBTDZ-CCCDDD001 USNYCRTL-LANDCE001 AABBB-CCCDDD001 AABBBXYZ-CCCDDD001 something"""

import re

data= re.findall(r"\b[A-Z]{5,8}-[A-Z]{6}001\b",text)
for s in data:
    print(s,"-->",environmentcode(s))
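Since the question asked about regex, the extraction itself could also be done with a capture group instead of slicing. The following is only a sketch under assumptions that are not stated in the question: that every valid name is exactly 5 letters, an optional 3-letter code, a dash, 6 letters and 3 digits. The names NAME_RE, NULL_CODE and environmentcode_re are made up for this illustration.

```python
import re

ENVIRONMENTCODE = {'ICS': '1', 'IGW': '2', 'RTL': '3', 'TDZ': '4'}
NULL_CODE = '9'  # hypothetical name for the fallback value

# Assumed layout: 5 letters (country + site), optional 3-letter environment
# code, a dash, 6 letters (function + sub function), then 3 digits.
NAME_RE = re.compile(r"([A-Z]{5})([A-Z]{3})?-[A-Z]{6}\d{3}")

def environmentcode_re(name):
    m = NAME_RE.fullmatch(name)
    if m is None:
        # The name does not follow the assumed layout at all.
        return None
    # group(2) is None when the optional code is absent, so .get() falls
    # back to NULL_CODE for both missing and unknown codes.
    return ENVIRONMENTCODE.get(m.group(2), NULL_CODE)

for s in ['USNYCRTL-LANDCE001', 'AABBB-CCCDDD001', 'AABBBXYZ-CCCDDD001', 'something']:
    print(s, "-->", environmentcode_re(s))
```

With fullmatch, a name with 6 or 7 letters before the dash is rejected as malformed rather than treated as having an unknown code. Whether that is the desired behaviour depends on the real data.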

Upvotes: 0

PM 2Ring

Reputation: 55469

We can use .find to get the code word, if it exists, and then use the dictionary to map the code word to its code number. We can use the dictionary .get method to return the null code for missing or unknown code words. This version returns None if it encounters bad data: a name that doesn't contain '-', or a name that doesn't have either 8 or 5 letters before the '-'.

env_code = {
    'ICS': '1',
    'IGW': '2',
    'RTL': '3',
    'TDZ': '4',
}

null_code = '9'

def get_env_code(name):
    idx = name.find('-')
    if idx == 8:
        # code may be valid
        code = name[idx-3:idx]
    elif idx == 5:
        # code is missing
        code = ''
    else:
        # Bad name
        return None

    return env_code.get(code, null_code)

# test

data = [
    'AABBBICS-CCCDDD001',
    'AABBBIGW-CCCDDD001',
    'AABBBRTL-CCCDDD001',
    'AABBBTDZ-CCCDDD001',
    'USNYCRTL-LANDCE001',
    'AABBBXYZ-CCCDDD001',
    'AABBB-CCCDDD001',
    'BADDATA',
]

for s in data:
    print(s, get_env_code(s))

output

AABBBICS-CCCDDD001 1
AABBBIGW-CCCDDD001 2
AABBBRTL-CCCDDD001 3
AABBBTDZ-CCCDDD001 4
USNYCRTL-LANDCE001 3
AABBBXYZ-CCCDDD001 9
AABBB-CCCDDD001 9
BADDATA None

Here's a simpler version that returns the null code instead of None for bad data.

def get_env_code(name):
    idx = name.find('-')
    code = name[idx-3:idx] if idx == 8 else ''
    return env_code.get(code, null_code)
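To make the difference between the two versions concrete, here is a small self-contained check of the simpler one. It repeats the definitions from above so it can be run on its own. The test names are taken from the data list above.

```python
env_code = {'ICS': '1', 'IGW': '2', 'RTL': '3', 'TDZ': '4'}
null_code = '9'

def get_env_code(name):
    idx = name.find('-')
    # Only a dash at index 8 leaves room for a 3-letter code after the
    # 5-letter prefix; every other case (including no dash, idx == -1)
    # is looked up as '' and therefore maps to null_code.
    code = name[idx-3:idx] if idx == 8 else ''
    return env_code.get(code, null_code)

for s in ['USNYCRTL-LANDCE001', 'AABBB-CCCDDD001', 'BADDATA']:
    print(s, get_env_code(s))
```

This prints 3, 9 and 9. The trade-off is that the caller can no longer tell a malformed name apart from a well-formed name without a code.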

Upvotes: 1

csunday95

Reputation: 1319

If you're just checking whether a key of ENVIRONMENTCODE occurs within each test string, then regex is not necessary. You can just use the Python in operator, e.g.

ENVIRONMENTCODE = {
    'ICS': '1',
    'IGW': '2',
    'RTL': '3',
    'TDZ': '4'
}

NULLCODE = {
    'NULL': '9'
}

def environment_code(test_string, code_dict):
    if '-' not in test_string:
        return 'no dash'
    for code, value in code_dict.items():
        if code in test_string:
            return value
    return NULLCODE['NULL']


to_test = ['AABBBICS-CCCDDD001',
           'AABBBIGW-CCCDDD001',
           'AABBBRTL-CCCDDD001',
           'AABBBTDZ-CCCDDD001']
for test_str in to_test:
    print(environment_code(test_str, ENVIRONMENTCODE))

The problem with your original code was that you were trying to do

test_string in code_dict

which only checks for exact matches between the string under test and the keys within the dictionary.
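The difference between the two kinds of membership test can be seen in a few lines. This is just an illustration. The name 'AABBB-RTLDDD001' is made up to show a possible caveat of the substring approach and does not come from the question.

```python
ENVIRONMENTCODE = {'ICS': '1', 'IGW': '2', 'RTL': '3', 'TDZ': '4'}

name = 'USNYCRTL-LANDCE001'

# "in" on a dict compares the whole string against the keys.
whole_name_is_key = name in ENVIRONMENTCODE
print(whole_name_is_key)      # False

# "in" on a string is a substring search, which finds the code.
code_in_name = 'RTL' in name
print(code_in_name)           # True

# Caveat: the substring search covers the whole name, including the part
# after the dash, so a made-up name like this one would also match.
other = 'AABBB-RTLDDD001'
match_anywhere = 'RTL' in other
print(match_anywhere)         # True

# Searching only the part before the first dash avoids that.
prefix = other.split('-', 1)[0]
match_in_prefix = 'RTL' in prefix
print(match_in_prefix)        # False
```

If function or sub function codes can never contain ICS, IGW, RTL or TDZ, the simple substring test is enough. Otherwise restricting the search to the prefix is the safer choice.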

Upvotes: 0
