Scythor
Scythor

Reputation: 79

How to detect the first word and include it in a string replacement line in python?

I want to read a column where the first word in each row is the quarter, and year the survey was taken, as well as the name of the survey. Initially I was trying to rename the survey name where I was keeping the quarter and year constant throughout the column but if I ran this script against files from other quarters then the whole line would not be detected and my script would not work.

My example:

        Survey Name
0       Q321 Your Voice - Information Tech
1       Q321 Your Voice - Information Tech
2       Q321 Your Voice - Information Tech
3       Q321 Your Voice - Information Tech
4       Q321 Your Voice - Information Tech
                
9630    Q321 Your Voice - Business Group
9631    Q321 Your Voice - Business Group

(Q321 = Quarter 3, 2021)

What my code converts it into:

Survey Name
0       Q321 YV - IT
1       Q321 YV - IT
2       Q321 YV - IT
3       Q321 YV - IT
4       Q321 YV - IT
                
9630    Q321 YV - BG
9631    Q321 YV - BG

The code I use:

print(df.loc[:, "Survey.Name"])

'isolate to column of interest and replace commonly incorrect string with the correct output'

df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Information Tech'), 'Survey.Name'] = \
    'Q321 YV - IT'

df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Business Group'), 'Survey.Name'] = \
    'Q321 YV - BG'

df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Study Group'), 'Survey.Name'] = \
    'Q321 YV - SG'
        
    
print(df.loc[:, "Survey.Name"])

But say that I run this script against a file from a different quarter, say Quarter 4, 2021:

Survey Name
0       Q421 Your Voice - Information Tech
1       Q421 Your Voice - Information Tech
2       Q421 Your Voice - Information Tech
3       Q421 Your Voice - Information Tech
4       Q421 Your Voice - Information Tech

9630    Q421 Your Voice - Business Group
9631    Q421 Your Voice - Business Group

I would have to change my script every time a new quarter is used. Is there a way for me to "detect" the first word which luckily happens to be the quarter and year of the survey and include it in the transformed version, whilst also replacing the string that requires to be changed in that column?

Upvotes: 2

Views: 81

Answers (2)

Dani Mesejo
Dani Mesejo

Reputation: 61900

One, perhaps over-complicated, approach is to use a regex with capture groups as below:

res = df["Survey Name"].str.replace(r"(Q\d+)\s+(\w)\w+ (\w)\w+ - (\w)\w+ (\w)\w+", r"\1 \2\3 - \4\5", regex=True)
print(res)

Output

0    Q321 YV - IT
1    Q321 YV - IT
2    Q321 YV - IT
3    Q321 YV - IT
4    Q321 YV - IT
5    Q321 YV - BG
6    Q321 YV - BG
Name: Survey Name, dtype: object

Note that the regex pattern captures the first word and the first letter of each of the remaining words.

Another alternative is to use apply, with a replacement function:

def repl(x):
    head, tail = x.split("-")
    quarter, *chunk = head.split()

    head_initials = "".join(c[0] for c in chunk)
    tail_initials = "".join(c[0] for c in tail.split())

    return f"{quarter} {head_initials} - {tail_initials}"


res = df["Survey Name"].apply(repl)

Output

0    Q321 YV - IT
1    Q321 YV - IT
2    Q321 YV - IT
3    Q321 YV - IT
4    Q321 YV - IT
5    Q321 YV - BG
6    Q321 YV - BG
Name: Survey Name, dtype: object

UPDATE

A more general approach to the replacement part is to do:

replacements = {
    "Your Voice - Information Tech": "YV - IT Group",
    "Your Voice - Business Group": "YV - BG",
    "Your Voice - Human Resources": "YV - LRECS"
}


def repl(match, repls=replacements):
    quarter = match.group(1)
    key = " ".join(match.group(2).strip().split())

    return f"{quarter} {replacements.get(key, '')}"


res = df["Survey Name"].str.replace(r"(Q\d+)\s+(.+)", repl, regex=True)
print(res)

Output

0    Q321 YV - IT Group
1    Q321 YV - IT Group
2    Q321 YV - IT Group
3    Q321 YV - IT Group
4    Q321 YV - IT Group
5          Q321 YV - BG
6       Q321 YV - LRECS
Name: Survey Name, dtype: object

Note that replacements is a dictionary where the keys are the string expected to be found and the values the corresponding replacements.

The data for the output above was generated from:

{'Survey Name': {0: 'Q321 Your Voice - Information Tech',
                 1: 'Q321 Your Voice - Information Tech',
                 2: 'Q321 Your Voice - Information Tech',
                 3: 'Q321 Your Voice - Information Tech',
                 4: 'Q321 Your Voice - Information Tech',
                 5: 'Q321 Your Voice - Business Group',
                 6: 'Q321 Your Voice - Human Resources'}}

Upvotes: 2

W00dy
W00dy

Reputation: 51

not sure if I got everything right but I think you can simply use split for this:

Example:

t = "9631    Q421 Your Voice - Business Group".split()
print(t)
print(t[1])
# prints ['9631', 'Q421', 'Your', 'Voice', '-', 'Business', 'Group']
# prints 'Q421'

Upvotes: 0

Related Questions