Reputation: 79
I want to read a column where the first word in each row is the quarter, and year the survey was taken, as well as the name of the survey. Initially I was trying to rename the survey name where I was keeping the quarter and year constant throughout the column but if I ran this script against files from other quarters then the whole line would not be detected and my script would not work.
My example:
Survey Name
0 Q321 Your Voice - Information Tech
1 Q321 Your Voice - Information Tech
2 Q321 Your Voice - Information Tech
3 Q321 Your Voice - Information Tech
4 Q321 Your Voice - Information Tech
9630 Q321 Your Voice - Business Group
9631 Q321 Your Voice - Business Group
(Q321 = Quarter 3, 2021)
What my code converts it into:
Survey Name
0 Q321 YV - IT
1 Q321 YV - IT
2 Q321 YV - IT
3 Q321 YV - IT
4 Q321 YV - IT
9630 Q321 YV - BG
9631 Q321 YV - BG
The code I use:
print(df.loc[:, "Survey.Name"])
'isolate to column of interest and replace commonly incorrect string with the correct output'
df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Information Tech'), 'Survey.Name'] = \
'Q321 YV - IT'
df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Business Group'), 'Survey.Name'] = \
'Q321 YV - BG'
df.loc[df['Survey.Name'].str.contains('Q321 Your Voice - Study Group'), 'Survey.Name'] = \
'Q321 YV - SG'
print(df.loc[:, "Survey.Name"])
But say that I run this script against a file from a different quarter, say Quarter 4, 2021:
Survey Name
0 Q421 Your Voice - Information Tech
1 Q421 Your Voice - Information Tech
2 Q421 Your Voice - Information Tech
3 Q421 Your Voice - Information Tech
4 Q421 Your Voice - Information Tech
9630 Q421 Your Voice - Business Group
9631 Q421 Your Voice - Business Group
I would have to change my script every time a new quarter is used. Is there a way for me to "detect" the first word which luckily happens to be the quarter and year of the survey and include it in the transformed version, whilst also replacing the string that requires to be changed in that column?
Upvotes: 2
Views: 81
Reputation: 61900
One, perhaps over-complicated, approach is to use a regex with capture groups as below:
res = df["Survey Name"].str.replace(r"(Q\d+)\s+(\w)\w+ (\w)\w+ - (\w)\w+ (\w)\w+", r"\1 \2\3 - \4\5", regex=True)
print(res)
Output
0 Q321 YV - IT
1 Q321 YV - IT
2 Q321 YV - IT
3 Q321 YV - IT
4 Q321 YV - IT
5 Q321 YV - BG
6 Q321 YV - BG
Name: Survey Name, dtype: object
Note that the regex pattern captures the first word and the first letter of each of the remaining words.
Another alternative is to use apply, with a replacement function:
def repl(x):
head, tail = x.split("-")
quarter, *chunk = head.split()
head_initials = "".join(c[0] for c in chunk)
tail_initials = "".join(c[0] for c in tail.split())
return f"{quarter} {head_initials} - {tail_initials}"
res = df["Survey Name"].apply(repl)
Output
0 Q321 YV - IT
1 Q321 YV - IT
2 Q321 YV - IT
3 Q321 YV - IT
4 Q321 YV - IT
5 Q321 YV - BG
6 Q321 YV - BG
Name: Survey Name, dtype: object
UPDATE
A more general approach to the replacement part is to do:
replacements = {
"Your Voice - Information Tech": "YV - IT Group",
"Your Voice - Business Group": "YV - BG",
"Your Voice - Human Resources": "YV - LRECS"
}
def repl(match, repls=replacements):
quarter = match.group(1)
key = " ".join(match.group(2).strip().split())
return f"{quarter} {replacements.get(key, '')}"
res = df["Survey Name"].str.replace(r"(Q\d+)\s+(.+)", repl, regex=True)
print(res)
Output
0 Q321 YV - IT Group
1 Q321 YV - IT Group
2 Q321 YV - IT Group
3 Q321 YV - IT Group
4 Q321 YV - IT Group
5 Q321 YV - BG
6 Q321 YV - LRECS
Name: Survey Name, dtype: object
Note that replacements
is a dictionary where the keys are the string expected to be found and the values the corresponding replacements.
The data for the output above was generated from:
{'Survey Name': {0: 'Q321 Your Voice - Information Tech',
1: 'Q321 Your Voice - Information Tech',
2: 'Q321 Your Voice - Information Tech',
3: 'Q321 Your Voice - Information Tech',
4: 'Q321 Your Voice - Information Tech',
5: 'Q321 Your Voice - Business Group',
6: 'Q321 Your Voice - Human Resources'}}
Upvotes: 2
Reputation: 51
not sure if I got everything right but I think you can simply use split
for this:
Example:
t = "9631 Q421 Your Voice - Business Group".split()
print(t)
print(t[1])
# prints ['9631', 'Q421', 'Your', 'Voice', '-', 'Business', 'Group']
# prints 'Q421'
Upvotes: 0