Reputation: 115
I'm working with college basketball data. The two fields I have right now are the raw matchup and the predicted winner.
RawMatchup | PredictedWinner |
---|---|
MinnesotaLouisville | Louisville |
I want to use the Predicted Winner to separate out the two teams in the RawMatchup column. Currently I'm using replace to remove the Predicted Winner from the RawMatchup.
RawMatchup.replace(PredictedWinner, '')
>>Minnesota
This works for the vast majority of the rows in my dataset. The problem I'm having is when both schools partially share a name:
RawMatchup | PredictedWinner |
---|---|
GeorgiaGeorgia Tech | Georgia |
North Carolina CentralNorth Carolina | North Carolina |
Using split for these two rows returns just 'Tech' and 'Central' (instead of 'Georgia Tech' and 'North Carolina Central'). How can I best separate the Predicted Winner from the Raw Matchup while preserving the correct school names?
Upvotes: 1
Views: 49
Reputation: 579
I wouldn't use split because IMO it's intended for a different purpose (usually splitting a string on standard separators such as commas or whitespace). In this case, what you want is to remove PredictedWinner from RawMatchup only once. Therefore I'd go for replace and sub to achieve the goal.
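To make the difference concrete, here is a small sketch using the rows from the question. By default str.replace removes every occurrence, its optional count argument limits it to the first one, and re.sub with a `$` anchor removes a match only at the end of the string. The variable names are illustrative only, and re.escape is added on the assumption that a team name could contain regex metacharacters.

```python
import re

raw, winner = "GeorgiaGeorgia Tech", "Georgia"

# Default replace removes every occurrence, which is what truncates the name.
removed_all = raw.replace(winner, "")

# Passing a count of 1 removes only the first (leftmost) occurrence.
removed_first = raw.replace(winner, "", 1)

# An anchored pattern removes the winner only at the end of the string.
raw2, winner2 = "North Carolina CentralNorth Carolina", "North Carolina"
removed_suffix = re.sub(re.escape(winner2) + "$", "", raw2)

print(repr(removed_all))     # ' Tech'
print(repr(removed_first))   # 'Georgia Tech'
print(repr(removed_suffix))  # 'North Carolina Central'
```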
It seems that PredictedWinner is either at the end or at the beginning of RawMatchup. We could take advantage of that to define the following function:
import re

def remove_winner_from_raw(raw_matchup, predicted_winner):
    if raw_matchup.endswith(predicted_winner):
        # Anchored regexp: strip the winner only at the end of the string.
        # re.escape keeps characters such as '.' or '(' in a name literal.
        res = re.sub(f"{re.escape(predicted_winner)}$", '', raw_matchup)
    else:
        res = raw_matchup.replace(predicted_winner, '', 1)  # Just the 1st occurrence
    return res

print(remove_winner_from_raw("North Carolina CentralNorth Carolina", "North Carolina"))
# Output: North Carolina Central
print(remove_winner_from_raw("GeorgiaGeorgia Tech", "Georgia"))
# Output: Georgia Tech
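If a regular expression feels like more than the job needs, the same idea can be sketched with plain slicing. This is only an alternative under the same assumption that the winner sits at one end of the matchup; the function name other_team and the rows list are hypothetical, with the sample values taken from the question. If the winner is at neither end, the matchup comes back unchanged.

```python
def other_team(raw_matchup, predicted_winner):
    """Return the part of raw_matchup that is not predicted_winner."""
    if raw_matchup.endswith(predicted_winner):
        # Winner is the suffix: keep everything before it.
        rest = raw_matchup[:len(raw_matchup) - len(predicted_winner)]
    elif raw_matchup.startswith(predicted_winner):
        # Winner is the prefix: keep everything after it.
        rest = raw_matchup[len(predicted_winner):]
    else:
        # Winner not found at either end: leave the matchup untouched.
        rest = raw_matchup
    return rest.strip()

rows = [
    ("MinnesotaLouisville", "Louisville"),
    ("GeorgiaGeorgia Tech", "Georgia"),
    ("North Carolina CentralNorth Carolina", "North Carolina"),
]
results = [other_team(raw, winner) for raw, winner in rows]
print(results)
# ['Minnesota', 'Georgia Tech', 'North Carolina Central']
```

Checking the suffix first matters for a row like "Georgia TechGeorgia", where the winner also appears as the start of the other team's name.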
Docs for:
str.replace: https://docs.python.org/3/library/stdtypes.html
re.sub: https://docs.python.org/3/library/re.html
Upvotes: 1