Reputation: 53
I am trying to clean a column called 'historical_rank' in a pandas dataframe. It contains string data. Here is a sample of the content:
historical_rank
... ...
122 1908
123 O'
124
125 1911
126 1912
127 1913 * * * 2010 * * *
128
129 1914
130 1915
131
132
133 1918
134 (First served 1989 to 1999)
... ...
The data I want to retain are the four-digit numbers in rows 122, 125, 126, 127, 129, 130, and 133. Elsewhere in the series that number (the historical rank) may be one, two, or three digits. It always begins the string, and there is always a space after it. I want to use regex to keep the desired pattern -- r'\d{1,4}(?=\s)' -- and remove everything else throughout the series. What is the correct code to achieve this? Thank you.
Upvotes: 2
Views: 3351
Reputation: 188
I guess it would be more simple and efficient:
df['historical_rank_new'] = df['historical_rank'].str.extract('(\d{4})')
Upvotes: 0
Reputation: 163577
As an alternative, you could use str.replace
and use a pattern with a capturing group to keep what you want, and match what you want to remove.
^
Start of string(
Capture group 1 (Keep)
\d{1,4}
Match 1-4 digits )
Close group\s
Match a whitespace char|
Or.+
Match any char 1+ timesIn the replacement, use group 1 r'\1'
^(\d{1,4})\s|.+
For example
df.historical_rank = df.historical_rank.str.replace(r"^(\d{1,4})\s|.+", r'\1')
Upvotes: 3
Reputation: 26676
IICU
df['historical_rank_new']=df['historical_rank'].str.extract('(^[\d]{1,4})')
df
Upvotes: 2
Reputation: 627336
You should think of using your regex in a str.extract
method keeping in mind that this method requires the regex to contain at least one capturing group.
If you plan to match one, two, three or four digits at the start of the string that are followed by at least one whitespace (just judging by your \d{1,4}(?=\s)
pattern) you should try
df['historical_rank_clean'] = df['historical_rank'].str.extract('^(\d{1,4})\s', expand=False).fillna('')
Note the (...)
in the pattern, the paretheses form a capturing group and its contents will be used to fill the cells in the new historical_rank_clean
column. .fillna('')
will populate those entries with no match with an empty string.
Some other regex ideas:
r'^(\d{2}(?:\d{2})?)\b'
- extract two- or four-digit chunks at the start of the string that are followed with a word boundaryr'^((?:20|19)?\d{2})\b'
- similar to above, but only allowing years starting with 19
or 20
if these are four-digit years.See the regex demo
Upvotes: 1