Ed_in_NY
Ed_in_NY

Reputation: 53

Clean pandas series using regex

I am trying to clean a column called 'historical_rank' in a pandas dataframe. It contains string data. Here is a sample of the content:

       historical_rank
...    ...
122    1908
123    O'   
124 
125    1911  
126    1912  
127    1913 * * * 2010 * * *  
128
129    1914  
130    1915
131  
132
133    1918  
134    (First served 1989 to 1999)
...    ...

The data I want to retain are the four-digit numbers in rows 122, 125, 126, 127, 129, 130, and 133. Elsewhere in the series that number (the historical rank) may be one, two, or three digits. It always begins the string, and there is always a space after it. I want to use regex to keep the desired pattern -- r'\d{1,4}(?=\s)' -- and remove everything else throughout the series. What is the correct code to achieve this? Thank you.

Upvotes: 2

Views: 3351

Answers (5)

Anisa B.
Anisa B.

Reputation: 188

I guess it would be more simple and efficient: df['historical_rank_new'] = df['historical_rank'].str.extract('(\d{4})')

Upvotes: 0

The fourth bird
The fourth bird

Reputation: 163577

As an alternative, you could use str.replace and use a pattern with a capturing group to keep what you want, and match what you want to remove.

  • ^ Start of string
  • ( Capture group 1 (Keep)
    • \d{1,4} Match 1-4 digits
  • ) Close group
  • \s Match a whitespace char
  • | Or
  • .+ Match any char 1+ times

In the replacement, use group 1 r'\1'

^(\d{1,4})\s|.+

Regex demo

For example

df.historical_rank = df.historical_rank.str.replace(r"^(\d{1,4})\s|.+", r'\1')

Upvotes: 3

wwnde
wwnde

Reputation: 26676

IICU

df['historical_rank_new']=df['historical_rank'].str.extract('(^[\d]{1,4})')
df

Upvotes: 2

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627336

You should think of using your regex in a str.extract method keeping in mind that this method requires the regex to contain at least one capturing group.

If you plan to match one, two, three or four digits at the start of the string that are followed by at least one whitespace (just judging by your \d{1,4}(?=\s) pattern) you should try

df['historical_rank_clean'] = df['historical_rank'].str.extract('^(\d{1,4})\s', expand=False).fillna('')

Note the (...) in the pattern, the paretheses form a capturing group and its contents will be used to fill the cells in the new historical_rank_clean column. .fillna('') will populate those entries with no match with an empty string.

Some other regex ideas:

  • r'^(\d{2}(?:\d{2})?)\b' - extract two- or four-digit chunks at the start of the string that are followed with a word boundary
  • r'^((?:20|19)?\d{2})\b' - similar to above, but only allowing years starting with 19 or 20 if these are four-digit years.

See the regex demo

Upvotes: 1

mucio
mucio

Reputation: 7119

To improve @wwnde answer, you could use:

df['historical_rank_new']=df['historical_rank'].str.extract('(^\d{1,4}$)')

Here is on repl.it

Upvotes: 0

Related Questions