pythonscrub1234

Reputation: 63

Using regex to find data in an improper format in a pandas DataFrame?

So I have a pandas DataFrame with a column called 'year', which contains a bunch of years. The column looks like this:

0     1885
1     1828
2     1913
3     1906
4     1963
5     1906
6     1906
7     1903
8     1969
9     1958
10    1695
11    1889
12    1906
13    1884
14    1890
15    1873
16    1908
17    1974
18    1961
19    1963
20    1973
21    2005
22    1970
23    1852
24    1906

I am trying to use match() to find the year that is not in the right format. Dates should have 4 digits. The first character should be a 1 or 2. The second a 0, 7, 8 or 9. The last two characters should be digits. It is supposed to select the year 1695. I am also trying to use a function to cast the year as a string. There is another column called 'name' and I am trying to print the name that is on the same row as the year (index 10). This is my code so far:

y = re.match('^[3-9][1-6]*\d', df['year']).group()
def string(y):
    return str(y)
string(y)

Any help is greatly appreciated. I am new to regular expressions and have been stuck on this for hours. Thank you.

Upvotes: 0

Views: 1094

Answers (2)

Cameron Riddell

Reputation: 13417

You can modify your regex here and use the pandas .str.match Series method to apply it:

# bad_date_mask is a boolean Series,
#   where True means we have a "bad" date, and False is a "good" date
bad_date_mask = ~df["year"].astype(str).str.match(r"^[12][0789]\d\d$")

print(df.loc[bad_date_mask])
    year
10  1695

regex breakdown:

  • ^: The string MUST start with what comes after this symbol
  • [12]: Matches the characters 1 or 2
  • [0789]: Matches the characters 0, 7, 8, 9
  • \d\d: Matches any 2 consecutive digits (equivalent to \d{2})
  • $: The string must end now. Do not match if it continues.
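
Since the question also asks for the name on the same row, here is a minimal sketch of how the mask could be used for that. The sample rows and the 'name' values are made up for illustration, and I'm assuming 'year' holds integers:

```python
import pandas as pd

# Hypothetical sample data; the "name" values are invented for this example.
df = pd.DataFrame({
    "year": [1885, 1828, 1695, 2005],
    "name": ["Ann", "Bob", "Cy", "Dee"],
})

# True where the year matches the expected format.
valid = df["year"].astype(str).str.match(r"^[12][0789]\d{2}$")

# Negate the mask to keep only the malformed rows, then pick the name column.
bad_names = df.loc[~valid, "name"].tolist()
print(bad_names)
```

Note that str.match already anchors at the start of the string, so the ^ is optional there, but the $ is needed to reject values with extra trailing characters.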

Upvotes: 2

anon01

Reputation: 11171

My suggestion would be to 1) break it up and 2) avoid regex unless needed. Let's create a boolean mask for each condition and then find the rows where all conditions are met:

# The .str accessor only works on strings, so cast the column first
year = df["year"].astype(str)

# Dates should have 4 digits
cond1 = (year.str.len() == 4)

# The first character should be a 1 or 2
cond2 = year.str.get(0).isin(("1","2"))

# The second a 0, 7, 8 or 9
cond3 = year.str.get(1).isin(("0","7","8","9"))

# The last two characters should be digits
cond4 = year.str[-2:].str.isnumeric()


joint_cond = cond1 & cond2 & cond3 & cond4
solution = df[joint_cond]

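joint_cond is True for the years that meet the suggested criteria, so solution will not contain the value 1695. To select the years that fail the criteria instead, negate the mask: df[~joint_cond].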
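
For reference, a self-contained sketch of the same idea that selects the malformed rows rather than the valid ones. The sample values are made up, and I'm assuming the column holds integers, hence the cast to str:

```python
import pandas as pd

# Hypothetical sample data: two valid years, one bad second digit, one too short.
df = pd.DataFrame({"year": [1885, 1695, 2005, 885]})
years = df["year"].astype(str)

# True only where every formatting rule holds.
ok = (
    (years.str.len() == 4)
    & years.str.get(0).isin(["1", "2"])
    & years.str.get(1).isin(["0", "7", "8", "9"])
    & years.str[-2:].str.isdigit()
)

# Negate the combined mask to get the rows that break at least one rule.
malformed = df[~ok]
print(malformed)
```

Keeping the conditions separate makes it easy to inspect which rule a given row fails, for example by printing each mask on its own.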

Upvotes: 0
