Reputation: 63
So I have a pandas dataframe with a column called 'year', which contains a bunch of years. Column looks like this:
0 1885
1 1828
2 1913
3 1906
4 1963
5 1906
6 1906
7 1903
8 1969
9 1958
10 1695
11 1889
12 1906
13 1884
14 1890
15 1873
16 1908
17 1974
18 1961
19 1963
20 1973
21 2005
22 1970
23 1852
24 1906
I am trying to use match() to find the year that is not in the right format. Dates should have 4 digits. The first character should be a 1 or 2. The second a 0, 7, 8 or 9. The last two characters should be digits. It is supposed to select the year 1695. I am also trying to use a function to cast the year as a string. There is another column called 'name' and I am trying to print the name that is on the same row as the year (index 10). This is my code so far:
y = re.match('^[3-9][1-6]*\d', df['year']).group()
def string(y):
    return str(y)
string(y)
Any help is greatly appreciated. I am new to regular expressions and have been stuck on this for hours. Thank you.
Upvotes: 0
Views: 1094
Reputation: 13417
You can modify your regex here and use the pandas .str.match Series method to apply it:
# bad_date_mask is a boolean Series,
# where True means we have a "bad" date, and False is a "good" date.
# str.match returns True for well-formed years, so invert it with ~
bad_date_mask = ~df["year"].astype(str).str.match(r"^[12][0789]\d\d$")
print(df.loc[bad_date_mask])
year
10 1695
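To also print the name on the matching row, as the question asks, the same kind of mask can select the 'name' column. This is a minimal sketch on a small stand-in frame: the names are made up for illustration, and the layout of the 'name' column is an assumption. Note the `~`: `str.match` returns True for well-formed years, so the mask is inverted to pick out the malformed ones.

```python
import pandas as pd

# Hypothetical sample data; the names are placeholders, not from the question.
df = pd.DataFrame({
    "year": [1885, 1828, 1695, 1906],
    "name": ["a", "b", "c", "d"],
})

# True where the year does NOT fit: 1 or 2, then 0/7/8/9, then two digits.
bad_date_mask = ~df["year"].astype(str).str.match(r"^[12][0789]\d\d$")

# Names on the rows with a malformed year, cast to plain Python strings.
bad_names = df.loc[bad_date_mask, "name"].astype(str).tolist()
print(bad_names)
```

Selecting with `df.loc[mask, "name"]` keeps the row index, so the original position of the bad year stays visible if you print the Series instead of the list.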
regex breakdown:
^ : The string MUST start with what comes after this symbol
[12] : Matches the characters 1 or 2
[0789] : Matches the characters 0, 7, 8, 9
\d{2} : Matches any 2 consecutive digits (equivalent to the \d\d used in the pattern)
$ : The string must end now. Do not match if it continues.
Upvotes: 2
Reputation: 11171
My suggestion would be to 1) break it up and 2) avoid regex unless needed. Let's create a boolean mask for each condition and then find the rows where all conditions are met:
# Cast to string so the .str accessor works (assuming 'year' may be stored as integers)
year = df.year.astype(str)
# Dates should have 4 digits
cond1 = (year.str.len() == 4)
# The first character should be a 1 or 2
cond2 = year.str.get(0).isin(("1","2"))
# The second a 0, 7, 8 or 9
cond3 = year.str.get(1).isin(("0","7","8","9"))
# The last two characters should be digits
cond4 = year.str[-2:].str.isnumeric()
joint_cond = cond1 & cond2 & cond3 & cond4
solution = df[joint_cond]
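The suggested criteria will not return the value 1695, because that row fails the second-character condition. To select the rows that fail the criteria instead, use df[~joint_cond].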
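As a quick check, here is a sketch of the same conditions on a small stand-in frame, with the combined mask inverted to pick out the malformed row. The sample data and the `astype(str)` cast (needed if 'year' is stored as integers) are assumptions, not something the question confirms.

```python
import pandas as pd

# Hypothetical sample data; the names are placeholders.
df = pd.DataFrame({
    "year": [1885, 1695, 2005, 1906],
    "name": ["a", "b", "c", "d"],
})

year = df["year"].astype(str)  # the .str accessor needs strings

cond1 = year.str.len() == 4                         # four characters
cond2 = year.str.get(0).isin(["1", "2"])            # first is 1 or 2
cond3 = year.str.get(1).isin(["0", "7", "8", "9"])  # second is 0, 7, 8 or 9
cond4 = year.str[-2:].str.isnumeric()               # last two are digits
joint_cond = cond1 & cond2 & cond3 & cond4

# Rows that fail at least one condition.
bad_rows = df[~joint_cond]
print(bad_rows)
```

Keeping each condition in its own variable makes it easy to inspect which rule a given row breaks, for example by printing `cond3` on its own.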
Upvotes: 0