Reputation: 77
I am trying to parse the vintage year from wine titles. I seem to be getting around 50% accuracy with the code below and would like to improve on that. Does anybody know what I can do to improve accuracy?
Example titles and their parsed year being returned:
Quinta dos Avidagos 2011 Avidagos Red (Douro) -> 0 incorrect
Rainstorm 2013 Pinot Gris (Willamette Valley) -> 2011 incorrect
Louis M. Martini 2012 Cabernet Sauvignon -> 2012 correct
Mirassou 2012 Chardonnay (Central Coast) -> 2012 correct
Code I am implementing:
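from dateutil.parser import parse
from datetime import datetime, timezone

df = "my pandas dataframe with wine titles"
dt = datetime.now()
dt = dt.replace(tzinfo=timezone.utc)  # replace() returns a new object, so keep the result
year_parse = []
for i in range(len(df['title'])):
    try:
        # fuzzy=True makes dateutil skip tokens it cannot read as part of a date
        ans = parse(df.title[i], fuzzy=True).year
        year_parse.append(int(ans))
    except Exception:
        # parsing failed: record 0 as a placeholder year
        ans = 0
        year_parse.append(int(ans))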
Very grateful for any suggestions!
Upvotes: 3
Views: 82
Reputation: 11228
You can use a regex for this, assuming the wine titles all follow the same pattern.
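import re

exp = re.compile(r'\d{4}')  # any run of four digits
year_parse = list()
for name in df['title']:
    match = exp.search(name)
    # fall back to 0 when a title has no four-digit number
    year_parse.append(int(match.group()) if match else 0)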
year_parse now holds the year from each title.
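If some titles contain other four-digit numbers, a slightly stricter pattern may help. The following is only a sketch under a few assumptions of mine that are not in the question: vintages fall between 1900 and 2099, they appear as a standalone token, and when several candidates exist the last one is the vintage. The helper name `extract_vintage` is hypothetical.

```python
import re

# Assumption: a vintage is a standalone four-digit token between 1900 and 2099.
YEAR_RE = re.compile(r'\b(19\d{2}|20\d{2})\b')

def extract_vintage(title, default=0):
    """Return the last plausible year found in title, or default if none."""
    matches = YEAR_RE.findall(title)
    # Assumption: if several candidates exist, the last one is the vintage.
    return int(matches[-1]) if matches else default

# The example titles from the question.
titles = [
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    "Rainstorm 2013 Pinot Gris (Willamette Valley)",
    "Louis M. Martini 2012 Cabernet Sauvignon",
    "Mirassou 2012 Chardonnay (Central Coast)",
]
year_parse = [extract_vintage(t) for t in titles]
print(year_parse)  # [2011, 2013, 2012, 2012]
```

With a DataFrame you could apply the same function via df['title'].apply(extract_vintage). Whether taking the last match is right depends on your data, so it is worth spot-checking titles that contain more than one four-digit number.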
Upvotes: 4