Reputation: 75
I'm trying to create a year column with the year taken from the title column in my dataframe. This code works, but the column dtype is object. For example, in row 1 the year displays as [2013].
How can I do this, but change the column dtype to a float?
year_list = []
for i in range(title_length):
    year = re.findall(r'\d{4}', wine['title'][i])
    year_list.append(year)
wine['year'] = year_list
Here is the head of my dataframe:
country  designation   points  province  title                      year
Italy    Vulkà Bianco  87      Sicily    Nicosia 2013 Vulkà Bianco  [2013]
Upvotes: 2
Views: 216
Reputation: 27485
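re.findall returns a list of all matches, which is why each cell holds a list such as [2013]. Use re.search instead, which returns only the first match (note that it returns None, and the [0] lookup then fails, for any title without a four-digit number):
wine['year'] = [re.search(r'\d{4}', title)[0] for title in wine['title']]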
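If some titles might not contain a year, a guarded version of the re.search approach could look like the sketch below. The sample DataFrame and the first_year helper are made up for illustration; they are not from the question.

```python
import re

import pandas as pd

# Hypothetical sample data; the second title has no year in it.
wine = pd.DataFrame(
    {"title": ["Nicosia 2013 Vulkà Bianco", "Some Non-Vintage Red"]}
)


def first_year(title):
    """Return the first 4-digit run in title as a float, or NaN if there is none."""
    match = re.search(r"\d{4}", title)
    return float(match.group()) if match else float("nan")


# Building the column from floats gives a float64 dtype directly.
wine["year"] = [first_year(title) for title in wine["title"]]
```

Returning NaN for missing years keeps the column numeric, at the cost of the years being stored as floats rather than integers.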
Better yet, use the pandas str.extract method. The pattern must contain a capturing group, otherwise pandas raises a ValueError:
wine['year'] = wine['title'].str.extract(r'(\d{4})')
Definition
Series.str.extract(pat, flags=0, expand=True)
For each subject string in the Series, extract groups from the first match of regular expression pat.
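Since the question asks for a float column, here is a minimal sketch of str.extract followed by a dtype conversion. The sample data is invented, and expand=False is my choice so that a Series comes back instead of a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data; the second title has no year in it.
wine = pd.DataFrame(
    {"title": ["Nicosia 2013 Vulkà Bianco", "Some Non-Vintage Red"]}
)

# Extract the first 4-digit run as a string (NaN where nothing matches),
# then convert the whole column to float.
wine["year"] = (
    wine["title"].str.extract(r"(\d{4})", expand=False).astype(float)
)
```

Titles without a match end up as NaN, which is one reason float is a convenient dtype here.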
Upvotes: 2
Reputation: 626738
Instead of re.findall, which returns a list of strings, you may use str.extract():
wine['year'] = wine['title'].str.extract(r'\b(\d{4})\b')
Or, in case you want to only match 1900-2000s years:
wine['year'] = wine['title'].str.extract(r'\b((?:19|20)\d{2})\b')
Note that the pattern in str.extract must contain at least one capturing group; its value is used to populate the new column. Only the first match is considered, so you might have to make the pattern's context more precise later if need be.
I suggest using word boundaries (\b) around the \d{4} pattern to match 4-digit chunks as whole words and avoid partial matches in strings like 1234567890.
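To see what the word boundaries and the 19|20 restriction change, here is a small comparison on made-up titles (none of them come from the asker's data):

```python
import pandas as pd

# Hypothetical titles: a long digit run, a normal vintage, and an old year.
titles = pd.Series(
    ["Lot 1234567890 Red", "Nicosia 2013 Vulkà Bianco", "Estate 1850 Reserve"]
)

# No boundaries: grabs the first four digits of any digit run.
loose = titles.str.extract(r"(\d{4})", expand=False)

# Word boundaries: only standalone 4-digit numbers match.
bounded = titles.str.extract(r"\b(\d{4})\b", expand=False)

# Boundaries plus a 19xx/20xx restriction.
restricted = titles.str.extract(r"\b((?:19|20)\d{2})\b", expand=False)
```

Here loose picks up 1234 from the long number, bounded leaves that row as NaN, and restricted additionally leaves the 1850 row as NaN.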
Upvotes: 2