Reputation: 61
The goal is to select movies or series that do not last more than 100 minutes. The problem is that the duration is expressed either in minutes or in number of seasons.
Code:
import pandas as pd
import numpy as np
The data comes from Kaggle:
url = 'netflix_titles.csv'
df1 = pd.read_csv(url)
df1.head()
A look at the 'duration' column:
df1['duration'].head(10)
0 90 min
1 2 Seasons
2 1 Season
3 1 Season
4 2 Seasons
5 1 Season
6 91 min
7 125 min
8 9 Seasons
9 104 min
Name: duration, dtype: object
My solution:
df_US['duree'] = df_US['duration'].apply(lambda x: str(x)).apply(lambda x:x.split(' ')).apply(lambda x: x[0])
df_US['duree'] = df_US['duree'].astype('float')
df_US['duree_unit'] = df_US['duration'].apply(lambda x: str(x)).apply(lambda x:x.split(' ')).apply(lambda x: x[-1])
df_US[(df_US['duree_unit'] == 'min') & (df_US['duree'] < 100)].head(3)
I get lots of warnings like:
C:\Users\Atapalou\AppData\Local\Temp\ipykernel_1436\2173588888.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_US['duree'] = df_US['duration'].apply(lambda x: str(x)).apply(lambda x:x.split(' ')).apply(lambda x: x[0]
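This warning usually means that df_US was itself created by filtering another DataFrame. The question does not show how df_US was built, so the following is only a minimal sketch under that assumption: the 'country' column, the filter, and the inline sample rows are hypothetical stand-ins. Taking an explicit .copy() of the filtered frame makes it independent of df1, which avoids the warning in pandas versions that emit it.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data; the 'country' column and the
# filter below are assumptions, since the question does not show them.
df1 = pd.DataFrame({
    "country": ["United States", "India", "United States", "United States"],
    "duration": ["90 min", "2 Seasons", "125 min", "1 Season"],
})

# .copy() makes df_US an independent DataFrame, so adding columns to it
# is no longer an assignment on a slice of df1.
df_US = df1[df1["country"] == "United States"].copy()

# Split "90 min" into the number and the unit in one vectorized step.
parts = df_US["duration"].str.split(" ", n=1, expand=True)
df_US["duree"] = parts[0].astype("float")
df_US["duree_unit"] = parts[1]

short_movies = df_US[(df_US["duree_unit"] == "min") & (df_US["duree"] < 100)]
print(short_movies)
```

The original df1 is left untouched; only the copy gains the two new columns.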
This solution does not satisfy me; there must be a more elegant one. Any ideas?
Regards, Atapalou
Upvotes: 0
Views: 74
Reputation: 25594
Given the example
df
Out[46]:
duration
0 90 min
1 2 Seasons
2 1 Season
3 1 Season
4 2 Seasons
5 1 Season
6 91 min
7 125 min
8 9 Seasons
9 104 min
you can parse the values to timedelta and then select (or sort) like this:
df['duration'] = pd.to_timedelta(
    pd.to_numeric(
        df['duration'].str.replace('min', ''),
        errors='coerce'
    ),
    unit='min'
)
df[df['duration']<=pd.Timedelta(minutes=100)]
duration
0 0 days 01:30:00
6 0 days 01:31:00
That basically ignores the seasons: those strings cannot be converted to a numeric value, so pd.to_numeric(df['duration'].str.replace('min', ''), errors='coerce')
returns NaN in those cases (which becomes NaT in the timedelta). Comparisons against NaT evaluate to False, so those rows drop out of the selection.
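If the original strings and the unit should be kept rather than overwritten, one alternative (not the timedelta approach above) is to pull the number and the unit apart with Series.str.extract and filter on both. This is only a sketch on the sample data; the column names value and unit and the variables parsed, mask and result are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"duration": [
    "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seasons",
    "1 Season", "91 min", "125 min", "9 Seasons", "104 min",
]})

# Capture the leading number and the trailing word as two named columns.
parsed = df["duration"].str.extract(r"^(?P<value>\d+)\s+(?P<unit>\w+)$")
parsed["value"] = pd.to_numeric(parsed["value"])

# Keep only rows measured in minutes that run at most 100 minutes.
mask = (parsed["unit"] == "min") & (parsed["value"] <= 100)
result = df[mask]
print(result)
```

Because the parsed columns live in a separate frame, df itself is not modified, and the seasons remain available for a different filter if needed.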
Upvotes: 1