Atapalou

Reputation: 61

Sort a duration expressed in minutes or seasons

The goal is to select movies or series that do not last more than 100 minutes. The problem is that the duration is expressed either in minutes or in number of seasons.

Code:

import pandas as pd
import numpy as np

The data comes from Kaggle:

url = 'netflix_titles.csv'
df1 = pd.read_csv(url)
df1.head()

A look at the 'duration' column:

df1['duration'].head(10)
0       90 min
1    2 Seasons
2     1 Season
3     1 Season
4    2 Seasons
5     1 Season
6       91 min
7      125 min
8    9 Seasons
9      104 min
Name: duration, dtype: object

My solution:

df_US['duree'] = df_US['duration'].apply(lambda x: str(x)).apply(lambda x:x.split(' ')).apply(lambda x: x[0])
df_US['duree'] = df_US['duree'].astype('float')
df_US['duree_unit'] = df_US['duration'].apply(lambda x: str(x)).apply(lambda x:x.split(' ')).apply(lambda x: x[-1])
df_US[(df_US['duree_unit'] == 'min') & (df_US['duree'] < 100)].head(3)

I get lots of warnings like:

C:\Users\Atapalou\AppData\Local\Temp\ipykernel_1436\2173588888.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_US['duree'] = df_US['duration'].apply(lambda x: str(x)).apply(lambda x:x.split(' ')).apply(lambda x: x[0]

This solution does not satisfy me; there must be a more elegant one. Any ideas?

Regards, Atapalou

Upvotes: 0

Views: 74

Answers (1)

FObersteiner

Reputation: 25594

Given the example

df
Out[46]: 
    duration
0     90 min
1  2 Seasons
2   1 Season
3   1 Season
4  2 Seasons
5   1 Season
6     91 min
7    125 min
8  9 Seasons
9    104 min

you can parse the column to timedelta and then select (or sort) like this:

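df['duration'] = pd.to_timedelta(
    # strip the 'min' suffix; 'N Season(s)' strings stay non-numeric and become NaN
    pd.to_numeric(
        df['duration'].str.replace('min', ''), 
        errors='coerce'
    ), 
    unit='min'  # 'T' is a deprecated alias for minutes in newer pandas versions
)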

df[df['duration']<=pd.Timedelta(minutes=100)]

         duration
0 0 days 01:30:00
6 0 days 01:31:00

This effectively ignores the seasons: those strings cannot be converted to a numeric value, so pd.to_numeric(df['duration'].str.replace('min', ''), errors='coerce') returns NaN in those cases (which becomes NaT in the timedelta column). Comparisons against NaT evaluate to False, so those rows drop out of the selection.
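If you would rather keep the season counts instead of turning them into NaT, one alternative (not part of the approach above, just a sketch) is to split the string into a number and a unit with a single str.extract call. The column names value and unit and the regular expression are my own choices for illustration, and the sample frame is rebuilt from the output shown in the question.

```python
import pandas as pd

# Sample data copied from the 'duration' output shown in the question.
df = pd.DataFrame({
    "duration": ["90 min", "2 Seasons", "1 Season", "1 Season", "2 Seasons",
                 "1 Season", "91 min", "125 min", "9 Seasons", "104 min"]
})

# Split "<number> <unit>" into two columns in one pass.
# The named groups become the column names of the extracted frame.
parts = df["duration"].str.extract(r"^(?P<value>\d+)\s+(?P<unit>min|Seasons?)$")

# assign() returns a new DataFrame rather than writing into an existing one.
parsed = df.assign(
    value=pd.to_numeric(parts["value"]),
    # Normalise 'Seasons' to 'Season' so there are only two unit labels.
    unit=parts["unit"].str.replace("Seasons", "Season", regex=False),
)

# Titles measured in minutes that last at most 100 minutes.
short_movies = parsed[(parsed["unit"] == "min") & (parsed["value"] <= 100)]
print(short_movies)
```

Regarding the SettingWithCopyWarning: the question does not show how df_US was created, but if it is a filtered subset of df1 (an assumption on my part), taking an explicit .copy() when creating it, or building the new columns with assign() as above, should avoid the warning.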

Upvotes: 1
