Reputation: 94
I am cleaning data that has years in differing formats. There are seven possible values for the year field of my DataFrame: ['2013-14','2014-15','2015-16','2016-17','22017','22018','22019']
. I have resolved the problem by manually dealing with each case, as below:
matchups_df.loc[matchups_df['SEASON_ID'] == '22017', 'SEASON_ID'] = '2017-18'
matchups_df.loc[matchups_df['SEASON_ID'] == '22018', 'SEASON_ID'] = '2018-19'
matchups_df.loc[matchups_df['SEASON_ID'] == '22019', 'SEASON_ID'] = '2019-20'
My question is, why does the code below raise the exception ValueError: invalid literal for int() with base 10: '2016-17'
? I have removed the relevant portion from the np.where and used it on a filtered version of the DataFrame to only handle the necessary values, but it raises the same exception. Clearly, I have made some type of syntax eror in converting the string to int, but I haven't been to diagnose where the error lies.
matchups_df.insert(loc = 1, column = 'Season', value = (
np.where(
(len(matchups_df.SEASON_ID) == 5),
(
(matchups_df.SEASON_ID[1:]) +
"-" +
(str((matchups_df.SEASON_ID[3:].astype(int))+1))
),
matchups_df.SEASON_ID
)
)
)
Upvotes: 2
Views: 57
Reputation: 863751
Here is necessary use str
methods for check length by Series.str.len
and indexing for get all values after first by str[1:]
, also because both Series are processing for convert to numbers is used to_numeric
for avoid errors if no match correct format:
matchups_df = pd.DataFrame({'SEASON_ID':['2013-14','2014-15','2015-16','2016-17',
'22017','22018','22019'],
'col':range(7)})
print (matchups_df)
SEASON_ID col
0 2013-14 0
1 2014-15 1
2 2015-16 2
3 2016-17 3
4 22017 4
5 22018 5
6 22019 6
s = matchups_df.SEASON_ID.astype(str)
s1 = np.where(s.str.len() == 5,
s.str[1:] + "-" + pd.to_numeric(s.str[3:], errors='coerce')
.fillna(0).astype(int).add(1).astype(str),
matchups_df.SEASON_ID)
matchups_df.insert(loc = 1, column = 'Season', value = s1)
print (matchups_df)
SEASON_ID Season col
0 2013-14 2013-14 0
1 2014-15 2014-15 1
2 2015-16 2015-16 2
3 2016-17 2016-17 3
4 22017 2017-18 4
5 22018 2018-19 5
6 22019 2019-20 6
Another solution with custom function:
def f(x):
if len(x) == 5:
return x[1:] + "-" + str(int(x[3:]) + 1)
else:
return x
s1 = matchups_df.SEASON_ID.astype(str).apply(f)
matchups_df.insert(loc = 1, column = 'Season', value = s1)
print (matchups_df)
SEASON_ID Season col
0 2013-14 2013-14 0
1 2014-15 2014-15 1
2 2015-16 2015-16 2
3 2016-17 2016-17 3
4 22017 2017-18 4
5 22018 2018-19 5
6 22019 2019-20 6
Upvotes: 3
Reputation: 61654
The underlying issue here:
matchups_df.SEASON_ID[3:]
matchups_df.SEASON_ID
is the entire column (a Series). Slicing with [3:]
just takes off the first three rows; but instead you wanted to take off the first three characters of each value. Similarly, len(matchups_df.SEASON_ID) == 5
doesn't depend on the cell value (but instead on the length of the column, so all cells (from the fourth onward - so, the fourth and fifth of five) end up being processed - including the ones with strings like 2016-17
.
To get the behaviour you want, the .str
helper is provided, as shown in @jezrael's answer.
Upvotes: 1