Reputation: 327
The original time data is like this:
df['time'][0:4]
2015-07-08
05-11
05-12
2008-07-26
I want every value to include the year, so I tried this:
con_time = []
i=0
for i in df['time']:
    if len(df['time'])==5:
        time = '2018'+'-'+df['time']
        con_time.append(time)
        i +=1
    else:
        con_time.append(df['time'])
        i +=1
Error occurred:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-78-b7d87c72f412> in <module>()
      8     else:
      9         con_time.append(df['time'])
---> 10         i +=1
TypeError: must be str, not int
This error seems strange to me. What I actually want is to build a new list, convert it to an np.array, and concatenate it to the df. Is there a better way to achieve this?
Upvotes: 2
Views: 1148
Reputation: 3223
Since you asked about an alternative approach: instead of an explicit Python loop that fills a list, it is usually better to use the DataFrame methods directly. In your case this would be
df['time'].apply(lambda x: x if len(x) != 5 else '2018-'+x)
This might run faster for some datasets, depending on their size.
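As a minimal, self-contained sketch of the line above, here it is applied to the four sample values from the question. The way the DataFrame is built and the helper name add_year are assumptions made for illustration; they do not appear in the original post.

```python
import pandas as pd

# Sample values copied from the question; constructing the frame this way is an assumption.
df = pd.DataFrame({'time': ['2015-07-08', '05-11', '05-12', '2008-07-26']})

def add_year(value, year='2018'):
    """Prefix 5-character MM-DD strings with a year; leave full dates unchanged."""
    return value if len(value) != 5 else year + '-' + value

# apply() calls add_year once per element and returns a new Series.
df['time'] = df['time'].apply(add_year)
print(df['time'].tolist())
# ['2015-07-08', '2018-05-11', '2018-05-12', '2008-07-26']
```

A named function is used here instead of the lambda only so that the year can be passed as a parameter; the logic is the same.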
EDIT: I ran a timing benchmark on a random toy dataset in which roughly 50% of the dates are complete and 50% are incomplete. In short, for a small dataset the simple for-loop solution is faster, while for a large dataset both methods perform similarly:
# 1M examples
import numpy as np
import pandas as pd
# Random 0/1 labels, mapped to a complete or an incomplete date string
y = pd.Series(np.random.randint(0,2,1000000))
s = {0:'2015-07-08', 1:'05-11'}
y = y.map(s)
%%timeit -n100
_ = y.apply(lambda x: x if len(x) != 5 else '2018-'+x)
>>> 275 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100
con_time = []
for i in y:
    if len(i)==5:
        time = '2018-'+i
        con_time.append(time)
    else:
        con_time.append(i)
con_time_a = np.array(con_time)
>>> 289 ms ± 5.23 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1K examples
import numpy as np
import pandas as pd
# Same setup as above, with 1,000 rows instead of 1,000,000
y = pd.Series(np.random.randint(0,2,1000))
s = {0:'2015-07-08', 1:'05-11'}
y = y.map(s)
%%timeit -n100
_ = y.apply(lambda x: x if len(x) != 5 else '2018-'+x)
>>> 431 µs ± 70.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n100
con_time = []
for i in y:
    if len(i)==5:
        time = '2018-'+i
        con_time.append(time)
    else:
        con_time.append(i)
con_time_a = np.array(con_time)
>>> 289 µs ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
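For completeness, a third variant that avoids both the Python loop and the per-element lambda is sketched below. It uses the pandas string accessor and Series.mask. It was not part of the benchmark above, so nothing is claimed here about its speed; the sample Series and the names is_short and result are assumptions for illustration.

```python
import pandas as pd

# Sample values from the question (assumed here as a plain Series of strings).
s = pd.Series(['2015-07-08', '05-11', '05-12', '2008-07-26'])

# Boolean mask: True where the string has only 5 characters (MM-DD).
is_short = s.str.len() == 5

# mask() replaces the entries where the condition is True with the
# corresponding entries of the second argument and keeps the rest.
result = s.mask(is_short, '2018-' + s)
print(result.tolist())
# ['2015-07-08', '2018-05-11', '2018-05-12', '2008-07-26']
```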
Upvotes: 3
Reputation: 7412
You have two i variables. When you execute i += 1, Python uses the i from for i in df['time'], which is a string, not the i = 0 you defined before the loop, and adding an int to a string raises the TypeError. Give the loop variable a different name; if you don't need the loop variable at all, you can name it _ (underscore).
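The rename suggested above can be sketched as follows. This is one possible corrected version of the loop, not necessarily what the asker ended up with. Besides renaming the loop variable and dropping the unused counter, it tests and appends the individual element rather than the whole df['time'] column, which the loop in the question did not do. The sample data and the column name time_full are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Sample values copied from the question; the frame construction is an assumption.
df = pd.DataFrame({'time': ['2015-07-08', '05-11', '05-12', '2008-07-26']})

con_time = []
for value in df['time']:      # 'value' is one string per row; no separate counter needed
    if len(value) == 5:       # check the element itself, not the whole column
        con_time.append('2018-' + value)
    else:
        con_time.append(value)

# Attach the result as a new column; 'time_full' is a hypothetical name.
df['time_full'] = np.array(con_time)
print(df['time_full'].tolist())
# ['2015-07-08', '2018-05-11', '2018-05-12', '2008-07-26']
```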
Upvotes: 2