Reputation: 53
I have such data.
data = [
['2019-01-01', 'a',0],
['2019-01-02', 'b',0],
['2019-01-03', 'c',0],
['2019-01-04', 'd',0],
['2019-01-05', 'a',0],
['2019-01-05', 'd',0],
['2019-01-06', 'd',0],
['2019-01-07', 'f',0],
['2019-01-08', 'c',0],
['2019-01-08', 'b',0],
['2019-01-08', 'g',0],
['2019-01-08', 'h',0],
['2019-01-09', 'q',0],
['2019-01-09', 'b',0],
['2019-01-09', 'y',0],
['2019-01-10', 'd',0],
['2019-01-11', 'z',0],
['2019-01-11', 'x',0],
['2019-01-11', 'c',0],
['2019-01-12', 'y',0],
['2019-01-13', 'x',0],
['2019-01-13', 'q',0],
['2019-01-14', 't',0],
['2019-01-15', 'i',0]]
df = pd.DataFrame(data, columns = ['Date', 'Column1','Column2'])
Date Column1 Column2
0 2019-01-01 a 0
1 2019-01-02 b 0
2 2019-01-03 c 0
3 2019-01-04 d 0
4 2019-01-05 a 0
5 2019-01-05 d 0
6 2019-01-06 d 0
7 2019-01-07 f 0
8 2019-01-08 c 0
9 2019-01-08 b 0
10 2019-01-08 g 0
11 2019-01-08 h 0
12 2019-01-09 q 0
13 2019-01-09 b 0
14 2019-01-09 y 0
15 2019-01-10 d 0
16 2019-01-11 z 0
17 2019-01-11 x 0
18 2019-01-11 c 0
19 2019-01-12 y 0
20 2019-01-13 x 0
21 2019-01-13 q 0
22 2019-01-14 t 0
23 2019-01-15 i 0
My goal is to look at each column1 element and make the value of column2 1 if this element exists in column1 before.
I wrote a code like this.
for i in range(0,len(df)):
for j in range(0,i-1):
if df.Column1[i] == df.Column1[j]:
df.Column2[i] = 1
And I got the result I wanted.
Date Column1 Column2
0 2019-01-01 a 0
1 2019-01-02 b 0
2 2019-01-03 c 0
3 2019-01-04 d 0
4 2019-01-05 a 1
5 2019-01-05 d 1
6 2019-01-06 d 1
7 2019-01-07 f 0
8 2019-01-08 c 1
9 2019-01-08 b 1
10 2019-01-08 g 0
11 2019-01-08 h 0
12 2019-01-09 q 0
13 2019-01-09 b 1
14 2019-01-09 y 0
15 2019-01-10 d 1
16 2019-01-11 z 0
17 2019-01-11 x 0
18 2019-01-11 c 1
19 2019-01-12 y 1
20 2019-01-13 x 1
21 2019-01-13 q 1
22 2019-01-14 t 0
23 2019-01-15 i 0
But when I run this code on 100000 rows of data, it runs very slowly.
Is there a way to do this in a shorter time or are there different solution suggestions for this problem?
Thanks for answers
Upvotes: 5
Views: 182
Reputation: 75080
You can do groupby and cumcount on column1 and then clip the upper limit to 1:
df['Column2'] = df.groupby("Column1").cumcount().clip(upper=1)
However, even more concise would be to check for series.duplicated
here:
df['Column2'] = df['Column1'].duplicated().astype(int)
print(df)
Date Column1 Column2
0 2019-01-01 a 0
1 2019-01-02 b 0
2 2019-01-03 c 0
3 2019-01-04 d 0
4 2019-01-05 a 1
5 2019-01-05 d 1
6 2019-01-06 d 1
7 2019-01-07 f 0
8 2019-01-08 c 1
9 2019-01-08 b 1
10 2019-01-08 g 0
11 2019-01-08 h 0
12 2019-01-09 q 0
13 2019-01-09 b 1
14 2019-01-09 y 0
15 2019-01-10 d 1
16 2019-01-11 z 0
17 2019-01-11 x 0
18 2019-01-11 c 1
19 2019-01-12 y 1
20 2019-01-13 x 1
21 2019-01-13 q 1
22 2019-01-14 t 0
23 2019-01-15 i 0
Upvotes: 8
Reputation: 2159
Other answers utilizing pandas
functions are probably much faster, but here is a solution similar to your original which removes the double iteration.
exists = dict()
for i in range(len(df)):
df.Column2[i] = exists.get(df.Column1[i], 0)
if not df.Column2[i]:
exists[df.Column1[i]] = 1
print(df)
Upvotes: 0
Reputation: 195408
You can use .groupby
on Column1
with "cumcount" transform:
df["Column2"] = (
df.groupby("Column1", sort=False)["Column1"]
.transform("cumcount")
.gt(0)
.astype(int)
)
print(df)
Prints:
Date Column1 Column2
0 2019-01-01 a 0
1 2019-01-02 b 0
2 2019-01-03 c 0
3 2019-01-04 d 0
4 2019-01-05 a 1
5 2019-01-05 d 1
6 2019-01-06 d 1
7 2019-01-07 f 0
8 2019-01-08 c 1
9 2019-01-08 b 1
10 2019-01-08 g 0
11 2019-01-08 h 0
12 2019-01-09 q 0
13 2019-01-09 b 1
14 2019-01-09 y 0
15 2019-01-10 d 1
16 2019-01-11 z 0
17 2019-01-11 x 0
18 2019-01-11 c 1
19 2019-01-12 y 1
20 2019-01-13 x 1
21 2019-01-13 q 1
22 2019-01-14 t 0
23 2019-01-15 i 0
Upvotes: 4