My initial data is:
Label Data:
        0
1       1
2       1
3       1
4       1
5       1
...    ..
11265  20
11266  20
11267  20
11268  20
11269  20
[11269 rows x 1 columns]
This is what I want:
        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
1       1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2       1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
3       1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
4       1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
5       1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
...    ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
11265   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
11266   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
11267   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
11268   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
11269   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
The way I have attempted it is to loop through all lines of the matrix as follows:
uniqueLabels = labelData[0].unique().tolist()
docNums = range(1, len(labelData) + 1)
labelMatrix = pd.DataFrame(columns=uniqueLabels, index=docNums)
labelMatrix[:] = 0
for n in docNums:
    labelMatrix[labelData[0][n]][n] += 1
print(labelMatrix)
Is there a more "pandasic" way of approaching this that doesn't loop through every row? This works for now, but I actually have millions more rows of data and it takes longer than I would like. Thanks for your help!
SOLUTION: I ended up using the following and it worked great:
labelMatrix = pd.get_dummies(labelData[0])
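For anyone who wants to try this on a small case, here is a minimal sketch. The five-row labelData below is a made-up stand-in for my real data, and fullMatrix is just a name I picked for illustration. Passing dtype=int is my assumption for getting 0/1 values, since newer pandas versions return booleans by default. The reindex step is optional and only matters if some labels never appear in the data.

```python
import pandas as pd

# Hypothetical stand-in for labelData: a single column named 0,
# with document numbers starting at 1 as in the question.
labels = [1, 1, 2, 3, 3]
labelData = pd.DataFrame({0: labels}, index=range(1, len(labels) + 1))

# One indicator column per distinct label, no explicit Python loop.
# dtype=int asks for 0/1 integers instead of True/False.
labelMatrix = pd.get_dummies(labelData[0], dtype=int)
print(labelMatrix)

# Optional: force columns 1..20 even when some labels are absent
# (assumes the labels are the integers 1 through 20).
fullMatrix = labelMatrix.reindex(columns=range(1, 21), fill_value=0)
print(fullMatrix.shape)
```

Each row should contain exactly one 1, in the column matching that row's label, which is the same thing the loop above builds one cell at a time.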