toblKr
toblKr

Reputation: 151

Multiple binary columns to one column

I have a CSV file dataset that contains 21 columns, the first 10 columns are numbers and I don't want to change them. The next 10 columns are binary data and contain only 1 and 0 in it, one "1" and the others are "0", and the last column is the given label.

the example data looks like below

2596,51,3,258,0,510,221,232,148,6279,24(10th column),0,0,0,0,0,1(16th column),0,0,0,0,2(the last column)

Suppose I load the data into a matrix, can I keep the first 10 columns and the last column unchanged, and convert the middle 10 columns into one column? After transformation, I want the column value to be based on the index of the "1" in the row, like the row above, the wanted result is

2596,51,3,258,0,510,221,232,148,6279,24,6(it's 6 because the "1" is on 6th column of the binary data),2 #12 columns in total

Can I achieve this using NumPy, scikit-learn or something else?

Upvotes: 2

Views: 1407

Answers (4)

Daniel F
Daniel F

Reputation: 14399

This should do it if it is loaded into a numpy array

out = np.c_[in[:, :11], np.where(in[:, 11:-1])[1] + 1, in[:, -1]]

Upvotes: 2

tuomastik
tuomastik

Reputation: 4916

from io import StringIO

import pandas as pd

csv = StringIO("2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2"
               "\n1,2,3,4,5,6,7,8,9,10,11,0,0,0,0,1,0,0,0,0,0,1")

df = pd.read_csv(csv, header=None)

df = pd.concat(objs=[df[df.columns[:11]],
                     df[df.columns[11:-1]].idxmax(axis=1) - 10,
                     df[df.columns[-1]]], axis=1)

print(df)

Output:

     0   1   2    3   4    5    6    7    8     9   10  0   21
0  2596  51   3  258   0  510  221  232  148  6279  24   6   2
1     1   2   3    4   5    6    7    8    9    10  11   5   1

Upvotes: 1

Allen Qin
Allen Qin

Reputation: 19957

Setup

df = pd.DataFrame({0: {2596: 51},
 1: {2596: 3},
 2: {2596: 258},
 3: {2596: 0},
 4: {2596: 510},
 5: {2596: 221},
 6: {2596: 232},
 7: {2596: 148},
 8: {2596: 6279},
 9: {2596: 24},
 10: {2596: 0},
 11: {2596: 0},
 12: {2596: 0},
 13: {2596: 0},
 14: {2596: 0},
 15: {2596: 1},
 16: {2596: 0},
 17: {2596: 0},
 18: {2596: 0},
 19: {2596: 0},
 20: {2596: 2}})

Solution

#find the index of the column with value 1 within the 10 columns
df.iloc[:,10] = np.argmax(df.iloc[:,10:20].values,axis=1)+1

#select the first 10 columns, the position column and the label column
df.iloc[:,list(range(11))+[20]]

Out[2167]: 
      0   1    2   3    4    5    6    7     8   9   10  20
2596  51   3  258   0  510  221  232  148  6279  24   6   2

Upvotes: 0

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210932

Data:

In [135]: df
Out[135]:
     0   1   2    3   4    5    6    7    8     9  ...  12  13  14  15  16  17  18  19  20  21
0  2596  51   3  258   0  510  221  232  148  6279 ...   0   0   0   0   1   0   0   0   0   2
1  2596  51   3  258   0  510  221  232  148  6279 ...   0   0   0   0   0   0   0   0   1   2

[2 rows x 22 columns]

Solution:

df = pd.read_csv('/path/to/file.csv', header=None)

In [137]: df.iloc[:, :11] \
            .join(df.iloc[:, 11:21].dot(range(1,11)).to_frame(11)) \
            .join(df.iloc[:, -1])
Out[137]:
     0   1   2    3   4    5    6    7    8     9   10  11  21
0  2596  51   3  258   0  510  221  232  148  6279  24   6   2
1  2596  51   3  258   0  510  221  232  148  6279  24  10   2

Upvotes: 0

Related Questions