spd

Reputation: 354

Classification based on categorical data

I have a dataset:

Inp1    Inp2        Output
A,B,C   AI,UI,JI    Animals
L,M,N   LI,DO,LI    Noun
X,Y     AI,UI       Extras

For these values, I need to apply an ML algorithm. Which algorithm would be best suited to find relations between these groups and assign an output class to each of them?

Upvotes: 1

Views: 297

Answers (3)

prosti

Reputation: 46351

Binary cross-entropy (BCE) is for multi-label classification, where a single example may belong to several classes at once, whereas categorical cross-entropy (CE) is for multi-class classification, where each example belongs to exactly one class. For your task, you need to decide whether a single example always ends up in exactly one class (CE) or may end up in several classes (BCE). Probably the second is true, since an animal can also be a noun. ;)
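To make the difference concrete, here is a minimal sketch of the two losses in plain Python. The predicted probabilities are made-up numbers for illustration, and the class names are taken from the question's Output column; in practice a deep-learning library would compute these losses for you.

```python
import math

def categorical_cross_entropy(probs, target_index):
    # probs: softmax-style scores that sum to 1; exactly one class is correct.
    return -math.log(probs[target_index])

def binary_cross_entropy(probs, targets):
    # probs: one independent sigmoid-style score per label; several targets may be 1.
    terms = [t * math.log(p) + (1 - t) * math.log(1 - p)
             for p, t in zip(probs, targets)]
    return -sum(terms) / len(terms)

classes = ["Animals", "Noun", "Extras"]

# Multi-class case: the example is "Animals" and nothing else.
softmax_out = [0.7, 0.2, 0.1]  # hypothetical model output
ce = categorical_cross_entropy(softmax_out, classes.index("Animals"))

# Multi-label case: the example is both "Animals" and "Noun".
sigmoid_out = [0.9, 0.8, 0.1]  # hypothetical model output
bce = binary_cross_entropy(sigmoid_out, [1, 1, 0])

print(f"CE  = {ce:.4f}")
print(f"BCE = {bce:.4f}")
```

Note that the multi-label target `[1, 1, 0]` cannot be expressed as a single class index, which is why CE does not fit that case.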

Upvotes: 1

Bhanuchander Udhayakumar

Reputation: 1621

Since you mentioned that you are going to apply an ML algorithm (say, classification), I think one-hot encoding (OHE) is what you are looking for.

Requested format:

Inp1     Inp2    Inp3      Output
7,44,87  4,65,2  47,36,20  45

This format can't be used to train your model, because it puts multiple labels in a single cell. You have to pre-process it first, for example with OHE.

Suggested format:

A  B  C  L  M  N  X  Y  AI  DO  JI  LI  UI  Apple  Bat  Dog  Lawn  Moon  Noon  Yemen  Zombie
1  1  1  0  0  0  0  0   1   0   1   0   1      1    1    1     0     0     0      0       0
0  0  0  1  1  1  0  0   0   1   0   1   0      0    0    0     1     1     1      0       0
0  0  0  0  0  0  1  1   1   0   0   0   1      0    0    0     0     0     0      1       1

After that, you can label-encode or one-hot encode the output field, as your model requires.
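As a rough sketch of that pre-processing in plain Python, using only the Inp1 and Inp2 columns shown in the question: the `rows`, `vocabulary`, `X` and `y` names are my own, and the sketch assumes the same token never appears in two different input columns (otherwise prefix each token with its column name).

```python
# Data from the question; each feature cell is a comma-separated string.
rows = [
    {"Inp1": "A,B,C", "Inp2": "AI,UI,JI", "Output": "Animals"},
    {"Inp1": "L,M,N", "Inp2": "LI,DO,LI", "Output": "Noun"},
    {"Inp1": "X,Y", "Inp2": "AI,UI", "Output": "Extras"},
]
feature_columns = ["Inp1", "Inp2"]

def split_cell(cell):
    # "LI,DO,LI" -> {"LI", "DO"}; duplicates collapse into one indicator.
    return {tok.strip() for tok in cell.split(",")}

# One indicator column per distinct token, sorted within each input column.
vocabulary = []
for col in feature_columns:
    seen = set()
    for row in rows:
        seen |= split_cell(row[col])
    vocabulary.extend(sorted(seen))

# Multi-hot feature matrix: 1 if the token occurs anywhere in the row.
X = []
for row in rows:
    present = set()
    for col in feature_columns:
        present |= split_cell(row[col])
    X.append([1 if tok in present else 0 for tok in vocabulary])

# Label-encode the output field as integer class ids.
labels = sorted({row["Output"] for row in rows})
y = [labels.index(row["Output"]) for row in rows]

print(vocabulary)
for features, target in zip(X, y):
    print(features, target)
```

If pandas or scikit-learn is available, `Series.str.get_dummies(sep=",")` or `MultiLabelBinarizer` should give an equivalent encoding with less code.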

Happy learning!

Upvotes: 1

thomas

Reputation: 104

I am assuming that each cell is a list (since you have multiple strings stored in each) and that you are not looking for a specific encoding. Under those assumptions, the following should work. It can also be adjusted to suit different encodings.

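import pandas as pd

# The first entry holds the column names; the remaining entries are data rows.
A = [["Inp1", "Inp2", "Inp3", "Output"],
     [["A", "B", "C"], ["AI", "UI", "JI"], ["Apple", "Bat", "Dog"], ["Animals"]],
     [["L", "M", "N"], ["LI", "DO", "LI"], ["Lawn", "Moon", "Noon"], ["Noun"]]]

dataframe = pd.DataFrame(A[1:], columns=A[0])

def my_encoding(column):
    # DataFrame.apply passes one column at a time by default (axis=0),
    # so each item here is the list of strings stored in one cell.
    encoded_column = []
    for ls in column:
        encoded_ls = []
        for s in ls:
            # Read the UTF-8 bytes of the string as a little-endian integer.
            sbytes = s.encode('utf-8')
            sint = int.from_bytes(sbytes, 'little')
            encoded_ls.append(sint)
        encoded_column.append(encoded_ls)
    return encoded_column

print(dataframe.apply(my_encoding))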

Output:

           Inp1  ...               Output
0  [65, 66, 67]  ...  [32488788024979009]
1  [76, 77, 78]  ...         [1853189966]
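The integers are just the UTF-8 bytes of each string read as a little-endian number, so the encoding can be reversed. A small sketch of the round trip (the `encode_string` and `decode_int` helpers are names I made up for illustration):

```python
def encode_string(s):
    # Same conversion as in my_encoding: UTF-8 bytes -> little-endian integer.
    return int.from_bytes(s.encode("utf-8"), "little")

def decode_int(n):
    # Recover the byte length from the integer's size, then reverse the steps.
    length = (n.bit_length() + 7) // 8
    return n.to_bytes(length, "little").decode("utf-8")

for word in ["A", "Noun", "Animals"]:
    code = encode_string(word)
    print(word, code, decode_int(code))
```

Keep in mind that the size of these integers carries no meaning, so they work as identifiers rather than as numeric features for a model.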

If my assumptions are incorrect or this is not what you're looking for, let me know.

Upvotes: 1
