Amit Kumar

Reputation: 2745

How to convert variable length string to vector?

I am working on a classification algorithm, and I receive string codes of different lengths that follow some pattern.

| Column 1   | Column 2    | Column 3     |
|:-----------|------------:|:------------:|
| MN009      | JIK9PO      | LEFTu        |
| MN010      | JIK9POS     | LEFTu        |
| MN011      | JIK9POKI    | LEFTu        |
| MN012      | KIJU        | LEFTu        |
| MN013      | RANDOM      | LEFTu        |
| MN014      | FT          | LEFTu        |

For columns 1 and 3, the feature set can be a vector of length 5.

But I do not know how to create a feature set that can accommodate column 2 as well.

Considerations:

  1. Create a feature vector whose size equals the length of the longest string value, and pad shorter strings with some filler.
  2. Truncate strings to a fixed length, such as 5 here, and ignore the extra characters.
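
The two considerations above can be sketched in plain Python. This is only an illustration under assumptions: the character-to-index vocabulary, the `PAD_ID` filler value, and the helper names `build_vocab`, `encode`, and `dropped_fraction` are hypothetical and not part of the original question. The last helper reports the share of characters that truncation would discard, as a rough measure of what option 2 gives up.

```python
PAD_ID = 0  # index reserved for the filler value (an assumption)

def build_vocab(strings):
    """Map every character seen in the data to an integer index >= 1."""
    chars = sorted({ch for s in strings for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(s, vocab, length):
    """Truncate s to `length` characters, then right-pad with PAD_ID."""
    ids = [vocab[ch] for ch in s[:length]]  # raises KeyError on unseen characters
    return ids + [PAD_ID] * (length - len(ids))

def dropped_fraction(strings, length):
    """Fraction of all characters that truncating to `length` would discard."""
    total = sum(len(s) for s in strings)
    dropped = sum(max(0, len(s) - length) for s in strings)
    return dropped / total

column2 = ["JIK9PO", "JIK9POS", "JIK9POKI", "KIJU", "RANDOM", "FT"]
vocab = build_vocab(column2)
longest = max(len(s) for s in column2)

padded = [encode(s, vocab, longest) for s in column2]  # option 1: pad to the longest
truncated = [encode(s, vocab, 5) for s in column2]     # option 2: cut to length 5

print(padded[-1])                    # encoding of "FT", padded to `longest`
print(dropped_fraction(column2, 5))  # share of characters lost by option 2
```

Every row in `padded` has the same length as the longest string, and every row in `truncated` has length 5, so either can be fed to a model that expects fixed-size input.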

I hope the question is clear. Thanks :)

Upvotes: 1

Views: 1013

Answers (2)

Benedict K.

Reputation: 854

Have a look at the PyTorch docs: `pack_padded_sequence` helps you avoid dynamic graphs and allows the network to disregard padded input. This would be straightforward to implement.

The docs describe it as: "Packs a Variable containing padded sequences of variable length."
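
A minimal sketch of how this might look for column 2 of the question, assuming a character-level vocabulary and a small GRU encoder. The vocabulary, the embedding size, and the hidden size are illustrative choices, not something the answer specifies. The strings are padded to a common length with `pad_sequence`, then packed so the recurrent layer skips the padded positions.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

column2 = ["JIK9PO", "JIK9POS", "JIK9POKI", "KIJU", "RANDOM", "FT"]

# Hypothetical character vocabulary; index 0 is reserved for padding.
chars = sorted({ch for s in column2 for ch in s})
vocab = {ch: i + 1 for i, ch in enumerate(chars)}

seqs = [torch.tensor([vocab[ch] for ch in s]) for s in column2]
lengths = torch.tensor([len(s) for s in seqs])  # true lengths, kept on the CPU

# Pad to the longest sequence in the batch: shape (batch, max_len).
padded = pad_sequence(seqs, batch_first=True, padding_value=0)

embedding = nn.Embedding(len(vocab) + 1, 8, padding_idx=0)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

# Packing tells the GRU where each sequence really ends.
packed = pack_padded_sequence(
    embedding(padded), lengths, batch_first=True, enforce_sorted=False
)
_, h_n = gru(packed)

# Last hidden state of the single layer: one fixed-size vector per string.
features = h_n[-1]
print(features.shape)
```

With `enforce_sorted=False` the batch does not need to be sorted by length beforehand, and the rows of `features` come back in the same order as `column2`.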

Upvotes: 1

KonstantinosKokos

Reputation: 3453

There are two solutions:

  1. The one you mentioned: predefine a length and zero-pad sequences that fall short of it. This length can be set either to:

    • the length of the longest sample currently present (larger feature space ⇒ time / memory complexity consequences),
    • or to a shorter length (information loss ⇒ predictive power penalty). Information loss stems from either ignoring sequences above that length or truncating them and using their cut-down versions.

      In both cases you should probably quantify the impact of your choice (i.e. how much information have I discarded from my data by discarding/truncating, or how much larger is my problem space compared to using a smaller length).

  2. Dynamic graphs, essentially variable-shape networks, can handle sequences of different sizes. Such capabilities are offered by PyTorch and are (relatively) straightforward to implement (related SO question).
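
As a rough illustration of the second solution, the sketch below processes each string on its own, so the number of recurrent steps follows the length of the input and no padding is involved. The `CharEncoder` class, the vocabulary, and the layer sizes are hypothetical choices made for this example, not taken from the answer or the linked question.

```python
import torch
from torch import nn

class CharEncoder(nn.Module):
    """Encodes one string of any length into a fixed-size vector."""

    def __init__(self, vocab_size, embed_dim=8, hidden_dim=16):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.GRUCell(embed_dim, hidden_dim)

    def forward(self, ids):
        # ids: 1-D LongTensor of character indices, any length >= 0.
        h = torch.zeros(1, self.hidden_dim)
        for i in ids:  # the graph is built step by step, once per character
            x = self.embedding(i.view(1))  # shape (1, embed_dim)
            h = self.cell(x, h)            # shape (1, hidden_dim)
        return h.squeeze(0)

column2 = ["JIK9PO", "JIK9POS", "JIK9POKI", "KIJU", "RANDOM", "FT"]
chars = sorted({ch for s in column2 for ch in s})
vocab = {ch: i for i, ch in enumerate(chars)}  # no padding index needed

encoder = CharEncoder(vocab_size=len(vocab))
vectors = [encoder(torch.tensor([vocab[ch] for ch in s])) for s in column2]
print([tuple(v.shape) for v in vectors])
```

Each string yields a vector of the same size regardless of its length. The trade-off is that strings are handled one at a time here, which is simpler but slower than batching.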

Upvotes: 2
