Shlomi Schwartz

Reputation: 8903

preserving order information in a single feature

The following is one column of a dataset that I'm trying to feature engineer:

+---+-----------------------------+
|Id |events_list                  |
+---+-----------------------------+
|1  |event1,event3,event2,event1  |
+---+-----------------------------+
|2  |event3,event2                |
+---+-----------------------------+

There are 3 possible event types, and the order in which they arrived is saved as a string. I've transformed the events column like so:

+---+------+------+------+
|Id |event1|event2|event3|
+---+------+------+------+
|1  |2     |1     |1     |
+---+------+------+------+
|2  |0     |1     |1     |
+---+------+------+------+

This preserves the count information but loses the order information.
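(For reference, a minimal pandas sketch of that count transformation; the DataFrame construction and column names simply mirror the tables above:)

from collections import Counter

import pandas as pd

df = pd.DataFrame({'Id': [1, 2],
                   'events_list': ['event1,event3,event2,event1',
                                   'event3,event2']})

# count each event type per row; events missing from a row become 0
counts = (df['events_list'].str.split(',')
          .apply(Counter)
          .apply(pd.Series)
          .fillna(0)
          .astype(int)
          .sort_index(axis=1))

print(pd.concat([df[['Id']], counts], axis=1))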

Q: Is there a way to encode the order as a feature?

Update: For each row of events I calculate a score for that day, and the model should predict the future score for new daily events. In any case, both the order and the count of the events affect the daily score.

Update: My dataset contains other daily information, such as session counts, and currently my model is an LSTM digesting each row by date. I want to try to improve my prediction by adding the order info to the existing model.

Upvotes: 1

Views: 253

Answers (2)

Yahya

Reputation: 14062

One option is to translate/transform the string directly by creating a meaningful one-to-one mapping. In this case, preserving the order is doable and has a meaning.

Here is a simple demo:

data = ['event1,event3,event2,event1', 'event2,event2', 'event1,event2,event3']

def mapper(data):
    result = []
    for d in data:
        events = d.replace(' ', '').split(',')
        v = 0
        for i, e in enumerate(events):
            # for each string: get the sum of char values,
            # normalized by their orders
            # here 100 is optional, just to make the number small
            v += sum(ord(c) for c in e) / (i + 100) 
        result.append(v)
    return result

new_data = mapper(data)
print(new_data)

Output:

[23.480727373137086, 11.8609900990099, 17.70393127548049]

Although the probability of clashes is very low, there is no 100% guarantee that there will be no clashes at all on a gigantic dataset.

Check this analysis:

# check for clashes on huge dataset
import random as r
import matplotlib.pyplot as plt

r.seed(2020)

def ratio_of_clashes(max_events):
    MAX_DATA = 1000000
    events_pool = [','.join(['event' + str(r.randint(1, max_events))
                         for _ in range(r.randint(1, max_events))])
                   for _ in range(MAX_DATA)]
    # print(events_pool[0:10])  # print few to see
    mapped_events = mapper(events_pool)
    return abs(len(set(mapped_events)) - len(set(events_pool))) / MAX_DATA * 100


n_samples = range(5, 100)
ratios = []
for i in n_samples:
    ratios.append(ratio_of_clashes(i))

plt.plot(n_samples, ratios)
plt.title('The Trend of Clashes with Change of Number of Events')
plt.show()

[Plot: trend of the clash ratio vs. number of events]

As a result, the fewer events or the less data you have, the lower the clash ratio; the ratio rises with the number of events until it hits some threshold and then flattens out. Still, it is not bad at all (personally, I can live with it).


Update & Final Thoughts:

I just noticed that you are already using an LSTM, so the order matters a great deal. In this case, I strongly suggest you encode the events as integers and then create a time series that fits perfectly into the LSTM. Follow these steps:

  1. Pre-process each string and split them into events (as I did in the example).
  2. Fit LabelEncoder on them and transform them into integers.
  3. Scale the result into [0 - 1] by fitting MinMaxScaler.

You will end up with something like this:

'event1' : 1
'event2' : 2
'event3' : 3
. . .
'eventN' : N

and for 'event1,event3,event2,event3', it becomes [1, 3, 2, 3]; after scaling: [0, 1, 0.5, 1].
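A minimal sketch of these three steps with scikit-learn (the variable names are illustrative; note that LabelEncoder assigns 0-based codes, so the raw integers are shifted by one relative to the mapping above, but the scaled values come out the same):

from sklearn.preprocessing import LabelEncoder, MinMaxScaler

rows = ['event1,event3,event2,event3', 'event3,event2']

# 1. split each string into its events
sequences = [r.replace(' ', '').split(',') for r in rows]

# 2. fit a LabelEncoder on all events, transform each sequence to integers
le = LabelEncoder()
le.fit([e for seq in sequences for e in seq])
encoded = [le.transform(seq) for seq in sequences]

# 3. scale the integer codes into [0, 1]
scaler = MinMaxScaler()
scaler.fit([[c] for seq in encoded for c in seq])
scaled = [scaler.transform([[c] for c in seq]).ravel() for seq in encoded]

print(scaled[0])  # [0.  1.  0.5 1. ] for 'event1,event3,event2,event3'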

An LSTM is then more than capable of figuring out the order by nature. And forget about the dimensionality concern: an LSTM's main job is to remember, and optionally forget, steps and the order of steps.

Upvotes: 1

ezekiel

Reputation: 524

One possibility could be a series of vectors representing the events that have occurred up to the nth event, where n is the maximum number of events that can occur and the vector length is the number of possible event types. This implicitly encodes the order of the events in a fixed-size feature space (see the sketch after the tables below).

+---+-----------------------------+
|Id |events_list                  |
+---+-----------------------------+
|1  |event1,event3,event2,event1  |
+---+-----------------------------+
|2  |event3,event2                |
+---+-----------------------------+



+---+--------------------------------------------------+
|Id | events_1  events_2  events_3  events_4  events_5 |
+---+--------------------------------------------------+
|1  | [1,0,0]   [1,0,1]   [1,1,1]   [2,1,1]   [2,1,1]  |
+---+--------------------------------------------------+
|2  | [0,0,1]   [0,1,1]   [0,1,1]   [0,1,1]   [0,1,1]  |
+---+--------------------------------------------------+
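A short sketch of how these cumulative vectors could be built (the helper name and the repeat-last-state padding are my choices, not from the original post):

N_EVENTS = 3  # event1..event3
N_STEPS = 5   # pad every sequence to the same length

def cumulative_vectors(events_list, n_events=N_EVENTS, n_steps=N_STEPS):
    events = events_list.split(',')
    counts = [0] * n_events
    steps = []
    for i in range(n_steps):
        if i < len(events):
            # 'eventK' -> index K-1
            counts[int(events[i].replace('event', '')) - 1] += 1
        steps.append(list(counts))  # snapshot after step i (repeats once padded)
    return steps

print(cumulative_vectors('event1,event3,event2,event1'))
# [[1, 0, 0], [1, 0, 1], [1, 1, 1], [2, 1, 1], [2, 1, 1]]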

OR

a version with fewer feature dimensions but the same information would be to record which event occurred at each event step n:

+---+--------------------------------------------------+
|Id | event_1  event_2  event_3  event_4  event_5      |
+---+--------------------------------------------------+
|1  |   1         3        2        1        0         |
+---+--------------------------------------------------+
|2  |   3         2        0        0        0         |
+---+--------------------------------------------------+

This has fewer dimensions, which is good, but the possible disadvantage of not explicitly encoding the final state. Not knowing anything about the problem itself or which model you plan to use, it's difficult to know whether that will matter or not.
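A sketch of this second encoding, padding with 0 for "no event" (again, the helper name is illustrative):

def event_ids(events_list, n_steps=5):
    # 'eventK' -> K; pad with 0 meaning "no event"
    ids = [int(e.replace('event', '')) for e in events_list.split(',')]
    return ids + [0] * (n_steps - len(ids))

print(event_ids('event1,event3,event2,event1'))  # [1, 3, 2, 1, 0]
print(event_ids('event3,event2'))                # [3, 2, 0, 0, 0]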

Upvotes: 0
