Reputation: 8903
The following is one column of a dataset that I'm trying to feature engineer:
+---+-----------------------------+
|Id |events_list |
+---+-----------------------------+
|1 |event1,event3,event2,event1 |
+---+-----------------------------+
|2 |event3,event2 |
+---+-----------------------------+
There are 3 possible event types, and the order in which they arrived is saved as a string. I've transformed the events column like so:
+---+------+------+------+
|Id |event1|event2|event3|
+---+------+------+------+
|1  |2     |1     |1     |
+---+------+------+------+
|2  |0     |1     |1     |
+---+------+------+------+
This preserves the count information but loses the order information.
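For reference, the count transformation described above can be sketched in plain Python. The names `EVENT_TYPES` and `count_events` are made up for illustration and are not from the original dataset code; the sketch assumes the three event types shown in the tables.

```python
from collections import Counter

# Assumed for illustration: the three event types from the question.
EVENT_TYPES = ["event1", "event2", "event3"]

def count_events(events_list):
    """Return per-type counts for one comma-separated event string."""
    counts = Counter(e.strip() for e in events_list.split(",") if e.strip())
    return [counts[t] for t in EVENT_TYPES]

rows = ["event1,event3,event2,event1", "event3,event2"]
count_features = [count_events(r) for r in rows]
print(count_features)  # [[2, 1, 1], [0, 1, 1]]
```

Two rows with the same events in a different order, such as `event1,event2` and `event2,event1`, produce identical count vectors, which is exactly the order information that gets lost.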
Q: is there a way to encode the order as a feature?
Update: for each row of events I calculate a score for that day, and the model should predict the future score for new daily events. In any case, both the order and the count of my events affect the daily score.
Update: My dataset contains other daily information, such as session counts, and my current model is an LSTM that digests each row by date. I want to try to improve my predictions by adding the order information to the existing model.
Upvotes: 1
Views: 253
Reputation: 14062
One option is to transform the string directly into a number through a meaningful one-to-one (1 --> 1) mapping. That way the order is preserved and carries meaning.
This is a simple demo:
data = ['event1,event3,event2,event1', 'event2,event2', 'event1,event2,event3']

def mapper(data):
    result = []
    for d in data:
        events = d.replace(' ', '').split(',')
        v = 0
        for i, e in enumerate(events):
            # for each string: get the sum of char values,
            # normalized by their orders
            # here 100 is optional, just to make the number small
            v += sum(ord(c) for c in e) / (i + 100)
        result.append(v)
    return result

new_data = mapper(data)
print(new_data)
Output:
[23.480727373137086, 11.8609900990099, 17.70393127548049]
Although the probability of clashes is very low, there is no 100% guarantee that a gigantic dataset will have none at all.
Check this analysis:
# check for clashes on huge dataset
import random as r
import matplotlib.pyplot as plt

r.seed(2020)

def ratio_of_clashes(max_events):
    MAX_DATA = 1000000
    events_pool = [','.join(['event' + str(r.randint(1, max_events))
                             for _ in range(r.randint(1, max_events))])
                   for _ in range(MAX_DATA)]
    # print(events_pool[0:10])  # print few to see
    mapped_events = mapper(events_pool)
    return abs(len(set(mapped_events)) - len(set(events_pool))) / MAX_DATA * 100

n_samples = range(5, 100)
ratios = []
for i in n_samples:
    ratios.append(ratio_of_clashes(i))

plt.plot(n_samples, ratios)
plt.title('The Trend of Clashes with Change of Number of Events')
plt.show()
As a result, the fewer events or data you have, the lower the clash ratio; beyond some threshold the ratio flattens out. All in all, it is not bad (personally, I can live with it).
Update & Final Thoughts:
I just noticed that you are already using an LSTM, so the order matters a great deal. In this case, I strongly suggest encoding the events as integers and then building a time series from them, which fits an LSTM perfectly.
You will end up with something like this:
'event1' : 1
'event2' : 2
'event3' : 3
. . .
'eventN' : N
and 'event1,event3,event2,event3' becomes [1, 3, 2, 3]. Scaling --> [0, 1, 0.5, 1].
The LSTM is then more than capable of picking up the order by its nature. And forget about the dimensionality point, since an LSTM's main job is to remember, and optionally forget, steps and the order of steps!
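The integer encoding and scaling just described can be sketched as follows. The helper names `encode_events` and `scale_ids` are hypothetical, and scaling by the vocabulary's smallest and largest id (rather than per sequence) is an assumption; it reproduces the numbers above and keeps each event mapped to the same value in every row.

```python
# Assumed vocabulary: one integer id per event type, as in the mapping above.
vocab = {"event1": 1, "event2": 2, "event3": 3}

def encode_events(events_list, vocab):
    """Map a comma-separated event string to a list of integer ids."""
    return [vocab[e.strip()] for e in events_list.split(",")]

def scale_ids(ids, vocab):
    """Min-max scale ids to [0, 1] using the vocabulary's id range."""
    lo, hi = min(vocab.values()), max(vocab.values())
    if hi == lo:  # single-event vocabulary: nothing to scale
        return [0.0 for _ in ids]
    return [(i - lo) / (hi - lo) for i in ids]

encoded = encode_events("event1,event3,event2,event3", vocab)
scaled = scale_ids(encoded, vocab)
print(encoded)  # [1, 3, 2, 3]
print(scaled)   # [0.0, 1.0, 0.5, 1.0]
```

Each scaled list is one ordered sequence for that day; how it is padded and combined with the other daily features depends on how the existing model is wired.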
Upvotes: 1
Reputation: 524
One possibility could be a series of vectors representing the events that had occurred up until the nth event. n is the maximum number of events that can occur, and the vector length is the number of possible events. This would implicitly encode the order of the events into a fixed-size feature space.
+---+-----------------------------+
|Id |events_list |
+---+-----------------------------+
|1 |event1,event3,event2,event1 |
+---+-----------------------------+
|2 |event3,event2 |
+---+-----------------------------+
+---+--------------------------------------------------+
|Id | events_1 events_2 events_3 events_4 events_5 |
+---+--------------------------------------------------+
|1 | [1,0,0] [1,0,1] [1,1,1] [2,1,1] [2,1,1] |
+---+--------------------------------------------------+
|2 | [0,0,1] [0,1,1] [0,1,1] [0,1,1] [0,1,1] |
+---+--------------------------------------------------+
Or, for fewer feature dimensions with the same information, record which event occurred at each event step n:
+---+--------------------------------------------------+
|Id | event_1 event_2 event_3 event_4 event_5 |
+---+--------------------------------------------------+
|1 | 1 3 2 1 0 |
+---+--------------------------------------------------+
|2 | 3 2 0 0 0 |
+---+--------------------------------------------------+
This has fewer dimensions, which is good, but has the possible disadvantage of not explicitly encoding the final state. Not knowing anything about the problem itself or which model you plan to use, it's difficult to know whether that will matter.
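The per-step encoding from the second table can be sketched like this. The function name `step_ids`, the `vocab` mapping, and the use of 0 as the padding value for steps with no event are assumptions that follow the table's layout.

```python
# Assumed vocabulary; 0 is reserved to mean "no event at this step".
vocab = {"event1": 1, "event2": 2, "event3": 3}

def step_ids(events_list, vocab, max_steps=5):
    """Return the event id at each step, zero-padded to max_steps."""
    ids = [vocab[e.strip()] for e in events_list.split(",")][:max_steps]
    return ids + [0] * (max_steps - len(ids))

print(step_ids("event1,event3,event2,event1", vocab))  # [1, 3, 2, 1, 0]
print(step_ids("event3,event2", vocab))                # [3, 2, 0, 0, 0]
```

Because the ids are categorical rather than ordinal, a model may treat event3 as "larger" than event1; whether that matters depends on the model, as noted above.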
Upvotes: 0