Reputation: 381
I have a table of observations, or rather 'grouped' observations, where each group represents a deal and each row represents a product. But the prediction is to be done at the deal level. Below is a sample dataset.
Sample Dataset:
import pandas as pd

df = pd.DataFrame({'deal': ['deal1', 'deal1', 'deal2', 'deal2', 'deal3', 'deal3'],
                   'product': ['prd_1', 'prd_2', 'prd_1', 'prd_2', 'prd_1', 'prd_2'],
                   'Quantity': [2, 1, 5, 3, 6, 7],
                   'Total Price': [10, 7, 25, 24, 30, 56],
                   'Result': ['Won', 'Won', 'Lost', 'Lost', 'Won', 'Won']})
My Approach:
Flatten the data to get one observation per row using pivot_table, so that we get one row per deal, and then proceed with the classification modelling, probably a logistic regression or gradient boosting.
But in the above case we had only 1 column to pivot (product, with 2 unique values) and 2 measures (Quantity and Total Price) as the values, resulting in 4 pivoted columns. The wide-format table is sketched below:
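A minimal sketch of that pivoting step with pandas, assuming the df from the sample above (the column-flattening convention is just one possible choice):

# Pivot to one row per deal; 'product' values become column suffixes
wide = df.pivot_table(index='deal',
                      columns='product',
                      values=['Quantity', 'Total Price'],
                      aggfunc='first')

# Flatten the MultiIndex columns, e.g. ('Quantity', 'prd_1') -> 'Quantity_prd_1'
wide.columns = [f'{measure}_{product}' for measure, product in wide.columns]

# 'Result' is constant within a deal, so take the first value per deal
wide['Result'] = df.groupby('deal')['Result'].first()
wide = wide.reset_index()

# Roughly:
#     deal  Quantity_prd_1  Quantity_prd_2  Total Price_prd_1  Total Price_prd_2 Result
# 0  deal1               2               1                 10                  7    Won
# 1  deal2               5               3                 25                 24   Lost
# 2  deal3               6               7                 30                 56    Won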
Question/Problem/Thought:
Is this always the best way in cases like these? The problem (or maybe not?) I see is that when there is more than one column to pivot, and the number of unique value combinations in those columns is large, the table may get very, very wide!
I would be grateful to hear about alternative, efficient ways to prepare the dataset for training, if any!
Upvotes: 1
Views: 96
Reputation: 19
This is a small example, but if it were bigger then yes, that widening is bad since it becomes messy. It would also force you to add new columns for every new product type.
The best approach is to find a set of key columns that together determine the other columns, i.e. where every other column depends completely on the chosen keys.
In this case the two columns {deal, product} can be chosen for this. Together they are then called a candidate key: a set of columns whose combined values are unique in each row. Once you have chosen it, you can look at the dependencies between this candidate key and the other columns.
The other thing to do is to break the table into more tables. It is good to have each column depend on the whole primary key. In the case of {Result}, it has a functional dependence on only {deal} and not on {product}, written as {deal} -> {Result}.
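For the sample frame from the question, both of those properties can be checked quickly in pandas (just a sketch against that df):

# {deal, product} is a candidate key if no combination of the two repeats
is_candidate_key = not df.duplicated(subset=['deal', 'product']).any()

# {deal} -> {Result}: Result takes exactly one value within each deal
result_depends_on_deal_only = (df.groupby('deal')['Result'].nunique() == 1).all()

print(is_candidate_key, result_depends_on_deal_only)  # True True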
Upvotes: 0
Reputation: 1
This is on the face of it good, since you are creating what is called a primary key: you have made 'deal' into a primary key. When it comes to database normalization it is always good to have a primary key, and 'deal' is a good choice since its values are never repeated; they are never repeated because you have named them with increments.
You mention that there might be more types of products traded in a deal and that the table might therefore become wider. If this is true, it is better not to spread the table out like this. What you need to do instead is put your data into second normal form (2NF). You do this by doing two things: identify the columns that depend on only part of the key (here Result depends only on deal, not on product), and move those columns into their own table keyed by that part.
This would leave you with two tables, as sketched below.
It is really good to divide the table into two tables since it avoids deletion anomalies, that being the loss of the 'win status' information when all the trades in a deal are deleted.
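A sketch of that split on the df from the question, with one deal-level table and one line-item table (names are my own choice):

# Deal-level table: primary key 'deal', holding the columns that depend only on the deal
deals = df[['deal', 'Result']].drop_duplicates().reset_index(drop=True)

# Line-item table: primary key {deal, product}, holding the per-product measures
deal_products = df[['deal', 'product', 'Quantity', 'Total Price']].copy()

# Deleting every product row of a deal no longer removes its win status,
# since 'Result' now lives in the deals table.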
I recommend watching https://www.youtube.com/watch?v=4bTq0GdSeQs, where he talks about exactly this problem in his inventory example.
Upvotes: 0
Reputation: 9
This seems like a very valid approach, since you are saying that the predictions are to be done on a "deal" level, and not on the individual "deal + product" level.
It shouldn't be a problem if the table gets wide, as long as that is actually the form the data needs to be in for predictions at inference time.
If you could group your deals depending on the features used (e.g. which combinations of products are within the deal), you might be able to create multiple smaller models instead of a single large one.
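For instance, one way to form such groups in pandas (a sketch against the df from the question) is to key each deal by the combination of products it contains:

# Label each deal by the (sorted) combination of products it contains
deal_segments = (df.groupby('deal')['product']
                   .apply(lambda s: tuple(sorted(s.unique()))))

# deal_segments maps every deal to a segment key such as ('prd_1', 'prd_2');
# a separate, smaller model could then be fit on the deals in each segment.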
Upvotes: 1