I’m working with a dataset for a project. Unfortunately, I don’t have access to detailed information about the variables (apart from the descriptions I'll provide below). After cleaning the data a bit (it was a real mess), I’m left with this dataset, which I’ll link here (along with a screenshot). One of the variables is called "margin", defined as "cumulative customer margin". This makes me think the variable is an absolute amount rather than a ratio. I also have two other variables: "price" and "number of transactions". When I filter for number of transactions = 1, I’d expect price to always be higher than margin (assuming margin = price - cost). However, I’ve found many rows where margin exceeds price. I’m attaching a few examples in the screenshot. Any insights would be greatly appreciated! Here is the translation of the table:
Variable Data Challenge | Description |
---|---|
EVENT_ID | Transaction ID |
N_ITEMS | Total number of items purchased in the transaction |
PROP_CONBINI | Proportion of "conbini" articles in the transaction |
FAV_GENRE | Preferred manga genre |
PHONE_NUMBER | Customer's phone number available |
Customer's e-mail address | |
YEAR | Transaction year |
MONTH | Transaction month |
PAYMENT_TYPE | Agreed payment method |
BOOKS_PAID | Number of manga paid for in previous transactions |
PRICE | Transaction price |
N_SUBSCRIPTIONS | Number of active manga series subscriptions |
SUBSCR_CANC | Number of canceled manga series subscriptions in the past |
POINT_OF_SALE | Point of sale |
AGE | Customer's age |
DAYS_FROM_PROMO | Days since the last promo launch |
MARGIN | Cumulative customer margin |
N_TRANSACTIONS | Total number of transactions made by the customer |
CUSTOMER_SINCE | Date of the customer's first transaction |
DATE_LAST_PURCHASE | Date of the customer's last transaction |
PAID | Amount paid (target variable) |
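
The single-transaction check described above can be sketched as follows. This is only a minimal illustration in plain Python: the rows are made-up numbers, not values from the real dataset, and the helper name `split_single_transaction_rows` is hypothetical. It assumes margin = price - cost with a non-negative cost, so any row with MARGIN > PRICE is flagged as anomalous.

```python
# Toy rows with made-up numbers; column names follow the table above.
rows = [
    {"EVENT_ID": 1, "N_TRANSACTIONS": 1, "PRICE": 20.0, "MARGIN": 5.0},
    {"EVENT_ID": 2, "N_TRANSACTIONS": 1, "PRICE": 15.0, "MARGIN": 40.0},
    {"EVENT_ID": 3, "N_TRANSACTIONS": 3, "PRICE": 10.0, "MARGIN": 25.0},
    {"EVENT_ID": 4, "N_TRANSACTIONS": 1, "PRICE": 30.0, "MARGIN": 12.0},
]


def split_single_transaction_rows(rows):
    """Split single-transaction rows into consistent and anomalous ones.

    A row is anomalous when MARGIN > PRICE, which contradicts
    margin = price - cost with a non-negative cost.
    """
    single = [r for r in rows if r["N_TRANSACTIONS"] == 1]
    consistent = [r for r in single if r["MARGIN"] <= r["PRICE"]]
    anomalous = [r for r in single if r["MARGIN"] > r["PRICE"]]
    return consistent, anomalous


consistent, anomalous = split_single_transaction_rows(rows)
anomalous_ids = [r["EVENT_ID"] for r in anomalous]
# Share of single-transaction rows that violate the expectation.
anomaly_share = len(anomalous) / (len(consistent) + len(anomalous))
print(anomalous_ids, round(anomaly_share, 3))
```

Looking at the size of `anomaly_share` on the real data may help tell a handful of data-entry errors apart from a systematic mismatch in how margin is defined.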
Does anyone have any ideas? Am I missing something about the "margin" variable? I’ve also considered the possibility that it represents a relative value, but many of the margin values are greater than 100, which wouldn't be possible for a percentage. I didn't find any significant pattern with the other variables. This variable is crucial for me because I need to infer the average cost, which I plan to use as a weight for false negatives (I’m building a classification model for credit scoring, where 1 = pays, 0 = doesn’t pay). Any suggestions or insights would be incredibly helpful!
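
To make the intended use of the cost concrete, here is a rough sketch of inferring an average cost and using it as the false-negative weight. Everything in it is an assumption for illustration: the (PRICE, MARGIN) pairs and the label vectors are made up, `weighted_error` is a hypothetical helper, and the cost is taken as price - margin on single-transaction rows where margin does not exceed price.

```python
from statistics import mean

# Made-up (PRICE, MARGIN) pairs for single-transaction customers whose
# values are consistent with margin = price - cost.
consistent_pairs = [(20.0, 5.0), (30.0, 12.0), (12.0, 3.0)]

# Implied cost per transaction, then its average.
avg_cost = mean(price - margin for price, margin in consistent_pairs)


def weighted_error(y_true, y_pred, fn_weight, fp_weight=1.0):
    """Total misclassification cost with separate weights per error type.

    Labels follow the post: 1 = pays, 0 = doesn't pay. A false negative is
    a customer who pays (1) but is predicted not to pay (0).
    """
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 0:
            total += fn_weight  # false negative
        elif actual == 0 and predicted == 1:
            total += fp_weight  # false positive
    return total


# Hypothetical labels and predictions: two false negatives, one false positive.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0]
cost = weighted_error(y_true, y_pred, fn_weight=avg_cost)
print(avg_cost, cost)
```

The inferred weight is only as reliable as the rows it is computed from, so it depends directly on how the anomalous margin values are handled.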