Quickbeam2k1

Reputation: 5437

Data Preparation: Alternatives to constructing a table in SQL that results in many columns

I'm trying to fetch data from our company's DWH server and use some machine learning algorithms on that data.

(Un)fortunately, the data I want to aggregate reside in different tables, e.g. demographic customer info, purchase behaviour, and received advertisements.

Furthermore, advertisements are classified into two classes, as are the products a customer can purchase. For each customer I need to construct a mapping from advertisements to purchases (depending on the time after an advert is delivered), so that I know whether an advertisement was successful or not.

Additionally, to build a machine learning algorithm, I need to assign purchases and received adverts from BEFORE I send out a new advertisement and summarize those data in aggregated form. Both received adverts and purchases are subject to seasonality.

Hence, it might be wise to aggregate data (like net profit, number of purchases, etc.) over the current season, the previous season, the season before that, and so on. Furthermore, recall that products are classified into two categories, which doubles the number of aggregates/columns.

I think you can sense where this is heading:

Is there an efficient alternative for fetching data from SQL for machine learning that avoids lots of columns?

Of course, I could postpone the aggregation and combination step to R or Python on my local machine, but the SQL server should typically be much faster.

Currently, my table would contain roughly 400 columns, and the queries haven't finished after 80 minutes on the SQL server.

Maybe I should add the following: In the final aggregation step I'm using many statements of the form:

SUM(IIF(PRODUCTTYPE = 1, NET, 0)) AS TYPE1NET
SUM(IIF(PRODUCTTYPE = 1 AND SEASON = 'WIN', NET, 0)) AS TYPE1NETWIN

I don't see a place where I could calculate this earlier on.
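For reference, here is a minimal sketch of what postponing the pivot to Python could look like: the server returns one row per (customer, product type, season) group in long format, and pandas reshapes it into wide columns locally. The FACTS table name, the CUSTOMERID column, and the connection string are placeholders; PRODUCTTYPE, SEASON and NET are taken from the query above.

import pandas as pd
import pyodbc  # any DB-API driver would do; connection details are placeholders

# Hypothetical long-format aggregation: one row per group instead of hundreds of columns.
QUERY = """
SELECT CUSTOMERID, PRODUCTTYPE, SEASON, SUM(NET) AS NET
FROM FACTS                               -- placeholder table name
GROUP BY CUSTOMERID, PRODUCTTYPE, SEASON
"""

conn = pyodbc.connect("DSN=dwh")         # placeholder connection string
long_df = pd.read_sql(QUERY, conn)

# Pivot to wide format on the client: the column (1, 'WIN') corresponds to TYPE1NETWIN.
wide_df = long_df.pivot_table(
    index="CUSTOMERID",
    columns=["PRODUCTTYPE", "SEASON"],
    values="NET",
    aggfunc="sum",
    fill_value=0,
)

The server then only does a plain GROUP BY, and the explosion into many columns happens in memory on the client.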

Upvotes: 0

Views: 97

Answers (1)

Denis Kulagin

Reputation: 8937

I work a lot with data, and sometimes the volumes are just huge. Dumping and transforming data into a form consumable by ML algorithms is a painful and usually time-consuming process.

I have tried using RDBMS capabilities to accomplish the task, but failed in most cases. The main reason: they are not fit for it (e.g. they don't support streaming of the data).

In my experience, the best way is to dump the data into CSV format and feed it to some key-value engine (HBase would do; I have also used a simple Java-based solution of my own making). Then choose your favourite programming language and iterate over the set, constructing your data and persisting it to disk.
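For illustration, a minimal sketch of that last iterate-and-persist step in Python, working straight off a CSV dump (the file paths and the column names customer_id and net are assumptions, not from the original post):

import csv
from collections import defaultdict

totals = defaultdict(float)  # aggregate per customer while streaming the dump

with open("dump.csv", newline="") as f:          # placeholder path to the CSV dump
    for row in csv.DictReader(f):
        # accumulate whatever aggregate you need, e.g. net revenue per customer
        totals[row["customer_id"]] += float(row["net"])

# persist the constructed features to disk for the ML step
with open("features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "total_net"])
    for customer_id, total_net in totals.items():
        writer.writerow([customer_id, total_net])

Because the dump is processed row by row, memory usage stays flat regardless of how large the table is.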

Upvotes: 1
