November 14, 2007
An introduction to data mining and predictive analytics
I attended a nice little introductory webinar on data mining from Salford Systems today. I won't try to summarize the whole presentation, but it made a number of good points:
- Data is key and must typically be provided as a single "table" where each row is a single instance or transaction.
This normally means a flat file generated from a relational structure of some kind. Multiple tables are often collapsed to generate additional columns so that, for instance, monthly activity for the last year is moved from a separate table to 12 columns in the dataset. Data mining and predictive analytic tools overwhelmingly do not support the relational model.
- They differentiated between attributes (data from your database) and features (something calculated from this for use in data mining).
There are other terms used for this, but the creation of these calculated attributes is often the most important and tricky part of building a predictive model. Indeed, a whole class of software exists to help you find the most predictive of the possible candidates.
- Dates are not terribly useful.
Absolute dates, that is. Instead, the time since an event (such as subscribing) or the time between two dates is more common and more useful.
- Data mining is not the same as BI or OLAP.
Something I have said before.
- You need to split your data into three to build good models.
The main dataset is used to build various models, the test dataset is used to find which one(s) work best, and the validation set is used to confirm that the model works well even for data it has never "seen" before.
- Models are obsolete when deployed.
A slightly extreme POV quoted in the presentation but a valid one. Models start to age as soon as they are finished and so must be revised and kept up to date. Understanding how often they must be updated to remain relevant is a critical consideration for any model.
- Target variables drive supervised models.
If you know what you are trying to predict then you are in supervised modeling and are looking either for a number within a range (regression modeling, for instance) or for a class label (classification). Without a target variable you are in unsupervised modeling, such as clustering, and "don't know what you don't know".
- Data preparation is often the biggest and most important task.
Getting the data cleaned up and ready to use can be 90% of the work. Missing values, fields with too many options and many other things can upset input data. Get this right or the model will suffer.
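To make three of these points a little more concrete (flattening a one-to-many table into columns, turning an absolute date into a "time since" feature, and splitting the data three ways), here is a minimal sketch in plain Python. The field names, the helper functions and the 60/20/20 split are all my own assumptions for illustration; none of this comes from the Salford presentation or their tools.

```python
import random
from datetime import date


def flatten_monthly_activity(customers, activity):
    """Collapse a one-to-many activity table into 12 columns per customer.

    `activity` maps (customer id, month number) to an activity count
    (a hypothetical layout chosen for this example).
    """
    rows = []
    for cust in customers:
        row = dict(cust)  # copy so the input records are not modified
        for month in range(1, 13):
            # Months with no recorded activity default to 0
            row[f"activity_m{month:02d}"] = activity.get((cust["id"], month), 0)
        rows.append(row)
    return rows


def add_tenure_feature(rows, as_of):
    """Swap the absolute subscription date for days since subscribing."""
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != "subscribed"}
        new_row["days_since_subscribed"] = (as_of - row["subscribed"]).days
        result.append(new_row)
    return result


def three_way_split(rows, fractions=(0.6, 0.2), seed=0):
    """Shuffle, then split into build, test and validation sets.

    `fractions` gives the build and test shares; validation gets the rest.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_build = int(len(shuffled) * fractions[0])
    n_test = int(len(shuffled) * fractions[1])
    return (
        shuffled[:n_build],
        shuffled[n_build:n_build + n_test],
        shuffled[n_build + n_test:],
    )


# Made-up example data: ten customers, a few activity records
customers = [{"id": i, "subscribed": date(2007, 1, 1 + i)} for i in range(10)]
activity = {(0, 1): 5, (0, 2): 3, (1, 12): 7}

flat = flatten_monthly_activity(customers, activity)
rows = add_tenure_feature(flat, as_of=date(2007, 11, 14))
build, test, validation = three_way_split(rows)
```

Each row in `rows` is now a single flat record with twelve activity columns and a tenure feature instead of a raw date, which is the shape most data mining tools expect. The fixed seed only makes the split repeatable; a real project would likely use whatever sampling its modeling tool provides.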
I thought they could have talked more about the deployment of models and the combination of them with regulations and policy (business rules), but overall I really liked the webinar. If you are looking for an introduction, check it out. They promise to have a recording some time soon.
The folks at Salford Systems run this webinar periodically. The next one is on December 13th (details here) and the whole schedule is kept here.
Posted by jtaylor in Predictive Analytics
Your review of data mining is excellent. I'd just make one clarification regarding "You need to split your data into three to build good models". Predictive models indeed require validation, but the train/tune/test process is not the only one. For a variety of reasons (for example, small number of observations), the data miner may instead elect to use other test procedures, such as k-fold cross-validation or bootstrapping, which are just as rigorous.
Posted by: Will Dwinnell at December 17, 2007 01:35 AM
James Taylor's Decision Management