James Taylor
James Taylor's Decision Management
James is one of the leading experts in enterprise decision management, a published author and a principal of Smart (enough) Systems LLC. His blog discusses the use of decision management technologies like predictive analytics and business rules to deliver agility, improve business processes and bring intelligent automation to SOA.


November 14, 2007
An introduction to data mining and predictive analytics

I attended a nice little introductory webinar on data mining from Salford Systems today. I won't try to summarize the whole presentation, but it made a number of good points:

  • Data is key and must typically be provided as a single "table" where each row is a single instance or transaction.
    This normally means a flat file generated from a relational structure of some kind. Multiple tables are often collapsed to generate additional columns so that, for instance, monthly activity for the last year is moved from a separate table to 12 columns in the dataset. Data mining and predictive analytic tools overwhelmingly do not support the relational model.
  • They differentiated between attributes (data from your database) and features (something calculated from this for use in data mining)
    There are other terms used for this but the creation of these calculated attributes is often the most important and tricky part of building a predictive model. Indeed a whole class of software exists to help you find the most predictive of the possible candidates.
  • Dates are not terribly useful
    Absolute dates, that is. Instead, the time since an event (such as subscribing) or the time between two dates is more common and more useful.
  • Data mining is not the same as BI or OLAP
    Something I have said before.
  • You need to split your data into three to build good models
    The main dataset is used to build various models, the test dataset is used to find which one(s) work best and the validation set is used to confirm that the model works well even for data it has never "seen" before.
  • Models are obsolete when deployed
    A slightly extreme POV quoted in the presentation but a valid one. Models start to age as soon as they are finished and so must be revised and kept up to date. Understanding how often they must be updated to remain relevant is a critical consideration for any model.
  • Target variables drive supervised models
    If you know what you are trying to predict, then you are in supervised modeling and are looking either for a number within a range (regression modeling, for instance) or for a category (classification). Without a target variable you are in unsupervised modeling, such as clustering, and "don't know what you don't know".
  • Data preparation is often the biggest and most important task
    Getting the data cleaned up and ready to use can be 90% of the work. Missing values, fields with too many options and many other problems can upset the input data. Get this right or the model will suffer.

I thought they could have talked more about the deployment of models and the combination of them with regulations and policy (business rules), but overall I really liked the webinar. If you are looking for an introduction, check it out. They promise to have a recording some time soon.

The folks at Salford Systems run this webinar periodically and keep a schedule of upcoming sessions; the next one is on December 13th.
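
For readers who want to try the three-way split mentioned in the list above, a rough sketch in plain Python might look like this. The 60/20/20 proportions, the fixed seed and the function name `three_way_split` are assumptions for illustration only; the webinar did not prescribe any of them.

```python
import random


def three_way_split(rows, train=0.6, test=0.2, seed=42):
    """Shuffle rows and return (main, test, validation) lists.

    Whatever remains after the main and test portions are taken
    becomes the validation set.
    """
    shuffled = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )


records = list(range(100))  # stand-in for 100 rows of a flat dataset
main_set, test_set, validation_set = three_way_split(records)
```

Models would be built on `main_set`, compared on `test_set`, and the chosen one confirmed on `validation_set`, which it has never "seen". A purely random shuffle is only one option; time-ordered data, for example, may call for a split by date instead.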

    Posted by jtaylor in Predictive Analytics

    Comments

    Your review of data mining is excellent. I'd just make one clarification regarding "You need to split your data into three to build good models". Predictive models indeed require validation, but the train/tune/test process is not the only one. For a variety of reasons (for example, small number of observations), the data miner may instead elect to use other test procedures, such as k-fold cross-validation or bootstrapping, which are just as rigorous.

    Posted by: Will Dwinnell at December 17, 2007 01:35 AM
