Jacob J. Walker's Blog

Scholarly Thoughts, Research, and Journalism for Informal Peer Review

Data Mining: Discovering Gold in your Data


“There’s gold in dem dere data!” – Adaptation of the original quote from M. F. Stephenson

After the data has been gathered and put into a usable form, an appropriate algorithm can be applied to it to accomplish the data mining/machine learning/predictive analytics. This is the stage that traditionally has been called “data mining,” because it is the part that extracts additional value from the data in the form of some type of knowledge. (This is why, early on, the process was sometimes called “knowledge discovery in data,” or KDD.)

Other texts often refer to the algorithm as a “model,” because statistical models are commonly used, and in all cases our understanding of the underlying data is really just a model, one that may or may not produce the answer we are looking for as well as other models would.  But I will use the word “algorithm,” because some techniques involve complex algorithms where we don’t necessarily end up with a clear mental model of why the answer came out as it did.  Examples of where the data scientist might not have a complete understanding of the full underlying model include neural networks (which use a system that acts somewhat like neurons in the brain) and ensemble methods (which combine multiple algorithms/models).
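To make the ensemble idea concrete, here is a minimal sketch in Python: three simple rule-based classifiers vote, and the ensemble returns the majority answer. The rules and thresholds are invented purely for illustration, not taken from any real system.

```python
# A toy ensemble: three hypothetical rule-based classifiers vote on
# whether a number counts as "large", and the majority answer wins.
from collections import Counter

def rule_a(x):
    return x > 10

def rule_b(x):
    return x > 8

def rule_c(x):
    return x > 15

def ensemble_predict(x, models=(rule_a, rule_b, rule_c)):
    # Majority vote: return whichever answer most models agree on.
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

print(ensemble_predict(12))  # True  (rules a and b say True, c says False)
print(ensemble_predict(5))   # False (all three rules say False)
```

Even with such trivial components, the result can differ from any single rule, which is the point of an ensemble, and also why the combined model can be hard to explain.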

The data mining stage requires sufficient knowledge of mathematics/statistics and computer science to choose an algorithm that will get good results from the data you have, and often to then program that algorithm.  (Although data mining software packages, or libraries in a programming language, often spare the data scientist from creating the full algorithm from scratch.)
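As a sketch of what “programming the algorithm” can mean at the smallest scale, here is ordinary least-squares line fitting written out in plain Python. In practice a statistics library would supply this; the toy data below is invented for illustration.

```python
# Fit a line y = slope * x + intercept to data by ordinary least squares,
# implemented directly from the textbook formulas.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]   # perfectly linear toy data: y = 2x
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 0.0
```

A library call hides these few lines behind one function, but the mathematical knowledge needed to pick and trust the method is the same either way.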

So how does a data scientist know if they have a good algorithm?  Usually the data scientist will test it against past data to see whether, had the algorithm been used in the past, it would have made enough correct predictions.   Of course, once the algorithm is implemented, it will continue to be watched to see if it is currently making enough correct predictions.   Further, it is common for one algorithm to be used at first and a better one adopted later.   For example, Netflix has improved its algorithm over time for guessing what other movies you might like to see.   (What is interesting is that Netflix held a huge contest to see who could come up with the best algorithm, and the winner was an ensemble of many other algorithms…  but that method was so complex that Netflix decided not to use it.)
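The idea of testing against past data can be sketched in a few lines: hold out part of the historical records, predict them, and measure accuracy. The records and the trivial “always predict the majority label” model below are hypothetical, chosen only to show the holdout pattern.

```python
# Holdout evaluation: train on older records, test on held-out ones.
def majority_label(labels):
    # A deliberately trivial "model": learn the most common past label.
    return max(set(labels), key=labels.count)

past_labels = ["yes", "yes", "no", "yes", "no", "yes"]

# Train on the first four records; hold out the last two as a test set.
train, test = past_labels[:4], past_labels[4:]
model = majority_label(train)          # learns to always predict "yes"

correct = sum(1 for actual in test if model == actual)
accuracy = correct / len(test)
print(accuracy)  # 1 of the 2 held-out records is "yes" -> 0.5
```

Real evaluations use richer models and metrics, but the logic is the same: the algorithm is judged by how well it would have predicted data it did not see during training.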

Once the software is making predictions/learning/categorizing/etc., there is the need to do something with this knowledge, and this is the stage I will call “data artistry”.

 

Post Revisions:

This post has not been revised since publication.

Written by Jacob Walker

May 11th, 2017 at 11:59 am

One Response to 'Data Mining: Discovering Gold in your Data'


  1. You said:

    “Often the algorithm is referred to as a “model” in other texts, because statistical models are commonly used, and in all cases our understanding of the underlying data is really just a model, and that model may or may not produce the answer we are looking for as well as other models.”

    Data Mining databases are different from Data Warehouse databases, and are designed using highly redundant data storage in a model commonly referred to as a multi-dimensional data Cube that is optimized for queries but is substandard for updates. In the old days, Cubes were referred to as Datamarts. The model first and foremost contains data facts rather than statistical computations, though common stats such as averages and standard deviations are often added to roll-up summary tables.

    The drill down process usually starts at the roll-up level, where something is observed that needs to be further explained by drilling into the detail underneath the summary data. A summary table may be virtual or physical depending on how you have things set up. Obviously virtual tables perform more slowly, but they have the advantage that they can be implemented without much additional physical design.
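    The roll-up and drill-down pattern the commenter describes can be sketched in a few lines of Python. The detail rows (region, product, sales) below are invented for illustration; a real cube would live in a database, not in lists.

```python
# Roll-up / drill-down on toy fact rows of (region, product, sales).
from collections import defaultdict

detail = [
    ("West", "widgets", 100),
    ("West", "gadgets", 250),
    ("East", "widgets", 400),
    ("East", "gadgets", 50),
]

# Roll-up: summarize total sales per region.
rollup = defaultdict(int)
for region, product, sales in detail:
    rollup[region] += sales
print(dict(rollup))  # {'West': 350, 'East': 450}

# Something in the East summary looks worth explaining, so drill
# down into the detail rows underneath it.
east_detail = [row for row in detail if row[0] == "East"]
print(east_detail)   # [('East', 'widgets', 400), ('East', 'gadgets', 50)]
```
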

    Data Miner the Whiner

    14 May 17 at 10:42 am

