Jacob J. Walker's Blog

Scholarly Thoughts, Research, and Journalism for Informal Peer Review

Data Surfing: The Oft Forgotten First Stage of Discovery


“You got to drift in the breeze before you set your sails. It’s an occupation where the wind prevails. Before you set your sails drift in the breeze.” – Paul Simon

Many texts about data science (including machine learning, data mining, and predictive analytics) say little about the very first step of the process: deciding what goal the remaining steps should serve.  In traditional science, this might be called forming a hypothesis.

This step is rarely discussed because it is the least formalized stage of the process.  I call it “data surfing”. (The term is not common, but it has some precedent.)  Data surfing is the stage in which the data scientist builds up “domain expertise” in a subject.  It is often not entirely planned: formal education may be part of it, but it just as likely includes a lot of “surfing around”, following whatever knowledge one finds interesting.

It is also a very important stage for teaching about data science.  It is the stage that helps foster curiosity in students, which is critical to all science and scientific thinking.
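For readers who want something concrete, here is a minimal sketch of what surfing a small dataset might look like before any goal has been set. It is only one possible reading of the idea, applied to a table of records. The `records` list and the `surf` function are hypothetical names invented for this illustration, and the sample values are made up.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample records, invented purely for illustration.
records = [
    {"product": "kayak", "region": "west", "units": 12},
    {"product": "surfboard", "region": "west", "units": 30},
    {"product": "kayak", "region": "east", "units": 7},
    {"product": "wetsuit", "region": "east", "units": 18},
    {"product": "surfboard", "region": "east", "units": 5},
]


def surf(rows):
    """Summarize every column of a list of dicts, with no goal in mind.

    Numeric columns get min, max, and mean; all other columns get
    counts of their most common values (up to three).
    """
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            summary[column] = {
                "min": min(values),
                "max": max(values),
                "mean": mean(values),
            }
        else:
            summary[column] = dict(Counter(values).most_common(3))
    return summary


# Print one line per column so anything curious stands out.
for column, stats in surf(records).items():
    print(column, stats)
```

Nothing here answers a question. The point is only to glance over the data and notice what looks interesting enough to turn into one.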

After a data scientist has spent enough time just surfing and learning about that which they are interested in, they will ultimately need to move on to having a more clearly defined goal of what they want to have the data science process accomplish.  And to accomplish this, there will next be a need for some data wranglin’…

 

Post Revisions:

This post has not been revised since publication.

Written by Jacob Walker

May 9th, 2017 at 11:59 am

One Response to 'Data Surfing: The Oft Forgotten First Stage of Discovery'


  1. To quote you, you said:

    “After a data scientist has spent enough time just surfing and learning about that which they are interested in, they will ultimately need to move on to having a more clearly defined goal of what they want to have the data science process accomplish. And to accomplish this, there will next be a need for some data wranglin’…”

    It just doesn’t work that way in real life. Surfing is the least productive way to design effective databases. I spent many years designing and implementing many super-large Data Warehouses on multiple DBMS platforms, including Oracle, IBM DB2, Teradata, and MS SQL Server, on both IBM mainframes and Linux/HP-UX/Sun Solaris servers, and I can tell you for a fact that…

    (1) The most successful Data Warehouse projects are ALWAYS designed according to the needs of functional business experts who have a clear and specific set of problems that need to be resolved immediately. A successful database will first model the existing business in a generalized, flexible way, and then will have targeted applications written against it that deliver immediate results. Over time, the database is then performance-tuned for critical apps that run slowly, often by creating new datamarts that are fed by the warehouse.

    Typical questions focus on marketing more than anything else, such as: What sales are we losing to our competition? What new products are growing the fastest, and which consumer demographic groups are buying them? Which products are on the way out, with demand consistently trending lower? And so forth.

    (2) The least successful Data Warehouses have been those sold as a magic cure-all by IT without any business involvement in their design, with the mindset that functional managers will love the warehouse once it is developed and will somehow figure out some use for it after they have a chance to “surf” it. In these cases, the senior IT manager pushing the idea is almost always an empire builder who develops the monstrosity for the purpose of furthering her career rather than for the purpose of meeting the needs of the business.

    (3) Data Mining Tools such as Oracle’s Business Intelligence, DM tools from Business Objects, or DM tools from MicroStrategy are difficult to learn and are almost always handed over to people who make the tool their full-time job. In other words, “surfing” is not something that is done casually by a business manager but rather is done by a “tool expert” under circumstances of substantial investment in expensive human capital, an expense which has to be cost-justified upfront by identifying the immediate payback and the timeframe for delivering real results. The idea that the project will somehow pay for itself later under unknown circumstances and be delivered without any deadline just doesn’t fly for most companies.

    Surfing Fantasy

    14 May 17 at 10:20 am
