Jacob J. Walker's Blog

Scholarly Thoughts, Research, and Journalism for Informal Peer Review

The Introduction/Background to my Revised Doctoral Research Proposal


Today, I finished revising the first part of my doctoral research proposal, as there have been several underlying methodological and technological changes from the original proposal. While I know doctoral research is usually not of general interest, I am still going to post the sections of my revised proposal as I finish them, for those who are interested. Please feel free to ask questions if you have them, and I will do my best to explain the statistical techniques or the technology that I’m talking about.

1.1. The Introduction/Background

It is the goal of the proposed research to use data mining methods over a large set of international data to discover correlations that are yet unknown, with a focus on seeing which educational factors of a nation connect with which other aspects of the nation; and with the hope that the discovery of new correlations may lead to further research that can help improve the human condition.

While there has been and likely will continue to be debate over what it means to “improve the human condition” due to the is-ought problem of philosophy (Hume, 1739), and there is also a philosophical debate over the nature of causation in the social sciences (Rosenberg, 2015), there does seem to be agreement that in order to be able to make changes that can improve the human condition, we must determine what influences different aspects of the human condition. And in science, this traditionally starts by making a hypothesis about a potential correlation between variables.

Until recently, the process of hypothesis creation was generally considered something that came only from intuition and could not be derived through a methodological process (Noé, 1998). Richard Feynman was among those who believed the human intellect was crucial to hypothesis creation, but he once joked that a machine could be set up with a random wheel that would make a succession of guessed hypotheses and then automatically test them (Feynman, 1964). While this was originally said in jest, with the computerization of science the idea is now being put into practice through “Knowledge Discovery in Databases” (KDD), also known as “data mining”, which is part of the fields of “discovery science” and “data science”.

The development of KDD / data mining methods has been a direct result of an exponential increase in data, often called the “information explosion”, in tandem with the computerization of this data and the continual increase in the processing power of computers. As an early article about KDD said:

Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD). (Fayyad, Piatetsky-Shapiro, & Smyth, 1996)

Data science methods, including data mining, have gained importance in the natural sciences and the applied / commercial social sciences. For instance, in physics, the Large Hadron Collider, which helped to discover the Higgs boson, produces tens of petabytes of data per year, requiring sophisticated data science methods to analyse. In biology / medicine, the human genome is so large that many discoveries also rely on data science. And businesses are clamouring to take advantage of their “big data” through the use of data science methods (Barth, Earley, Lawson, & Hall, 2013).

Yet the field of education is only starting to catch up with where the natural sciences and private industry have already gone. For example, the International Educational Data Mining Society was only founded in 2011, and the Journal of Educational Data Mining has only published 10 editions to date.

The lack of data science methods being applied to education and other social sciences is probably due both to a lack of large-scale data sets for which such techniques are necessary, and to the ethical concerns stakeholders may have about data mining being used on databases of personal information.

But this lack of research using data science methodologies provides an opportunity for the proposed research to make a new and significant contribution to the field of comparative education. And because this research plans to use “crowd-sourced science” to help gather existing aggregated data, it will be able to overcome some of the common hurdles to using data mining techniques.

The research will have two major stages: first, to build a database of countries that contains a wide variety of attributes about each one; and second, to compare each attribute with every other attribute across the countries to determine whether a correlation likely exists.

The first stage will take advantage of the many international databases that already exist, including databases from the United Nations, the OECD, and the myriad of Global Performance Indicators (GPIs) that various organizations have developed and published. These will all be connected into a single database that will be a Compendium of Countries.

But the amount of time required to do this is prohibitive for a single researcher working alone. So the research will employ students to gather existing data and “wrangle” it so that it fits properly into the Compendium of Countries. Utilizing students for this project will also provide “more effect for the effort”, because it will help the students learn about data science methodologies and hopefully become more interested in science in general.

The second stage of the project will be to have software automatically test statistical models across each attribute of the nations in the Compendium of Countries. This will produce a correlation coefficient for each type of model and each set of attributes. If this correlation coefficient is greater than a particular threshold that has been determined to signify that the relationship between the sets of attributes might be “interesting”, then it will be flagged for further investigation.
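The flagging step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the software planned for the research: it assumes a single statistic (Kendall's τ-b) rather than several model types, treats a strong negative association as equally “interesting”, and uses an arbitrary threshold of 0.5. The function names, attribute names, and numbers are all hypothetical and synthetic.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b rank correlation, computed pair by pair (O(n^2))."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied on both variables: ignored by tau-b
        elif dx == 0:
            tied_x += 1           # tied on x only
        elif dy == 0:
            tied_y += 1           # tied on y only
        elif dx * dy > 0:
            concordant += 1       # both variables move in the same direction
        else:
            discordant += 1       # the variables move in opposite directions
    untied = concordant + discordant
    denom = sqrt((untied + tied_x) * (untied + tied_y))
    return (concordant - discordant) / denom if denom else 0.0


def flag_pairs(table, threshold=0.5):
    """Return (attr_a, attr_b, tau) for every attribute pair with |tau| >= threshold.

    `table` maps an attribute name to a list of values, one per country,
    in the same country order; None marks a missing value.
    """
    flagged = []
    for a, b in combinations(sorted(table), 2):
        # Keep only the countries that have a value for both attributes.
        rows = [(x, y) for x, y in zip(table[a], table[b])
                if x is not None and y is not None]
        if len(rows) < 3:
            continue              # too few countries to say anything
        tau = kendall_tau_b(*zip(*rows))
        if abs(tau) >= threshold:
            flagged.append((a, b, tau))
    return flagged


# Synthetic toy data for five imaginary countries (hypothetical attributes).
toy = {
    "attr_a": [1, 2, 3, 4, 5],
    "attr_b": [2, 4, 5, 8, 9],
    "attr_c": [3, 1, 4, None, 2],
}
flagged = flag_pairs(toy)
for a, b, tau in flagged:
    print(f"{a} ~ {b}: tau = {tau:.2f}")
```

With many attributes the number of pairs grows quadratically, so in practice the threshold would probably need to account for the number of comparisons being made.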

The results of the research will be the set of relationships that have been flagged, along with some of the additional analysis that follows, such as data visualizations and considerations about the potential for one attribute to influence another. The possibility of “causation” (influence) will only be considered when one attribute clearly precedes another in time, and when this pattern of correlation occurs repeatedly over a time series.

As a proof of concept, preliminary research has been conducted using data from The World Factbook, analysed in Microsoft Excel. This preliminary work has been invaluable, as it has shown that using traditional linear regression methods to find a correlation coefficient on the data (Pearson’s r) or on the rank data (Spearman’s rho) will often lead to Type I errors due to overfitting caused by the underlying data not having a Gaussian distribution. For example, the coefficient of determination (r²) between the number of airports and the number of Internet hosts in a country is 0.89, but upon investigation, this high value was mainly due to the fact that both variables have a heavily skewed distribution that is closer to a power law (or similar) probability density function. Using Kendall’s rank correlation method instead, the τ for these two variables was found to be 0.36, which, while still showing some potential correlation, is dramatically weaker than the coefficient of determination would make it appear.
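The effect described above can be reproduced on a small synthetic data set. The sketch below is only an illustration with made-up numbers, not the World Factbook data: nine “countries” have essentially unrelated small values, and one has very large values on both variables, mimicking a heavy-tailed distribution. The function names are hypothetical, and the simple τ-a form of Kendall's statistic is used because the toy data contain no ties.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs (assumes no ties)."""
    pairs = list(combinations(zip(xs, ys), 2))
    score = 0
    for (x1, y1), (x2, y2) in pairs:
        product = (x1 - x2) * (y1 - y2)
        score += (product > 0) - (product < 0)   # +1 concordant, -1 discordant
    return score / len(pairs)


# Synthetic values: the last "country" is an extreme outlier on both variables.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
y = [5, 3, 8, 1, 9, 2, 7, 4, 6, 2000]

r_squared = pearson_r(x, y) ** 2
tau = kendall_tau_a(x, y)
print(f"r^2 = {r_squared:.3f}, tau = {tau:.3f}")
```

Here the single extreme point pulls r² close to 1, while τ, which only looks at the ordering of each pair of countries, stays low.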

But the preliminary research has also already shown some potential educational correlations that might be of interest. For instance, a relatively strong linear correlation was found between expenditure on education per capita and expenditure on healthcare per capita across countries (τ = 0.800 and r² = 0.796). In this particular case, it is not likely that one of the variables is causing any change in the other, but rather that as a country becomes wealthier (which can partly be seen in GDP per capita), its residents invest more in both education and healthcare at a relatively consistent ratio.

But even when causation is not likely, the knowledge gained is valuable and can lead to finding sources of causation. This is why the proposed research is important: it can open the door for many other studies to dive into what it initially discovers. And these additional studies may be able to fuel improvements for developing nations.


Written by Jacob Walker

February 20th, 2016 at 6:12 pm

One Response to 'The Introduction/Background to my Revised Doctoral Research Proposal'


  1. May I suggest you get a graduate degree in Law, Business (MBA), or even Public Administration (MPA) instead. A PHD in Education can label you as being a bit weird. Like being the guy who can’t do, but who can teach. It’s that old line from Woody Allen that has an element of truth to it. At least you don’t want get a PHD in Gym Teaching.

    Here’s a link for you: https://www.youtube.com/watch?v=N5uKUg7FLq4

    Hans Christian Hemmingstein

    18 Jun 16 at 5:39 pm
