## Archive for the ‘Data Science’ Category

## Why I am choosing to use SageMathCloud as the platform for my Doctoral Research

In the very first draft of my doctoral research proposal, I planned on using Excel to do the data mining involved in my research. But it quickly became apparent that Excel would be a poor tool for the job: it is inefficient for large analyses, does not scale well, is not multi-user, and can become unstable with large data sets. So I started to look for a good alternative, and SageMathCloud seems to be a nearly perfect answer.

## My Re-Discovery of SageMath

I recently rediscovered SageMath (formerly just called “Sage”), whose mission is to create a viable free open-source alternative to Magma, Maple, Mathematica, and Matlab. And I think there is a good chance I’m going to be doing a lot more with it, because it is pretty cool.

## Humps and Tails: A Call for a New Paradigm

Anyone who has taken a basic statistics course* has learned about finding “central tendencies” by looking at the average (mean), the median, and how much things vary from those points (the variance and standard deviation). This works great when the data has a “hump”, like the “normal” (Gaussian) curve, which generally arises from the addition of random events. But I am coming to believe that the paradigm of treating things as Gaussian, so much so that we call it “normal”, is preventing us from seeing how common humpless “long tail” distributions are. It has also pushed mathematicians into contortions (like this picture I found on the Internet) when they try to apply formulas and concepts from Gaussian distributions to them. Maybe we need new and different measures when we see a long-tail distribution, as the “mean”, the “median”, and traditional definitions of standard deviation from the mean might be *mean*ingless.
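To make that concrete, here is a small illustrative sketch (the distributions are my own toy choices, not real data): for a Gaussian sample the mean and median nearly agree, but for a long-tailed (Pareto) sample the tail drags the mean far away from the median, so the mean describes almost nobody.

```python
import random
import statistics

random.seed(42)

# Gaussian ("hump") sample: mean and median land almost on top of each other
gaussian = [random.gauss(100, 15) for _ in range(10000)]

# Pareto ("long tail") sample: a handful of huge values drag the mean
# well above the median, even though most observations are small
pareto = [random.paretovariate(1.5) for _ in range(10000)]

print(statistics.mean(gaussian), statistics.median(gaussian))  # close together
print(statistics.mean(pareto), statistics.median(pareto))      # far apart
```

The punchline is that “typical value” stops being one number: for the Pareto sample you have to say *which* typical value you mean.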

* – Or even those who have learned a bit about stats in other math classes. For example, one good thing about the Common Core Math Standards is a much greater inclusion of statistics.

## Some Initial Thoughts on using Least Median of Absolute Deviation for my Data Mining to Reduce Problems with Outliers

A while back I wrote a “cry for help” on this blog about some different forms of linear regression. Given that it is a fairly deep topic in statistics, and most of my friends and colleagues are not uber statistics nerds, I didn’t really get any replies. But I have persevered and continued to dive in on my own, because, as Khan Academy puts it, struggling with ideas builds the brain the way lifting weights builds muscle.
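As a taste of why median-based measures resist outliers, here is a sketch (with made-up numbers) comparing the ordinary standard deviation to the median absolute deviation (MAD), the simpler cousin of the least-median idea:

```python
import statistics

data = [10, 11, 9, 10, 12, 11, 10, 500]  # one wild outlier

# The standard deviation is blown up by the single outlier,
# because squaring amplifies large deviations from the mean
sd = statistics.stdev(data)

# MAD: measure deviations from the median, then summarize them
# with the median, so one outlier barely registers
med = statistics.median(data)
mad = statistics.median(abs(x - med) for x in data)

print(sd)   # dominated by the outlier
print(mad)  # still describes the bulk of the data
```

Squaring deviations gives outliers a megaphone; taking medians of deviations takes it away, which is the intuition I am chasing for my data mining.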

## A call for help about understanding Ordinary Least Squares (OLS) vs. Orthogonal Distance Regression (ODR) vs. Robust Regression

Just when I think I have my underlying mathematical knowledge sufficiently wrapped up to start writing Python code for my doctoral research, I find that there are new questions… In the current case, I was originally going to determine the strength of a linear or non-linear correlation using Ordinary Least Squares, which is usually what underlies the Coefficient of Determination. But when I started to look at regression functions in SciPy, I ran across Orthogonal Distance Regression (ODR), and when I tried to research ODR further, I ran across the concept of robust regression. Now I’m trying to understand both of these concepts better, and I could use some help from someone who really understands this stuff and can explain it in a more conceptual manner, so I can determine which statistical method is most appropriate for my research. Here is what I believe I understand so far:
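For the straight-line case at least, I can sketch my current understanding of the first distinction in plain Python (no SciPy): OLS minimizes *vertical* distances and assumes all the error is in y, while orthogonal regression minimizes *perpendicular* distances, allowing error in x as well. The closed-form slope below is the standard two-variable total least squares formula, which is the linear special case of what `scipy.odr` does in general; this is my own illustration, not code from either library.

```python
import math

def ols_slope(x, y):
    """OLS: minimize vertical distances; x is assumed error-free."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def tls_slope(x, y):
    """Total least squares (linear orthogonal regression):
    minimize perpendicular distances; both x and y may have error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

# On noise-free data the two agree exactly; they diverge once x is noisy,
# with OLS biasing the slope toward zero (attenuation)
x, y = [1, 2, 3, 4], [2, 4, 6, 8]
print(ols_slope(x, y), tls_slope(x, y))  # both 2.0 on this perfect line
```

Robust regression is the orthogonal question of *which* distances get down-weighted (outliers), rather than *which direction* they are measured in — which is, I think, why the two ideas kept blurring together for me.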

## A “TED Talk” Explanation of my Doctoral Research

As I work to re-enter a doctoral program with UNISA, I have realized that I haven’t yet given a quick, cohesive explanation of why the research I am doing is so important to me. So here is my attempt to explain why I believe what I’m working on will make a significant contribution to the field of education and, beyond that, why it could be something that truly changes the world, and how others can get involved. Who knows, maybe some day this will become a real TED Talk 🙂

## Thought of the Day: Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.

As part of my data science self-study, I was reading Flaws and Fallacies in Statistical Thinking and ran across this quote attributed to H.G. Wells: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” Since I know many quotes (even those in textbooks) are at least partially apocryphal, I searched and found the original quote to be:

## Thought of the Day: “All Models are Wrong, but some are Useful” – George E. P. Box

While this was written in the 1987 book Empirical Model-Building and Response Surfaces (and discovered by me today in the Economist Espresso app), which focused on statistical model building, its pragmatic message holds much broader truth: the value of a model is how well it can be used in context.

## Python Script to Automate Refreshing an Excel Spreadsheet

Often I run into situations where it makes sense to analyze a lot of database data in an Excel spreadsheet, but due to the amount of processing the spreadsheet requires when updating, a “Refresh All” takes a long time to complete.

One solution to this problem is to automate the spreadsheet so it refreshes every night. The following is a small Python script that can do this using the Python for Windows Extensions (pywin32):
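A minimal sketch of such a script, driving Excel over COM via pywin32’s `win32com.client` (the workbook path and function name here are illustrative; schedule the script nightly with Windows Task Scheduler):

```python
import sys


def refresh_workbook(path):
    """Open a workbook, refresh all of its data connections, save, and quit.
    Requires Windows, Excel, and the pywin32 package."""
    import win32com.client  # pywin32; imported lazily since it is Windows-only

    excel = win32com.client.DispatchEx("Excel.Application")
    excel.DisplayAlerts = False  # don't hang on confirmation dialogs
    try:
        wb = excel.Workbooks.Open(path)
        wb.RefreshAll()  # kick off every data connection in the workbook
        excel.CalculateUntilAsyncQueriesDone()  # block until refreshes finish
        wb.Save()
        wb.Close()
    finally:
        excel.Quit()  # always release the Excel process


if __name__ == "__main__" and sys.platform == "win32":
    refresh_workbook(sys.argv[1])  # e.g. C:\reports\analysis.xlsx
```

The `CalculateUntilAsyncQueriesDone` call matters: without it, the script can save and quit before Excel’s asynchronous queries have actually finished refreshing.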

## My #EdDataSci Self-Study for 2014-2015

With me mostly dropping the #DALMOOC course, I have been thinking about how I’m going to continue studying educational data science on my own. I can see I need to shore up my mathematical and statistical knowledge while applying it to data science topics.

Since one of my major life strategies is to get “more effect for my effort” (more bang for the buck), I will be putting together a LearningCounts portfolio as I do my self-study, so that I can earn additional college math credits. Also, I have been finalizing my concept for my doctoral research with UNISA in the field of Comparative Education, and data science / data mining methods will form the major portion of my research. (I will share more about this in a future post.)

So the following are the courses that I plan to self-study so that I can meet their learning objectives: