Data Mining FTW
Today I want to take a few minutes to talk about mining. No, I don’t mean that rock business with the explosions and the funny little rail carts. I’m talking about data mining, which does not involve any explosions, though I am sure some analysts out there would be very happy to try using them. In the data science and analytics world, data mining is pretty much the workhorse for a lot of business needs. The term is a large umbrella for a group of methodologies that help garner insights from sets of data, often with some business decision driving the intent.
One of the most famous (although mostly false) stories in the data mining world has to do with the unlikely combination of beer and diapers. If you have spent any time in the world of data science or analytics, you’ve heard this one. The story goes that a large supermarket chain began to analyze its customer demographic and point-of-sale data to uncover what items were being bought, when, and by whom. Many of the items made sense when bought together, like bread and deli meat or a mop and bucket. One combination, however, seemed to be very out of the ordinary. On Friday evenings, young males had a strong proclivity for buying diapers and beer in the same transaction. The idea was that young men who were used to going out to “raise Cain”, as my grandmother used to say, could not do so with a young baby at home and thus resorted to buying beer from the store when they went to pick up their weekly dose of diapers. Now, as I said, most of this story is hogwash, except that in 1992 a group did find a strong correlation between purchases of diapers and beer on Thursdays and Fridays in a Midwest drug store chain, though there was no information about the timing or demographics of the purchasers.
Even though the story is not really true, the point here is that data mining is not always about making hypotheses based on assumptions and then testing those hypotheses with data analysis. Data mining is also about finding correlations, connections and interesting trends that may not be apparent to the people asking to know more. Data mining techniques are often built with few or no assumptions and can bring to light many useful things that people would otherwise have no idea about.
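To make the "finding connections nobody asked about" idea a little more concrete, here is a minimal sketch of one common way to score item pairs: lift, which compares how often two items show up together against how often they would if they were unrelated. The baskets are made up for illustration, and the function name `pair_lift` is my own, not something from the beer-and-diapers study.

```python
from collections import Counter
from itertools import combinations


def pair_lift(transactions):
    """Return {(item_a, item_b): lift} for every pair that co-occurs at least once."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates inside one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    lifts = {}
    for (a, b), together in pair_counts.items():
        # lift = P(a and b) / (P(a) * P(b)); a value above 1 means the pair
        # appears together more often than independence would predict
        lifts[(a, b)] = (together / n) / ((item_counts[a] / n) * (item_counts[b] / n))
    return lifts


# Invented baskets, purely for illustration
baskets = [
    ["beer", "diapers"],
    ["beer", "diapers", "chips"],
    ["bread", "deli meat"],
    ["bread", "deli meat", "chips"],
    ["mop", "bucket"],
    ["beer", "chips"],
    ["diapers", "beer"],
    ["bread", "milk"],
]

lifts = pair_lift(baskets)

# Show the five highest-scoring pairs
for pair, lift in sorted(lifts.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(lift, 2))
```

With these invented baskets, beer and diapers come out with a lift of 2.0, meaning they land in the same basket twice as often as chance alone would suggest. Notice that the mop and bucket score even higher off a single basket, which is why real analyses usually also require a minimum number of co-occurrences before trusting a pair.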
One of the most interesting emerging areas where data mining, and data science in general, is being applied is the world of healthcare. There are a number of conditions that are well known to occur in the same person, either because they are caused by the same things or because one condition causes another. These trends have been tracked and recorded over hundreds of years and are a big part of why doctors have to spend so much time learning and studying before they are set free on the world to practice. These trends are important and help doctors and other healthcare providers watch for warning signs or proactively treat conditions that are likely to appear.
What if there are correlations that doctors can’t see because they aren’t looking for them? Say, for instance, the result of a blood test comes back normal. The doctor is unlikely to be worried about the result, and rightly so. However, what if that test result that appears normal is actually a warning sign for another condition when it is taken in context with all of the other medical parameters of the patient? Wouldn’t that be a good thing to know? Of course it would be. And that is exactly what a number of researchers are attempting to find out.
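Here is a rough sketch of how a result can look normal on its own and still be odd in context. It uses the Mahalanobis distance, which is just one simple technique I picked for illustration and not necessarily what any of those researchers use. The two lab markers and all of the numbers are invented: in the reference data the markers rise and fall together, while the patient has one a bit high and the other a bit low.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_cov(xs, ys):
    """Sample covariance of two equal-length lists (the variance when xs is ys)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def mahalanobis_2d(point, xs, ys):
    """Distance of `point` from the centre of (xs, ys), accounting for their correlation."""
    sxx, syy, sxy = sample_cov(xs, xs), sample_cov(ys, ys), sample_cov(xs, ys)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean(xs), point[1] - mean(ys)
    # d^T * inverse(covariance) * d, written out by hand for the 2x2 case
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)


# Hypothetical reference population: two markers that normally move together
marker_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
marker_b = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9, 7.2, 7.8, 9.1, 9.9]

# Hypothetical patient: marker A somewhat high, marker B somewhat low
patient = (8, 3)

# Each value judged in isolation, as a z-score against its own marker
z_a = (patient[0] - mean(marker_a)) / math.sqrt(sample_cov(marker_a, marker_a))
z_b = (patient[1] - mean(marker_b)) / math.sqrt(sample_cov(marker_b, marker_b))

# Both values judged together
distance = mahalanobis_2d(patient, marker_a, marker_b)

print("z-score A:", round(z_a, 2))
print("z-score B:", round(z_b, 2))
print("Mahalanobis distance:", round(distance, 1))
```

Each of the patient’s values sits less than one standard deviation from its own average, so neither would raise an eyebrow alone. The combined distance, though, is enormous, because nobody in the reference data has a high A together with a low B. Real medical data has far more than two parameters and far messier relationships, so treat this as a toy.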
Now if this test did raise a red flag and you were given a proactive treatment to correct the problem, and the condition never surfaced or was much less severe when it did surface, that is a huge deal. Not only for you, because you know you’re still healthy or healthier than you would have been, but for your bank account. On the much larger scale it’s not just your money involved here, it’s your insurance company or the government or whoever is helping to pay your medical costs. Now if they are saving money, that means you are also saving even more money, because these sorts of things tend to “trickle down.”
As we apply data mining on a broad scale across the healthcare domain, more and more occurrences like this will be detected and leveraged in order to find savings and improve the overall health of the population. This will be vitally important in a world where the outcomes of patients’ treatment are the basis for measuring the quality of our providers. If a provider can preemptively stop something from happening at a much lower cost than would be required after the fact, then they have saved money, improved their quality metrics, and had a very positive outcome for their patient. That, for those keeping score, is what I like to call a win-win-win.
Data mining has come a long way in recent years, and the staggering amount of information that is collected every minute of every day about people all over the world is enough to really make you stop and think. More robust, and frankly awesome, techniques are being developed all the time, and each one might have an impact on your health in the future. The world of data science is always growing and evolving and is quite interesting (at least for me) to watch. I’m still waiting for the explosions though. Pretty sure that one is a long time coming.
This is a great article, Kevin! Your point that data mining is also about connections and interesting trends, and not just hypotheses, should be made more often. Topic modelling techniques (like LSA and LDA) have so many applications in the data science field. I would like to see how they can be used in the healthcare field to find useful information!