I loved this article by Jeff Leek on how data analysis has moved through distinct eras over time.
- The era of not much data. This is everything prior to about 1995 in my field. The era when we could only collect a few measurements at a time. The whole point of statistics was to try to optimally squeeze information out of a small number of samples - so you see methods like maximum likelihood and minimum variance unbiased estimators being developed.
- The era of lots of measurements on a few samples. This one hit hard in biology with the development of the microarray and the ability to measure thousands of genes simultaneously. This is the same statistical problem as in the previous era but with a lot more noise added. Here you see the development of methods for multiple testing and regularized regression to separate signals from piles of noise.
- The era of a few measurements on lots of samples. This era is overlapping to some extent with the previous one. Large scale collections of data from EMRs and Medicare are examples where you have a huge number of people (samples) but a relatively modest number of variables measured. Here there is a big focus on statistical methods for knowing how to model different parts of the data with hierarchical models and separating signals of varying strength with model calibration.
- The era of all the data on everything. This is an era that currently we as civilians don’t get to participate in. But Facebook, Google, Amazon, the NSA and other organizations have thousands or millions of measurements on hundreds of millions of people. Other than just sheer computing I’m speculating that a lot of the problem is in segmentation (like in era 3) coupled with avoiding crazy overfitting (like in era 2).
Two things interest me here. The first is how this will change the world of analytics through the application of new methodologies such as AI and machine learning. The second is whether this mass of generated data actually represents the population one is trying to draw insights about. There is a lot of potential for bias, given that only certain kinds of people have access to the internet.
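To make the representativeness concern concrete, here is a minimal sketch in Python. Every number in it is invented for illustration: a hypothetical population with two groups, a made-up probability of being online for each, and a made-up average outcome. It only shows how an average computed from online data can drift away from the population average when access differs by group.

```python
# Hypothetical population; all figures are assumptions for illustration only.
# group: (share of population, probability of being online, average outcome)
population = {
    "younger": (0.5, 0.9, 10.0),
    "older": (0.5, 0.3, 20.0),
}


def true_mean(pop):
    """Average outcome over the whole population."""
    return sum(share * outcome for share, _, outcome in pop.values())


def online_only_mean(pop):
    """Average outcome among people who are online (the data we would see)."""
    online_mass = sum(share * p_online for share, p_online, _ in pop.values())
    weighted = sum(
        share * p_online * outcome for share, p_online, outcome in pop.values()
    )
    return weighted / online_mass


print("population mean: ", true_mean(population))
print("online-only mean:", online_only_mean(population))
```

With these made-up numbers the online-only average comes out lower than the population average, because the group that is online more often also has the lower outcome. The direction and size of the gap depend entirely on the assumed access rates.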
So the future looks like an era of combining large offline samples with the large volumes of data being generated online. Industries across the world will need data fusion techniques to turn that mix into meaningful insights and predictive actions.