The Stream of Scientific Revolutions
Posted by cadsmith on September 20, 2009
Due to social and sensor networks, it is estimated that data volume is doubling every 9 to 12 months. Analysis is required in real time to derive knowledge from distributed databases. Awareness is improved by adding sources, e.g. the internet of things. The term data mining, reportedly coined by Robert Hecht-Nielsen a couple of decades ago, denotes automated fact-finding, knowledge discovery, rule inference and prediction activities. The field follows predecessors such as machine learning and statistics, the latter originally named for the study of state demographics and economics. The first knowledge discovery in databases workshop, KDD, was held in 1989, and ACM later dedicated a special interest group, SIGKDD, to knowledge discovery and data mining.
In classic science, a hypothesis is often disproved by experiment, whereas in this case, tests yield a data-driven hypothesis. Patterns of interest are those that are useful or novel, though most patterns found are neither. More recently, the field picked up steam as analysis times for huge databases became excessive, disparate sources needed to be connected quickly, and dimensionality, the number of attributes, expanded. Mining assigns meaning to data, which leads to knowledge, which is in turn communicated as information, assuming that errors are avoided or corrected. The result is better visualization and built-in database intelligence.
Government and security have been major proponents, e.g. for profiling. Other applications include biomed, insurance, physics, business intelligence, CRM, information retrieval, online analytical processing (OLAP), text mining and analysis, finding experts, sports stats, and digital libraries. Besides software packages, tools include modeling techniques such as decision trees and neural networks. Models may be verified by splitting the data into two parts and checking that the results on both parts are equivalent.
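The split check mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, the model is a plain least-squares line, and the helper names (fit_line, mean_abs_error, split_check) are made up for this post rather than taken from any of the tools listed here.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, points):
    """Average absolute gap between predicted and observed y."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

def split_check(points, seed=0):
    """Shuffle, split in half, fit on one half, score on both halves."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, holdout = shuffled[:half], shuffled[half:]
    model = fit_line(train)
    return mean_abs_error(model, train), mean_abs_error(model, holdout)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
noise_rng = random.Random(42)
data = [(x, 2 * x + 1 + noise_rng.gauss(0, 1)) for x in range(200)]

train_err, holdout_err = split_check(data)
print("error on fitted half:", round(train_err, 3))
print("error on held-out half:", round(holdout_err, 3))
```

If the two errors are close, the model is probably describing the data rather than memorizing one half of it. A much larger error on the held-out half would suggest overfitting.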
Major tasks have been outlined as:
- classification, sequence detection, genetic algorithms, nearest neighbor, naive Bayes classifiers, logistic regression and discriminant analysis;
- affinity analysis, market basket, association analysis, rule learning, rough sets, and sequence detection;
- prediction, regression, and time series analysis and forecasting;
- segmentation, cluster analysis, and Kohonen networks (self-organizing maps).
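As one example from the list, market basket analysis looks for items that tend to be bought together. The sketch below counts item pairs and reports support (share of baskets containing both items) and confidence (share of baskets with the first item that also contain the second). The baskets, the pair_rules name and the 0.3 support cutoff are assumptions chosen for illustration; real tools handle larger item sets and use more efficient algorithms.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # drop duplicates, fix pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be of interest
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

# Toy baskets, invented for this example.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]

rules = pair_rules(baskets)
for a, b, support, confidence in sorted(rules, key=lambda r: -r[3]):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket with butter also contains bread, so the rule butter -> bread has confidence 1.0, while bread -> butter is weaker because bread also appears without butter.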
A couple of the popular standards are CRISP-DM, the Cross-Industry Standard Process for Data Mining, and PMML, the Predictive Model Markup Language.
There is plenty of software, such as R, SAS with its SEMMA method (sample, explore, modify, model, assess), SPSS, Netbase, Statistica, the open source Labkey, the Rattle GNOME GUI for R, GNU Octave, Weka 3, Apache Hadoop, Datalogic/R, and the Mozenda scraper. IBM, Oracle and Microsoft have offerings.
Other than usability, system integration and projections from prior knowledge, issues commonly revolve around privacy and performance. Congress has discussed consumer protections, though users are tracked from an increasing number of government social-network sites, and cloud security standards are still in development. Data may be missing. Patterns may not be understandable. Noisy data can result in spurious patterns, though source correction is improving. Relationships between fields may be more complex than assumed.
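The point about spurious patterns can be seen with nothing but random numbers. The sketch below, which assumes 20 records and 200 fields of pure Gaussian noise (both figures picked arbitrarily for illustration), searches for the noise field that correlates best with an equally random target. The function names are made up for this post.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def strongest_noise_correlation(rows=20, fields=200, seed=1):
    """Largest |correlation| between a random target and random fields."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(rows)]
    best = 0.0
    for _ in range(fields):
        noise = [rng.gauss(0, 1) for _ in range(rows)]
        best = max(best, abs(pearson(target, noise)))
    return best

best = strongest_noise_correlation()
print("strongest correlation found in pure noise:", round(best, 2))
```

With few records and many fields, some field will usually look related to the target by chance alone, which is one reason a discovered pattern is worth rechecking on data that was not used to find it.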
KDnuggets is a general resource site.
Also see bookmarks for media links.