In-Database Analytics Moves from Early Adopters to Mainstream Use

By Brett Sheppard

In-database analytics is helping to make advanced analytics of Big Data more affordable, scalable and faster to deploy. 2010 is becoming the year for in-database analytics to transition from early adopters in financial services, insurance and government to more extensive use in a variety of industries. MapReduce has become a popular extension to relational and column databases. SAS and Teradata are showing signs of progress with their in-database analytics collaboration; outdoor clothing and accessories supplier Cabela's provides a good case study. SAS competitors IBM/SPSS and KXEN as well as specialist firms such as Fuzzy Logix offer advanced analytics programs and tools that can run in-database.

Pushing complex analytics computations closer to the data shortens processing times and reduces the server and network infrastructure that organizations need to move data between storage and memory. To accomplish this, an increasing number of enterprises and public sector organizations are executing predictive analysis, data mining, and other computation-intensive Big Data applications within their data warehouses.
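To make the idea concrete, here is a minimal sketch that contrasts pulling every row out of the database with letting the database do the aggregation. It uses Python's built-in sqlite3 module purely as a stand-in for a data warehouse; the `sales` table, its columns, and both function names are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse (an assumption
# made for illustration; a real deployment would use an MPP platform).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 200.0), ("west", 25.0)],
)


def totals_client_side(connection):
    """Move every row to the client, then aggregate in application code."""
    totals = {}
    for region, amount in connection.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_in_database(connection):
    """Let the database aggregate; only one row per region leaves it."""
    rows = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    return dict(rows)


print(totals_client_side(conn))
print(totals_in_database(conn))
```

Both functions return the same totals. The difference is how much data crosses the boundary between the database and the application: every row in the first case, one row per region in the second.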

According to Forrester senior analyst James Kobielus and his co-authors, “In-database analytics is not a bleeding-edge practice.” (Forrester, “In-Database Analytics: The Heart Of The Predictive Enterprise”, November 12, 2009). Instead, it represents the latest generation of database management system (DBMS) approaches that have included stored procedures, user-defined functions, and custom database extensions. According to Forrester, benefits of in-database analytics extend beyond data mining and predictive analytics applications: “In-database analytics is a key acceleration approach for other EDW and BI functions such as data discovery, extraction, collection, correlation, profiling, cleansing, extraction, transformation, loading, joining, filtering, consolidating, and aggregating.”

Many data warehouse platform suppliers support in-database analytics capabilities through integration with SQL-MapReduce. These include Aster Data, Greenplum, Netezza, Sybase, and Vertica. (Disclosure: Aster Data and Greenplum are Big Data News sponsors.) Depending on the platform, business analysts and statisticians who may not be experts in writing complex SQL can work in R, S-Plus, Predictive Model Markup Language (PMML), Eclipse-based integrated development environments (IDEs), Hadoop, service-oriented architectures (SOA), or other interfaces and programming languages to execute in-database analytics.

As Jim Kobielus and his colleagues note in the previously referenced Forrester report, one advantage of using MapReduce and Hadoop is that it does not lock application developers into a single DBMS. Developers can build analytics applications in a platform-neutral MapReduce/Hadoop framework and avoid the time-consuming task of re-implementing stored procedures or UDFs across more than one database system.
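For readers new to the programming model, the following is a toy, single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop's API or any vendor's SQL-MapReduce interface; every function name and the sample `orders` data are assumptions made for illustration. What it shows is that the analytics logic lives in a mapper and a reducer that know nothing about the underlying DBMS.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_mapreduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical job: total spend per customer.
def spend_mapper(record):
    customer, amount = record
    yield customer, amount


def spend_reducer(customer, amounts):
    return sum(amounts)


orders = [("alice", 30), ("bob", 10), ("alice", 5)]
result = run_mapreduce(orders, spend_mapper, spend_reducer)
print(result)
```

In a real framework the map and reduce phases run in parallel across many nodes. The mapper and reducer stay the same, which is what makes the approach portable.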

MapReduce has caught the attention of the general-purpose database market leaders too. While Microsoft has been skeptical of adopting MapReduce into its commercial software, Microsoft SQL Server 2008 R2 Parallel Data Warehouse includes some initial “MapReduce-like” features, based on the DATAllegro acquisition. (MapReduce advocates, I'm sure, would alter “MapReduce-like” to “MapReduce-light” or something to that effect.) According to Microsoft, that product is designed in part as the hub of a hub-and-spoke EDW.

IBM Software Group's Emerging Technologies division is using Hadoop to enable ad-hoc data integration via web mashups in an offering called BigSheets. (I believe this is the same initiative that IBM previously called M2, for “Massive Mashups”). IBM, though, has been slower than high-end DBMS-specific vendors in delivering or announcing a public roadmap for MapReduce support within DB2 or IBM’s Informix OLTP database.

Organizations using Oracle can integrate with MapReduce through Parallel Pipelined Table Functions. Oracle Senior Principal Product Manager Jean-Pierre Dijcks has a good blog post that explains how. However, following a single-vendor integrated stack strategy, Oracle’s primary in-database value proposition focuses on three areas: in-database mining in the Oracle Database kernel, together with data-preparation tools within the Oracle DBMS; the ability to analyze large volumes of text and other unstructured information, including as feeds to Oracle enterprise applications; and support for statistical algorithms and variable selection techniques.

In addition to supporting MapReduce, Sybase has ported the Fuzzy Logix DB Lytix library of statistical and predictive analytics algorithms to Sybase IQ using an in-database analytics API (application programming interface). DB Lytix offers the ability to perform advanced in-database analytics through simple SELECT and EXECUTE statements.
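The general pattern of exposing an analytics routine as a function callable from a SELECT statement can be sketched with a user-defined aggregate. The example below uses Python's built-in sqlite3 module, not Sybase IQ or the DB Lytix API; the `sample_stddev` function, the `StdDev` class, and the `readings` table are hypothetical names introduced only for illustration.

```python
import math
import sqlite3


class StdDev:
    """User-defined aggregate: sample standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def step(self, value):
        # Called once per row by the database engine; NULLs are skipped.
        if value is None:
            return
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def finalize(self):
        # Called once per group; undefined for fewer than two values.
        if self.n < 2:
            return None
        return math.sqrt(self.m2 / (self.n - 1))


udf_conn = sqlite3.connect(":memory:")
# Register the aggregate so SQL can call it like a built-in function.
udf_conn.create_aggregate("sample_stddev", 1, StdDev)

udf_conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
udf_conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", v) for v in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)],
)

# The analyst only writes a SELECT; the computation runs next to the data.
query = "SELECT sensor, sample_stddev(value) FROM readings GROUP BY sensor"
for sensor, stddev in udf_conn.execute(query):
    print(sensor, stddev)
```

Once the function is registered, an analyst who knows basic SQL can apply it without writing the algorithm, which is the convenience the SELECT-based interface is meant to provide.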

Netezza works with Fuzzy Logix to provide quantitative models and web-based solutions for Internet marketing optimization, behavioral segmentation, predictive analytics, inventory optimization and other solution applications.
