Summary: The boundaries that have traditionally existed between DBMSs and other data sources are increasingly blurring, and there is a great need for an information integration solution that provides a unified view of all of these services. This article proposes a platform that extends a federated database architecture to support both relational and XML as first-class data models, and tightly integrates content management services, workflow, messaging, analytics, and other enterprise application services.
© 2002 International Business Machines Corporation. All rights reserved.
The explosion of the Internet and e-business in recent years has caused a secondary explosion of information. Industry analysts predict that more data will be generated in the next three years than in all of recorded history [INFO]. Enterprise business applications can respond to this information overload in one of two ways: they can bend and break under the sheer volume and diversity of such data, or they can harness this information and transform it into a valuable asset by which to gain a competitive advantage in the marketplace.
Because the adoption of Internet-based business transaction models has significantly outpaced the development of tools and technologies to deal with the information explosion, many businesses find themselves unintentionally using the former approach. Significant development resources are spent on quick and dirty integration solutions that cobble together different data management systems (databases, content management systems, enterprise application systems) and transform data from one format to another (structured, XML, byte streams). Revenue is lost when applications suffer from scalability and availability problems. New business opportunities are simply overlooked because the critical nuggets of information required to make a business decision are lost among the masses of data being generated.
In this article, we propose a technology platform and tools to harness the information explosion and provide an end-to-end solution for transparently managing both the volume and diversity of data that is in the marketplace today. We call this technology information integration. IBM provides a family of data management products that enable a systematic approach to solve the information integration challenges that businesses face today. Many of these products and technologies are showcased in the Information Integration technology demo.
The foundation of the platform is a state-of-the-art database architecture that seamlessly provides both relational and native XML as first-class data models. We believe that database technology provides the strongest foundation for an information integration platform for three significant reasons:
This paper is organized as follows:
Figure 1 captures the evolution of relational database technology. Relational databases were born out of a need to store, manipulate and manage the integrity of large volumes of data. In the 1960s, network and hierarchical systems such as [CODASYL] and IMS™ were the state-of-the-art technology for automated banking, accounting, and order processing systems enabled by the introduction of commercial mainframe computers. While these systems provided a good basis for the early systems, their basic architecture mixed the physical manipulation of data with its logical manipulation. When the physical location of data changed, such as from one area of a disk to another, applications had to be updated to reference the new location.
A revolutionary paper by Codd in 1970 [CODD] and its commercial implementations changed all that. Codd's relational model introduced the notion of data independence, which separated the physical representation of data from the logical representation presented to applications. Data could be moved from one part of the disk to another or stored in a different format without causing applications to be rewritten. Application developers were freed from the tedious physical details of data manipulation, and could focus instead on the logical manipulation of data in the context of their specific application.
Not only did the relational model ease the burden of application developers, but it also caused a paradigm shift in the data management industry. The separation between what and how data is retrieved provided an architecture by which the new database vendors could improve and innovate their products. [SQL] became the standard language for describing what data should be retrieved. New storage schemes, access strategies, and indexing algorithms were developed to speed up how data was stored and retrieved from disk, and advances in concurrency control, logging, and recovery mechanisms further improved data integrity guarantees [GRAY][LIND] [ARIES]. Cost-based optimization techniques [OPT] completed the transition from databases acting as an abstract data management layer to being high-performance, high-volume query processing engines.
As companies globalized and as their data quickly became distributed among their national and international offices, the boundaries of DBMS technology were tested again. Distributed systems such as [R*] and [TANDEM] showed that the basic DBMS architecture could easily be exploited to manage large volumes of distributed data. Distributed data led to the introduction of new parallel query processing techniques [PARA], demonstrating the scalability of the DBMS as a high-performance, high-volume query processing engine.
The lessons learned in extending the DBMS with distributed and parallel algorithms also led to advances in extensibility, whereby the monolithic DBMS architecture was replumbed with plug-and-play components [STARBURST]. Such an architecture enabled new abstract data types, access strategies and indexing schemes to be easily introduced as new business needs arose. Database vendors later made these hooks publicly available to customers as Oracle data cartridges, Informix® DataBlades®, and DB2® Extenders™.
Throughout the 1980s, the database market matured and companies attempted to standardize on a single database vendor. However, the reality of doing business generally made such a strategy unrealistic. From independent departmental buying decisions to mergers and acquisitions, the scenario of multiple database products and other management systems in a single IT shop became the norm rather than the exception. Businesses sought a way to streamline the administrative and development costs associated with such a heterogeneous environment, and the database industry responded with federation. Federated databases [FED] provided a powerful and flexible means for transparent access to heterogeneous, distributed data sources.
We are now in a new revolutionary period enabled by the Internet and fueled by the e-business explosion. Over the past six years, Java™ and XML have become the vehicles for portable code and portable data. To adapt, database vendors have been able to draw on earlier advances in database extensibility and abstract data types to quickly provide object-relational data models [OR], mechanisms to store and retrieve relational data as XML documents [XTABLES], and XML extensions to SQL [SQLX].
The ease with which complex Internet-based applications can be developed and deployed has dramatically accelerated the pace of automating business processes. The premise of our paper is that the challenge facing businesses today is information integration. Enterprise applications require interaction not only with databases, but also content management systems, data warehouses, workflow systems, and other enterprise applications that have developed on a parallel course with relational databases. In the next section, we illustrate the information integration challenge using a scenario drawn from a real-world problem.
To meet the needs of its high-end customers and manage high-profile accounts, a financial services company would like to develop a system to automate the process of managing, augmenting and distributing research information as quickly as possible. The company subscribes to several commercial research publications that send data in the Research Information Markup Language (RIXML), an XML vocabulary that combines investment research with a standard format to describe the report's meta data [RIXML]. Reports may be delivered via a variety of mechanisms, such as real-time message feeds, e-mail distribution lists, web downloads and CD ROMs.
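The first step in such a system is extracting routing-relevant meta data from each incoming report. The sketch below, using Python's standard XML library, shows the idea against a deliberately simplified, hypothetical RIXML-like envelope; real RIXML documents carry far richer meta data (sectors, ratings, author credentials, and so on), and the element and attribute names here are illustrative only.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical RIXML-like report envelope (not the real schema).
REPORT = """\
<research>
  <product>
    <source publisher="ExampleBank"/>
    <context>
      <issuerDetails name="Acme Corp" ticker="ACME"/>
    </context>
    <content title="Q3 Outlook" language="en"/>
  </product>
</research>"""

def extract_meta(xml_text):
    """Pull routing-relevant meta data out of a report envelope."""
    root = ET.fromstring(xml_text)
    return {
        "publisher": root.find(".//source").get("publisher"),
        "ticker": root.find(".//issuerDetails").get("ticker"),
        "title": root.find(".//content").get("title"),
    }

meta = extract_meta(REPORT)
# Distribution decisions (which accounts, which channels) can be keyed
# off these extracted fields.
```
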
Figure 2 shows how such research information flows through the company.
To build the financial services integration system on today's technology, a company must cobble together a host of management systems and applications that do not naturally coexist with each other. DBMSs, content management systems, data mining packages and workflow systems are commercially available, but the company must develop in-house integration software to glue them together. A database management system can handle the structured data, but XML repositories are just now becoming available on the market. Each time a new data source is added or the information must flow to a new target, the customer's home-grown solution must be extended.
The financial services example above and others like it show that the boundaries that have traditionally existed between DBMSs, content management systems, mid-tier caches, and data warehouses are increasingly blurring, and there is a great need for a platform that provides a unified view of all of these services. We believe that a robust information integration platform must meet the following requirements:
Figure 3 illustrates our proposal for a robust information integration platform.
Figure 3. An information integration platform
As shown in the figure, the data tier is an enhanced high performance federated DBMS. We have already described the evolution of the DBMS as a robust, high-performance and extensible technology for managing structured data. We believe that a foundation based on a DBMS architecture allows us to exploit and extend these key advances to semi-structured and unstructured data.
Storage and retrieval. Data may be stored as structured relational tables, semi-structured XML documents, or in unstructured formats such as byte streams, scanned documents, and so on. Because XML is the lingua franca of enterprise applications, a first-class XML repository that stores and retrieves XML documents in their native format is an integral component of the data tier. This repository is a true native XML store that understands and exploits the XML data model, not just a rehashed relational record manager, index manager, and buffer manager. It can act as a repository for XML documents as well as a staging area to merge and consolidate federated data. In this role, meta data about the XML data is as critical as the XML data itself. This hybrid XML/relational storage and retrieval infrastructure not only ensures high performance and data durability for both data formats, but also provides the 24x7 availability and extensive administrative capabilities expected of enterprise database management systems.
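The hybrid idea can be sketched in miniature: keep each document intact while indexing its extracted meta data relationally. The table layout and use of SQLite below are purely illustrative assumptions, not a description of any product's internals (a true native XML store would not reduce documents to a text column).

```python
import sqlite3
import xml.etree.ElementTree as ET

# Sketch of a hybrid store: documents are kept whole, while extracted
# meta data is indexed relationally for fast parametric lookup.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE reports (
    id INTEGER PRIMARY KEY,
    ticker TEXT,          -- extracted meta data, indexed
    doc TEXT              -- the XML document, stored intact
)""")
db.execute("CREATE INDEX idx_ticker ON reports(ticker)")

def store(document):
    """Shred routing meta data on the way in; keep the document whole."""
    ticker = ET.fromstring(document).get("ticker")
    db.execute("INSERT INTO reports (ticker, doc) VALUES (?, ?)",
               (ticker, document))

store('<report ticker="ACME"><body>Q3 outlook</body></report>')
store('<report ticker="XYZ"><body>Downgrade</body></report>')

# Relational predicate on meta data; the XML document comes back in
# its original form.
(doc,) = db.execute(
    "SELECT doc FROM reports WHERE ticker = 'ACME'").fetchone()
```
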
Federation. In addition to a locally-managed XML and relational data store, the data tier exploits federated database technology with a flexible wrapper architecture to integrate external data sources [WRAP]. The external data sources may be traditional data servers, such as external databases, document management systems, and file systems, or they may be enterprise applications such as CICS® or SAP, or even an instance of a workflow. These sources may in turn serve up structured, semi-structured or unstructured data.
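The wrapper pattern at the heart of federation can be sketched as follows: each source is presented through a uniform interface, the engine pushes predicates down to sources that can evaluate them natively, and compensates for those that cannot. All class and method names below are illustrative assumptions, not the actual wrapper API of [WRAP].

```python
# Two hypothetical wrapped sources with different native capabilities.

class FileWrapper:
    """A flat-file source with no native filtering."""
    def __init__(self, rows):
        self.rows = rows
    def can_filter(self):
        return False
    def scan(self, ticker=None):
        return list(self.rows)          # always a full scan

class DbWrapper:
    """A relational source that can evaluate an equality predicate itself."""
    def __init__(self, rows):
        self.rows = rows                # stands in for a real remote DBMS
    def can_filter(self):
        return True
    def scan(self, ticker=None):
        return [r for r in self.rows
                if ticker is None or r["ticker"] == ticker]

def federated_scan(sources, ticker):
    """Push the predicate down where possible; compensate otherwise."""
    out = []
    for s in sources:
        if s.can_filter():
            out.extend(s.scan(ticker=ticker))                     # pushdown
        else:
            out.extend(r for r in s.scan()
                       if r["ticker"] == ticker)                  # compensation
    return out

hits = federated_scan(
    [FileWrapper([{"ticker": "ACME", "via": "file"}]),
     DbWrapper([{"ticker": "ACME", "via": "db"},
                {"ticker": "XYZ", "via": "db"}])],
    "ACME")
```

The application sees one answer set either way; only the plan differs per source.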
The services tier draws on features from enterprise application integration systems and content management systems, and exploits the enhanced data access capabilities of the data tier to provide embedded application integration services.
Query processing. In addition to providing storage and retrieval services for disparate data, the data tier provides sophisticated query processing and search capabilities. The heart of the data tier is a sophisticated federated query processing engine that is as fluent with XML and object-relational queries as it is with SQL. Queries may be expressed in SQL, SQLX, or XQuery and data may be retrieved as either structured data or XML documents. The federated query engine provides functional compensation to extend full query and analytic capabilities over data sources that do not provide such native operations, and functional extension to enable extended capabilities such as market trend analysis or biological compound similarity search.
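The kind of join such an engine makes routine can be miniaturized as follows: rows shredded from XML documents joined with a local relational table through one SQL query. SQLite and the toy schema stand in for the federated engine here purely for illustration; the real engine would accept the same logical query in SQL, SQLX, or XQuery.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Sketch of a cross-model join: relational holdings joined with fields
# shredded out of incoming XML reports. Schema and documents are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE holdings (account TEXT, ticker TEXT, shares INT)")
db.executemany("INSERT INTO holdings VALUES (?, ?, ?)",
               [("A1", "ACME", 100), ("A2", "XYZ", 50)])

# Stand-in for a view over the native XML store.
db.execute("CREATE TABLE reports (ticker TEXT, title TEXT)")
for document in ['<report ticker="ACME"><title>Q3 Outlook</title></report>']:
    root = ET.fromstring(document)
    db.execute("INSERT INTO reports VALUES (?, ?)",
               (root.get("ticker"), root.findtext("title")))

# One declarative query spans both kinds of data.
rows = db.execute("""SELECT h.account, r.title
                     FROM holdings h
                     JOIN reports  r ON h.ticker = r.ticker""").fetchall()
```
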
In addition to standard query language constructs, native functions that integrate guaranteed message delivery with database triggers [MQDB2] allow notifications to fire automatically based on database events, such as the arrival of a new nugget of information from a real-time data feed.
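The trigger-to-message coupling can be sketched with SQLite's user-defined functions: an insert trigger hands the new row to a `notify` function that, in a real deployment such as [MQDB2], would enqueue a guaranteed-delivery message. Here the "queue" is just a Python list; the names are illustrative.

```python
import sqlite3

# Couple a database event (row arrival) to message delivery.
queue = []  # stand-in for a durable message queue

def notify(ticker):
    """Stand-in for enqueueing a guaranteed-delivery message."""
    queue.append(ticker)

db = sqlite3.connect(":memory:")
db.create_function("notify", 1, notify)
db.execute("CREATE TABLE feed (ticker TEXT, headline TEXT)")
db.execute("""CREATE TRIGGER on_arrival AFTER INSERT ON feed
              BEGIN SELECT notify(NEW.ticker); END""")

# The arrival of a new nugget of information fires the notification
# automatically -- no application polling involved.
db.execute("INSERT INTO feed VALUES ('ACME', 'Earnings beat estimates')")
```
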
Text search and mining. Web crawling and document indexing services are crucial to navigate the sea of information and place it within a context usable for enterprise applications. The services tier exploits the federated view of data provided by the data tier to provide combined parametric and full text search over original and consolidated XML documents and extracted meta data. Unstructured information must be analyzed and categorized to be of use to an enterprise application, and for real-time decisions, the timeliness of the answer is a key component of the quality. The technology platform integrates services such as Intelligent Miner for Text to extract key information from a document and create summaries, categorize data based on predefined taxonomies, and cluster documents based on knowledge that the platform gleans automatically from document content. Built-in scoring capabilities such as Intelligent Miner Scoring integrated into the query language [SQLMM] turn interesting data into actionable data.
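How the two predicate kinds compose can be shown in a few lines. A real system would of course use inverted indexes, ranking, and taxonomies; this in-memory scan over an invented corpus only illustrates parametric and full-text conditions applied together.

```python
# Sketch: combined parametric (meta data) and full-text search.
def search(docs, ticker=None, terms=()):
    hits = []
    for d in docs:
        if ticker is not None and d["ticker"] != ticker:   # parametric predicate
            continue
        body = d["body"].lower()
        if all(t.lower() in body for t in terms):          # full-text predicate
            hits.append(d["id"])
    return hits

corpus = [
    {"id": 1, "ticker": "ACME", "body": "Earnings beat estimates; upgrade."},
    {"id": 2, "ticker": "ACME", "body": "Management change announced."},
    {"id": 3, "ticker": "XYZ",  "body": "Earnings miss; downgrade."},
]
result = search(corpus, ticker="ACME", terms=["earnings"])
```
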
Versioning and meta data management. As business applications increasingly adopt XML as the language for information exchange, vast numbers of XML artifacts, such as XML schema documents, DTDs, Web service description documents, etc., are being generated. These documents are authored and administered by multiple parties in multiple locations, quickly leading to a distributed administration challenge. The services tier includes a WebDAV-compliant XML Registry to easily manage XML document life cycle and meta data in a distributed environment [WebDAV] [XRR]. Features of the registry include versioning, locking, and name space management.
Digital asset management. Integrated digital rights management capabilities and privilege systems are essential for controlling access to the content provided by the data tier. To achieve these goals, the information integration platform draws on a rich set of content management features (such as that provided in IBM Content Manager) to provide integrated services to search, retrieve and rank data in multiple formats such as documents, video, audio, etc., multiple languages, and multi-byte character sets, as well as to control and track access to those digital assets.
Transformation, replication and caching. Built-in replication and caching facilities [CACHE] and parallelism provide transparent data scalability as the enterprise grows. Logic to extract and transform data from one format to another can be built on top of constraints, triggers, full text search, and the object relational features of today's database engines. By leveraging these DBMS features, data transformation operations happen as close to the source of data as possible, minimizing both the movement of data and the code path length between the source and target of the data.
The top tier visible to business applications is the application interface, which consists of both a programming interface and a query language.
Programming interface. A foundation based on a DBMS enables full support of traditional programming interfaces such as ODBC and JDBC, easing migration of legacy applications. Such traditional APIs are synchronous and not well-suited to enterprise integration, which is inherently asynchronous. Data sources come and go, multiple applications publish the same services, and complex data retrieval operations may take extended periods of time. To simplify the inherent complexities introduced by such a diverse and data-rich environment, the platform also provides an interface based on Web services ([WSDL] and [SOAP]). In addition, the platform includes asynchronous data retrieval APIs based on message queues and workflow technology [MQ] [WORKFLOW] to transparently schedule and manage long running data searches.
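The submit-then-fetch shape of such an asynchronous API can be sketched with standard threading primitives. The client gets a handle immediately, a background worker carries out the potentially long-running search, and the result is collected later. Every name below is an illustrative assumption; a real deployment would rest on durable message queues rather than in-process ones.

```python
import queue
import threading

class AsyncClient:
    """Sketch of an asynchronous data retrieval API."""
    def __init__(self):
        self._requests = queue.Queue()
        self._results = {}
        self._done = {}
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, query):
        """Return an opaque handle at once; the search runs in the background."""
        handle = len(self._done)
        self._done[handle] = threading.Event()
        self._requests.put((handle, query))
        return handle

    def _worker(self):
        while True:
            handle, q = self._requests.get()
            self._results[handle] = f"results for: {q}"  # stand-in for a real search
            self._done[handle].set()

    def fetch(self, handle, timeout=5.0):
        """Block (with a bound) until the result for this handle is ready."""
        self._done[handle].wait(timeout)
        return self._results.get(handle)

client = AsyncClient()
h = client.submit("all reports mentioning ACME")
answer = client.fetch(h)
```
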
Query language. As with the programming interface, the integration platform enhances standard query languages available for legacy applications with support for XML-enabled applications. [XQuery] is supported as the query language for applications that prefer an XML data model. [SQLX] is supported as the query language for applications that require a mixed data model as well as legacy OLTP-type applications. Regardless of the query language, all applications have access to the federated content enabled by the data tier. An application may issue an XQuery request to transparently join data from the native XML store, a local relational table, and an external server. A similar query could be issued in SQLX by another (or the same) application.
The explosion of information made available to enterprise applications by the broad-based adoption of Internet standards and technologies has introduced a clear need for an information integration platform to help harness that information and make it available to enterprise applications. The challenges for a robust information integration platform are steep. However, the foundation to build such a platform is already on the market. DBMSs have demonstrated over the years a remarkable ability to manage and harness structured data, to scale with business growth, and to quickly adapt to new requirements. We believe that a federated DBMS enhanced with native XML capabilities and tightly coupled enterprise application services, content management services and analytics is the right technology to provide a robust end-to-end solution.
Mary Roth is a senior engineer and manager in the Database Technology Institute for e-Business at IBM's Silicon Valley Lab. She has over 12 years of experience in database research and development. As a researcher at the Almaden Research Center, she contributed key advances in heterogeneous data integration techniques and federated query optimization and led efforts to implement federated database support in DB2. Mary is leading a team of developers to deliver a key set of components for Xperanto, IBM's information integration initiative for distributed data access and integration.
Dan Wolfson is a Senior Technical Staff Member and manager in the IBM Database Technology Institute for e-Business. With more than 15 years of experience in distributed computing, Dan's interests have ranged broadly across databases, messaging, and transaction systems. Dan is a lead architect for Xperanto, focusing on DB2 integration with WebSphere, MQ Series®, workflow, Web services, and asynchronous client protocols.