|
EVALUATING ETL TOOLS
by William Laurent
A good ETL tool will easily scale up (an d down)--from a departmental solution
to one suitable for the enterprise--to meet the most demanding requirements
of any project, regardless of platform. There are more than 100 tools on the
market that purport to have some ETL function, many of which offer complexity
in the face of scalability. To load, change, reshape , and migrate data to
both star schema and normalized databases, you must rely heavily on both sequential
and parallel processing. Although the point - and - click graphical features
of many tools seem to promise quick and easy implementation, the fact is that
they don't deliver what they promise --you will have to write code to handle
the bulk of your processing. But ETL is about more than transformation and
scrubbing of dat a; a utomation, scheduling, and communications utilities all
have to be provided within your ETL environment. Tools that focus primarily
on straightforward extraction and vanilla transformation/loading processes
should actually be referred to as data migration tools instead of full-function
ETL processor s . You must learn to identify the outstanding qualifications
as well as the caveats of each ETL product you evaluate. Applying the questions
and criterion that follow will help your evaluation team¡¯s objective
analysis.
- Database Connectivity : What databases does this product
connect to? What non - database data sources (mainframe, XML, flat files,
etc.) will this product connect to?
- Update Capabilities : Can the product dynamically insert
flagged data into a table or update it when appropriate?
- Surrogate Key Support : Does the product support the use
and generation of surrogate keys in the warehouse?
- Change Data Capture : Does the product support reading
sources conditionally?
- Scalability and Continuity : Will the scalability of the
tool meet future demands due to system growth? How will it perform under
unusual transactional or volumetric circumstances?
- Conformity, Integration and Independence: How will the
tool behave and operate against existing architectural limitations and requirements?
Does it support the hardware platform s and operating system s currently
used? Will it integrate with the current network and user security protocols?
Is there true platform independence--i.e. the application works the same
on all platforms and there will be no nasty surprises when porting compiled
code from NT to UNIX --the application will behave exactly the same under
two separate platforms and environments?
- Intelligent Queries: Does the product write efficient
queries? Can you leverage the performance tuning techniques of the database
from the product?
- Multi-Source Joins: Can the product join data from multiple
sources in a simple, and more importantly, a quick and efficient manner?
- Aggregate Capabilities : Does the product aid in the creation
of aggregates?
- BI Tool Integration and Modeling Tool Connectivity: How
does the tool integrate with other tools, such as design tools like ERWin
and ER Studio, or reporting tools like Brio, Cognos, etc.?
- Metadata S upport: What types of metadata
is collected? What reporting facilities are available within the tool?
- Concurrency, Parallelization Tool Architecture : Does
the tool take advantage of the efficiencies of the database and hardware
such as parallelism or SMP/MMP ? What are the maximum number of simultaneous
developers and production processes support ed? Are true multithreading or
multi processor jobs possible? Is the threading unlimited or set within the
product? Are multiple jobs required to achieve high degrees of parallelization
or can processing only be done in a serial fashion ?
- Quality Assessment and Management: Does the tool support
data quality measurement on an on-going basis? What valuable functionality
to handle error detection and processing is inherent in the tool? How flexible
is the error condition testing? How are errors logged?
- Logging : What are the logging features that help you
monitor loads and handle error conditions? How robust are the debugging mechanisms
?
- Scheduling and Restartability : Can you schedule your
loading processes (and restarts) using the product, or do you need to purchase
another product to support this functionality --what is built into the tool
and what is extra? What features does it have in place for data backup and
local/distributed disaster recovery? Can the tool recover from a failed job
without manual intervention? How flexible is the tool when executing dependent
or sequential job streams? How well does the product respond to inevitable
failures such as network problems, server crashes , or running out of disk
space?
- Ease of Development : How is the learning curve? How well
does the tool support testing and debugging while in development modes--do
you have to run a whole job all the way through in order to get error results?
Are there transformation templates provided? Can transformations be built
via drag-and-drop? How must the developer tackle variable length files with
the tool? Can developers control the compilation process?
- Price and Vendor Comfort: What is the bottom line price
(including maintenance) of the product? Are you comfortable with the vendor¡¯s
stated price to performance ratio? What do IT advisory services like Gartner,
Forrester, Yankee, and META have to say about the value of the product? Remember
that real cost represents the purchase price and the hidden
support costs associated with each product: It may be necessary to purchase
additional hardware, software, or training to realize the full potential
of the product in the context of your project.
- Administration: How well does the tool handle issues
of versioning and auditing ? What form of documentation does the tool generate?
Will this documentation meet internal documentation standards, both departmental
and enterprise? Does the tool support performance and throughput benchmarking?
What reporting capabilities (per error log, run log, etc.) are provided?
How does the documentation stack up?
- Product Acceptance and Market: What is the global installation
base for the product? Are there viable and conspicuous users groups? Are
there technical and discussion forums on the internet for the toolset/suite
?
- Support: What is the product support and training that
you are being offered from the vendor? Are you comfortable with the availability
(24/7, consulting, offshore vs. onshore) of the support? Does the maintenance
contract seem fair? How often does the vendor upgrade the product? How painful
are the upgrade installations? Will the vendor really work with you to help
you define and meet your ETL objectives?
- Cursors: How well are cursors supported? Do ETL processes
create cursor processing in the underlying native database that make it impossible
to view data updates and DML in real time? ( Many products rely on system
cursor based processing and therefore are not 100% open to tuning or real
time row-by-row analysis of data processing. )
- Versioning and Security: . Can security on database objects,
data stores, and files , as well as transformation scripts be controlled
through the tool? How are versioning and change management functions handled
within the tool? What happens when developers attempt to make changes to
the same routine or transformation object at the same time? Can older incarnations
of various routines be retained for reference?
- Proprietary Lang uages : What programming languages are
supported: VB, C, C++, BASIC, COBOL? Are the built-in tool functions, string
parsing , and macros robust ? Does it have a custom script based modeling
language aimed specifically at data warehousing and optimization of data
movement and cleansing? How long for developers to get up to speed in learning
new languages or methodologies?
The data convergence and warehousing goals of each business will be unique
and different depending on their respective needs . Desires will be diverse
and numerous; therefore, high-level core needs should be identified and agreed
upon up front. Examples:
We must have a tool that runs on {specific operating system}.
We must be able to process {amount} GB of data nightly in a {number}
hour batch window.
We must have connectivity to {type of data source}.
We must be able to distribute our data to {number and locations
of users}.
We must have a tool that supports concurrent development by multiple
sites.
After high-level considerations are enumerated,
you can start to examine and list the specific requirements of your project
and begin to ask pointed questions to your vendors, ensuring that all evaluations
and purchases of ETL products are prudent.
|
|
|