|
DATA WAREHOUSE ARCHITECTURAL
CONSIDERATIONS
by William Laurent
Before we start talking about how we will implement an effective data warehouse,
we need to start talking about what (architecture, feeds, etc.) is in place
already. This document has been broken this document into distinct bulleted
sections that represent physical and/or logical particulars of warehouse design,
with the hopes that they will serve as a catalyst that will help project colleagues
and contributors start talking about the "hows" of implementation. From each
salient point listed, the team can focus and drill down to determine issues
and approach where applicable.
Much of what I have categorized must be imbued with, and determined by, definitive
and measurable business objectives. Certain concepts and practices that are
fundamental to designing an effective warehouse architecture and model need
to be driven by business needs. Such things as aggregations, user access requirements,
scheduled data availability, security roles, and so on--these are all dependant
on user needs, Service Level Agreements, and technology limitations, just to
name a few. With so many areas of responsibility and development, rolling out
one functional area of a data warehouse to end users while simultaneously constructing
other components is usually the best way to keep management content and ensure
continued upper level buy-in for the project. Permutations are endless. So
in no particular order, I offer a mish-mash of notable issues that the perspicacious
data warehouse analyst will want to snuggle up to.
Define Data Requirements and Determine How Users Want to Look at the
Data.
- Define general level of data granularity, data aggregations, etc..
- Define hierarchies/trees.
- Create definition of dimensions.
- Devise calculations: sums, averages, statistics (across multiple dimensions),
etc..
- Define roll-ups, drill downs, aggregates.
- Define history.
- Define delta refresh methods and requirements.
- Define preferred access methods.
Determine Automation and Execution of ETL Jobs and Other Batch Oriented
Type Processes.
- Inventory all relevant components.
- Determine interface requirements.
- ETL and batch integration of parts.
- Scheduling and cron process on the operating system servers.
- Create and define process control modules (PCM) and job wrappers.
QA and Testing
- Determine how multiple terabyte environments will be maintained for test,
QA, and production environments.
- Determine subsets of production data in test and QA.
- Dumps, loads, index creation, etc, of all environments.
- Create migration plans, release notes, and production support documents.
- Employ robust version control and tagging processes for parallel development
efforts across all lines of business.
- Don't overlook regression testing.
Reliance on "third party" data
- Grasp compliance and CRM issues.
- Identify what data is hosted outside company--the system periphery.
- Formulate Business Continuity and contingency plans accordingly (per
above).
- Leverage data by employment of XML interfaces and robust messaging.
- Open the door for discussion on e-commerce, and how to better serve and
empower the customer, vendors, and partners.
Know the Limitations and Strengths of your ETL tools and processes.
- May need to create pivots from long to short tables.
- Ask questions about native DB Drivers. Don't rely on slow ODBC connections.
- Understand fully how your tools deal with data truncation issues or problems.
- Understand fully how your tools deal with NULL values under all DML circumstances.
- Understand array functionality (in order to simulate and create pivot
tables.)
- Know the strength of tool integration with meta data repositories.
- Know the strength of tool integration with XML (output/input).
- Identify potential parsing and integration problems with tool's error
and process logging.
- Spot physical dependencies and conflicts of ETL batch jobs on other processes.
- With the help of meta data, document each ETL job's multiple sources,
targets, hubs, and spokes. Multi-mode inserts and updates quickly become
complex.
- Ask hard questions about the tool's scalability, bulk and delta capabilities,
and source file/scanner abilities.
- Make sure transformation rules are not buried in ETL tool code and kept
in a repository.
OLAP
- Understand implementation issues: Investigate the need for separate physical
OLAP server.
- Think about how OLAP tool will pull data from the warehouse.
- Define a security model and responsible business analysts, approvers,
publishers.
- Shoot for multiple publishing and versioning capabilities: This can offer
a very scalable data mart solution that keeps user warehouse access to
a minimum. The data warehouse remains a black box and you have data separation
on the last node of processing.
- History
- Know what kind of history is needed on all levels. For instance: different
tables may have different requirements, i.e., transaction history is frequent
and address change history is infrequent.
- Be familiar with history pitfalls: may be deathly costly in performance,
maintenance, etc..
- History must be user driven and defined from the beginning of the project.
- Define audit/access trails (user access and change history).
- Reach early agreements with the business on transactional history issues.
- Don't forget about process history (process times, benchmarks, etc.)!
Data Dictionary and Meta Data
- Establish the need for an enterprise data dictionary and meta data repository
from the beginning. (Mapping all the data points to new warehouse data is
a massive job / Different physical functional areas may share data elements
/ Cross-reference scenarios many times are cause for revision in data models,
etc.)
- Know how to integrate this data with ETL tools, different schemas, data
access layers, and onwards.
- Make plans to host the data dictionary on corporate intranet.
- Determine the best way to have centralized yet integrated meta data.
You do not want meta data that quickly becomes orphaned or siphoned off
into individual data marts, becoming local meta data. You need to share
data definitions and semantics across the enterprise in order to avoid
stovepipe data marts and provide a single version of the truth.
- Know which business rules are extremely dynamic and function specific
(i.e. compliance, legal); they may be candidates to stay on the application
server.
- Remind yourself that meta data should be approached from a mindset of
integrating functional areas in a transparent way. Do not attempt create
federated meta data unless you really know what you are getting yourself
into!
Error Checking and Logging.
- Promulgate requirements and assessments of error handling, error logging,
error reporting, and exception logic/flow.
- Understand where errors may occur (UNIX, ETL, database RI, Java, OLAP,
etc.) and handle them correctly, in other words, store them in a database
table.
- Prescribe an error threshold matrix: How many error instances of type
n necessitate job failure or stoppage?
- Define validation failure levels.
- Create e-mail enabled error notification.
- Know when to use RI and when to avoid it. Same with triggers and other
data base objects than may affect performance and/or integrity--often at
the expense of each other.
Process and Flow Control Logging
- Log job process information to attain effective benchmarking and comparative
performance valuations: Processing a start and end time logged for each job
establishes a benchmarking foundation as well as mean/median run times for
SLA agreements, troubleshooting, etc.
- Detail requirements for robust process control meta data. Processes that
are part of batch jobs--stored procedures, shell scripts, OS and middleware,
ETL routines, etc.--as well as parameters for those jobs can be maintained
in meta data. Jobs can then be called sequentially in a programmatic fashion
and new jobs and parameters can be added easily. It will become easy to
track activity by the user population in order to pinpoint performance
problems, facilitate maintenance, perform load balancing on hardware, etc.
Can include such things as query start and end times, keep track of most
used queries, number of reads on database, as well as number of rows returned,
and other relevant factiods.
Fault Tolerance and Recovery
- If the goal for your warehouse is 24-7 availability, never forget it!
- You always need an up-to-date and complete back up and recovery plan.
- Conduct frequent hardware needs assessments (DB servers, application
servers, OLAP servers, etc.) and inventory.
- Know how your data's movement impacts the network and its bandwidth.
- Study RAID(0) disk striping and Raid(5) fault tolerance. Understand fail-over
and redundancy issues and implement a logical and physical architecture
accordingly.
- Practice disc mirroring. Split databases across multiple disks (one to
one--segment to devices) enhancing use of segments. You must have a good
DBA who knows VLDB and hardware tricks, and who feels up to the task.
Application Server
- Must consider that technology decisions made concerning application servers
will play a huge part in an overall data warehouse architecture and its future
scalability. A true n-tier warehouse will be able to offer intelligent load
balancing and fail-over (via application server clustering)--the ability
to support any combination of front and back-ends, with high availability
and transparency.
- Is there an advantage to keeping system authentication in one place,
not specific to databases or applications? Sometimes it may be a good idea
to authenticate on the application server (Access Control Layer/ACL) and
pass the role to the database. Connections may be pooled, offering better
system scaling and less a system less burdened with access logic.
- Secure system access at all tiers--web server/NT, application server,
database, as well as firewall. Understand single sign-on paradigms, with
application server playing traffic cop via ACL, SSL, transaction objects,
and brethren. With different front ends or a portal accessing warehouse,
thinking about an application server becomes even more important.
Database Physical Design and Performance
- Estimate the warehouse storage requirements: monthly, weekly, yearly.
- Define ODS and staging areas for dynamic and extensible processing.
- Chart database load, backup, replication, and restoration times.
- Know when to use views and when to shy away from them.
- Consider that your detail and summary data should probably be in different
schemas, your main data warehouse non-volatile, and possibly a black-box
like structure.
- Practice debating and choosing sides over the battle of "Star/Snowflake
Schema verses Normalization".
- Try to detect when you may run into performance tradeoffs in your DML.
- Will you use views as "virtual data marts"?
- See the need for a staging area--either separate or incorporated into
an ODS type structure. We don't want to upset the ODS with bulk loads;
however, we don't want un-validated data to go into the black box data
warehouse.
- Know exactly where the best physical area in the database (ODS, staging,
etc.) will be to store data that has been aggregated?
- Understand the danger and advantages non-logged activity such as BCPs
and table truncates. Weigh options that may minimize lock contention in
the warehouse or data marts. For instance BCPs and table truncates offer "non-logged" operations
that my help you bypass bottlenecks.
- The warehouse will require a well thought out sizing, partitioning, and
indexing strategy as well as deft mapping/placement of database objects
to hardware.
- Clustered indexes should be used very carefully with non-clustered indexes
stripped on a separate physical disk from the base table where needed.
- Have enough time windows to update indexes and statistics on huge tables
to clean up fragmentation, page splits, etc.
- Will main warehouse never have to deal with the burden of validation?
Will only pristine data be inserted into this area, with a staging area
taking up the validation and scrubbing burdens?
I have only listed a few warehouse cogitations, with the hope that you (as
the data architect, manager, or other important person) may set in motion the
creative juices necessary for the actualization and manifestation of a great
data warehouse. The variations on these themes will be endless. Numerous new
considerations--far above and beyond what I have listed--will take you down
roads never before traveled. Good Luck!
|
|
|