Tuesday, October 21, 2008

Infobright's column-based data warehousing

Infobright, an open source data warehousing start-up, addresses the performance issues that usually come along with a data warehouse by implementing a highly compressed column-oriented store. The data is stored in columns instead of rows, which reduces I/O because of the compression ratios obtained on the columns. Data is stored in blocks of roughly 65K values, along with metadata nodes describing the contents of, and relationships between, columns.
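
Out of curiosity, here is a toy Python sketch (purely illustrative, nothing to do with Infobright's actual engine or formats) of why a column-oriented layout compresses well: values within one column are homogeneous and repetitive, so a generic compressor typically does noticeably better on them than on the same data interleaved row by row.

    import random, zlib
    random.seed(1)

    # 50,000 fake warehouse rows: (country, product category, quantity)
    countries  = ["US", "UK", "DE", "IN", "JP"]
    categories = ["electronics", "grocery", "apparel"]
    rows = [(random.choice(countries), random.choice(categories), random.randint(1, 5))
            for _ in range(50_000)]

    row_layout = repr(rows).encode()                                  # values interleaved row by row
    col_layout = b"".join(repr(col).encode() for col in zip(*rows))   # one whole column at a time

    # The columnar layout usually compresses to a noticeably smaller size.
    print(len(zlib.compress(row_layout)), len(zlib.compress(col_layout)))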

Some of Infobright's key customers are RBC Royal Bank and Xerox. They claim the product is ideal for data warehouses ranging from 500GB to 30TB, and their compression ratios are close to 40:1 according to their community blogs. The most attractive feature for me was the compatibility with existing Business Intelligence tools like Business Objects and Pentaho.

I wasn't very convinced by the concurrency they offer: support for 50-100 users with 5-10 concurrent queries. I will watch the progress of this exciting new player in the already crowded BI market.

Thursday, July 10, 2008

Release Early Release Often

The Release Early, Release Often (RERO) technique proposes frequent, early releases instead of one big-bang release. This approach is typically followed by tech start-ups and open source projects; that's why many of Google's products stay in beta and get updates roughly once a month. We decided to experiment with the strategy on a large Master Data Management (MDM) project, and the experiment turned out to be successful. The rest of this post discusses the details of that experience.

User Thrill

Important features of the application, such as hierarchy and workflow management, security, and exception reporting, were phased across distinct releases, and the gap between releases was as short as 2 weeks. That meant users saw features being added every 2 weeks. We captured user feedback on each release and made sure we addressed it in the immediately following one. This approach had a two-pronged benefit: the users experienced the application very, very early, and we experienced the bugs. By the time the UAT phase arrived, the application had reached a near-zero defect zone. We were a bit skeptical about whether user participation would be high, but since there was a product to be played with, it naturally attracted them.

Incremental Application testing

The application was being tested from the day the first beta was released, rather than only from the "Go Live" day. Although a few unpleasant bugs created some negative impressions on the user experience, the users knew it was in its beta stages and that the next release would carry the patched version. In fact, our testing team grew from a 3-member team to a 6-member virtual team (the extra 3 were business users).

Support framework

To enable such a dynamic release process, the revision control and code review/release systems must be efficient, since there are multiple releases instead of one. Integration testing should be solid, and the unit testing before each release should be good enough not to distract your users completely, which would defeat the purpose. Meticulous planning of the releases is also key to success. The development tools you use should be agile and adaptable enough to accept the user's feedback and implement it in the next release.

Conclusion

The experiment turned out to be a success. This strategy would work for most implementations, unless it's a maintenance project whose deliverables span less than a week.

Wednesday, June 25, 2008

Which MDM approach is right for you?

MDM, in the past 5 years, has come a long way in its maturity model. Most MDM implementations fall under one of two approaches.
  1. Operational MDM (the tougher of the two)
  2. Analytical MDM

Operational MDM enables synchronization of master entities and their attributes between transaction processing systems. Why does one need such an MDM? Let's take an example. ABC Corporation is a manufacturing firm that conducts roadshows and marketing campaigns to advertise its products. Salespeople collect customer information during those roadshows and feed it into their IT systems for further follow-up. A different set of sales representatives collects feedback from customers about the products they bought, and they too enter that feedback into their IT systems. These are 2 different CRM processes.

Typically, what happens in a mature company is that a set of batch processes picks up the master data from one system and transfers it to the other. This introduces delay, inconsistency, inaccuracy of data and a lot of manual reconciliation (the same customer can be entered by 2 different salespeople, or the latest survey from a salesperson can erase previously collected information about the customer). So IT develops custom programs to clean up the data and writes reconciliation routines, but still cannot manage to do all this in real time.

This mess can be reduced or eliminated by deploying an operational MDM. Operational MDM tools solve the synchronization problem using complex match-merge algorithms. Some of the tools currently in the market are Siperian, IBM, Purisma, Oracle and SAP.
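
To make the idea concrete, here is a deliberately naive match-merge sketch in Python. The record layout, similarity threshold and survivorship rule are invented for illustration; real hubs such as Siperian use far richer probabilistic and deterministic matching.

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match_merge(records, threshold=0.85):
        """Group customer records whose name+address look alike, keeping the
        most recently updated attributes as the surviving 'golden' record."""
        golden = []
        for rec in records:
            key = rec["name"] + " " + rec["address"]
            for g in golden:
                if similarity(key, g["name"] + " " + g["address"]) >= threshold:
                    if rec["updated"] > g["updated"]:    # naive survivorship rule
                        g.update(rec)
                    break
            else:
                golden.append(dict(rec))
        return golden

    # The same customer captured by two different CRM processes
    crm_roadshow = {"name": "John Smyth", "address": "12 Main St, Albany",
                    "updated": "2008-05-01"}
    crm_feedback = {"name": "John Smith", "address": "12 Main Street, Albany",
                    "updated": "2008-06-15"}
    print(match_merge([crm_roadshow, crm_feedback]))   # one merged golden record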

Analytical MDM is the right architectural approach when the problem revolves around inconsistent reporting for business performance management; in simple terms, inconsistent hierarchies are being reported. This calls for a unified reporting view of the master data. The audience for this system would be the downstream data warehousing and business intelligence applications. Some of the MDM vendors selling their expertise in this area are Kalido, Oracle and IBM.

Eventually, an organization has to build both models to address its MDM needs. Which one to choose first depends on which problem is higher on its priority list.

Friday, June 20, 2008

Teradata's reseller alliance with Trillium

Teradata Corporation announced a reseller alliance partnership with Trillium Software. Teradata will now combine its data warehouse product with Trillium's data quality tools and its own MDM products. Overall, this seems to be a good strategy for Teradata, because its customers can now leverage Trillium's data quality capabilities on their huge databases.

Because of this alliance, customers will enjoy a power-packed database, data quality tools and an MDM suite. Information Difference has, however, ranked Teradata's MDM low in its quadrant compared to the likes of SAP, Oracle and Siperian.

Thursday, June 19, 2008

Buy or Make - Financial Analytics

Today, I had a consulting assignment with a company focusing on server virtualization. The objective was to lay out the factors influencing a make-vs-buy decision, and their risk quotients, for a financial analytics solution.

Some of them are:
  1. What is the business requirement, and is it very unique?
  2. How urgent is the application?
  3. What is the technology strategy of the organization?
  4. Does the off-the-shelf product address most of the requirements, and is it flexible enough to be customized?
  5. How does the present make-buy decision relate to that strategy?
  6. Are there the right people and support systems to support the application, in case of a build?
  7. Does the financial tool address internationalization needs?
  8. Are there security measures built into the tool, given that it hosts sensitive data?
  9. Can the packaged solution be integrated seamlessly into the process control system?
  10. What is the underlying technology? In this case, what is the ERP system? It would make sense to buy the analytical solution from the same vendor as the ERP system, if it addresses your requirements.
  11. Will the TCO be reduced by the buy approach?
  12. Does it reduce cost?

After these questions were answered, the following matrices were prepared, which summed up the decision (a small scoring sketch follows the list).
  1. High Level Requirement x Priority x Effort Estimation Matrix
  2. Benefit Comparison Matrix
  3. Risk Comparison Matrix
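
The numbers and criteria below are entirely hypothetical, but they show the mechanics behind such comparison matrices: weight each criterion, score the make and buy options, and compare the weighted totals.

    # Hypothetical weights and 1-5 scores, purely to illustrate the scoring mechanics
    criteria = {   # criterion: (weight, make_score, buy_score)
        "fit to unique requirements": (0.30, 4, 3),
        "time to deliver":            (0.25, 2, 5),
        "total cost of ownership":    (0.25, 3, 4),
        "in-house support skills":    (0.20, 2, 4),
    }

    make = sum(w * m for w, m, _ in criteria.values())
    buy  = sum(w * b for w, _, b in criteria.values())
    print(f"Make: {make:.2f}  Buy: {buy:.2f}")   # the higher weighted score wins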

Tuesday, June 17, 2008

Statistics and Data

I was reading the excellent text "Statistics for Business and Economics" by Anderson & Sweeney. It highlights the importance of statistical measures in decision making; many existing predictive analytics tools use the principles covered in the text. It also highlights the importance of collecting and preserving data.

One such example was calculating the average wait time in the queue at a particular ATM in New York. Using this data, the bank would then decide whether to position a new ATM to balance the load in that busy place. The predictive model uses a probability distribution and helps the analyst make a decision. The models have to be refined so that they don't produce false positives.
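
For the curious, here is a rough sketch of the kind of model described: treat arrivals at the ATM as a Poisson process and service times as exponential, then estimate the average wait by simulation. The rates below are made-up numbers, not figures from the book.

    import random

    def average_wait(arrival_rate=0.8, service_rate=1.0, customers=100_000, seed=42):
        """Simulate a single-server queue and return the mean time spent waiting."""
        random.seed(seed)
        clock = free_at = total_wait = 0.0
        for _ in range(customers):
            clock += random.expovariate(arrival_rate)   # next customer arrives
            start = max(clock, free_at)                 # wait if the ATM is busy
            total_wait += start - clock
            free_at = start + random.expovariate(service_rate)
        return total_wait / customers

    print(f"Average wait: {average_wait():.2f} time units")
    # If the simulated wait is unacceptable, the bank has a case for a second ATM.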

Sunday, April 27, 2008

Why should I make my MDM SoA enabled?

A car sales manager is capturing the details of a customer who visited his showroom. After jotting down the client's address details, the manager wants to check whether the address is a valid one. How can he achieve it?

A PRO in the same organization receives a call from a customer who wants to change his address in the system records. The PRO logs in to the silo-ed application and enters the new address. While entering the address, the PRO wants to check whether the new address is a valid one. How can she achieve it?

MDM and SoA make this happen. MDM is the service provider, and SoA is the framework that helps consumers access the service with ease. A Location Master repository tied to an SoA framework lets any consumer use the services of the Location Master. This way, the same master data gets reused throughout the organization for multiple purposes.
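
A minimal sketch of the idea, with hypothetical names and validation rules: one address-validation service published by the Location Master, reused unchanged by both the showroom capture screen and the PRO's change-of-address screen.

    class LocationMasterService:
        """Stand-in for an MDM-backed validation service exposed on an SoA bus."""
        def __init__(self, known_postcodes):
            self.known_postcodes = known_postcodes

        def validate_address(self, street: str, city: str, postcode: str) -> bool:
            # Toy rule: non-empty street and city, postcode known to the Location Master
            return bool(street.strip()) and bool(city.strip()) and postcode in self.known_postcodes

    service = LocationMasterService(known_postcodes={"560001", "600042"})

    # Consumer 1: sales manager capturing a prospect at the showroom
    print(service.validate_address("4 MG Road", "Bangalore", "560001"))   # True

    # Consumer 2: PRO updating an existing customer's address
    print(service.validate_address("", "Chennai", "600042"))              # False - missing street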

In one of our projects, the client had a unique requirement: assess whether a product is worth promoting and, if it is, what promotion should be given to it. This requirement was addressed using a SoA framework on top of a product portfolio MDM. The architecture had BEA AquaLogic Service Bus interacting with Kalido MDM to provide the services. The challenges lay in identifying the streams that would consume this service and evaluating whether it was worth exposing as a service. The service found its usage in many CRM applications across the organization.

Sunday, April 20, 2008

MDM - SoA Marriage

After a long break from blogging, I am starting my series of explorations with MDM and SoA. When would somebody go for a SoA implementation of an MDM solution? Is it because there is an enterprise-wide initiative to make everything the SoA way? My current project proved to be a big failure on this front. We had to build an MDM solution, and the enterprise architecture team had a clear mandate to make anything and everything SoA enabled. The MDM solution was built on Kalido. Who are the consumers of this MDM data? A series of downstream CRM applications. Sounds good. Where would the SoA architecture fit in - just in the consumer world, or also in the sourcing world? We had to design the reporting solution completely SoA. But during stress testing, it turned out that the SoA framework just couldn't handle the volumes the downstream applications were streaming. The MDM solution received a huge number of updates from the ERP stack every day, and all these changes had to be propagated to the consumers. The SOAP message was simply too big to be parsed by the reporting solution, and users had to wait a considerable amount of time to get their reports out.

So my question is: do you design the MDM solution the SoA way, expecting that the performance issues will get solved in the future, or do you wait until they are solved and then re-architect the solution?

Currently, we have the SoA suite disabled and reports are fired directly from the databases to the reporting solution.

Sunday, February 03, 2008

XML-based MDM

After a brief hiatus, I am writing this article on Orchestra Networks' EBX.Platform - an XML-based approach to Master Data Management. EBX.Platform is based on J2EE and XML. The whole architecture is based on 3 items: Models, Services and Modules (Models + Services).

So how are they able to achieve their MDM framework?

The data model is developed using a simple XML Schema standard and is termed an Adaptation Model. Services, such as import/export, can be added on top of the adaptation models; they can also be maintenance features provided through the UI. Finally, Modules - nothing but Models plus Services - are deployed as web applications.
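
This is not EBX.Platform's actual API; the snippet below is only a small standard-library illustration of the general idea - the model is expressed in XML, and incoming master-data records are checked against it before being accepted.

    import xml.etree.ElementTree as ET

    # A made-up "adaptation model": which fields a Product record must carry
    model_xml = """
    <adaptationModel entity="Product">
        <field name="sku"      required="true"/>
        <field name="name"     required="true"/>
        <field name="category" required="false"/>
    </adaptationModel>
    """

    record_xml = "<Product><sku>P-1001</sku><name>Widget</name></Product>"

    model = ET.fromstring(model_xml)
    record = ET.fromstring(record_xml)

    required = [f.get("name") for f in model.findall("field") if f.get("required") == "true"]
    missing = [name for name in required if record.find(name) is None]
    print("valid" if not missing else f"missing required fields: {missing}")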

They also support quite a unique feature - branches and versions of master data. This helps a company maintain its current version of master data while it works on a future model.

I would like to monitor the progress of this interesting tool, given that it is being used in some big companies.

Friday, January 25, 2008

Data Modeling a Maze

A couple of weeks back, my friend took me to a maze. I was lost within a couple of minutes and was getting really frustrated after a while. I wasn't sure what algorithm they had used to construct the maze.

The only algorithm I knew was the "wall follower": keep either your right hand or your left hand touching the wall and you will reach either the exit or the entrance. I took the longest path, but eventually reached the EXIT.

This algorithm works only if all the walls are connected to each other or to the maze's outer boundary. From that point on, I was quite fascinated by the algorithms associated with mazes. There are also a few more efficient algorithms, like Tremaux's algorithm. Visit Think Labyrinth for more fun.
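
The rule itself is simple enough to sketch in a few lines of Python; the grid below is a toy maze, purely for illustration.

    # Right-hand-rule ("wall follower") on a toy grid: '#' wall, 'S' start, 'E' exit
    MAZE = [
        "#########",
        "S       #",
        "### ### #",
        "#   #   #",
        "# ### ###",
        "#       E",
        "#########",
    ]

    def solve(maze):
        rows = [list(r) for r in maze]
        r = next(i for i, row in enumerate(rows) if "S" in row)
        c = rows[r].index("S")
        dr, dc = 0, 1                                   # start facing east
        path = [(r, c)]
        for _ in range(10_000):                         # safety bound
            if rows[r][c] == "E":
                return path
            # try right, straight, left, back (relative to the current heading)
            for ndr, ndc in [(dc, -dr), (dr, dc), (-dc, dr), (-dr, -dc)]:
                nr, nc = r + ndr, c + ndc
                if rows[nr][nc] != "#":
                    r, c, dr, dc = nr, nc, ndr, ndc
                    path.append((r, c))
                    break
        return None

    path = solve(MAZE)
    print(f"reached the exit in {len(path) - 1} moves")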

After that, I thought of simulating a maze. Unfortunately, I am quite inept at programming languages, so I decided to do what I know best: create a simple data model for a wall-follower maze. It turned out to be quite an interesting problem. It took me around 30 minutes to come up with a decent logical data model that would work for quite a few scenarios.

So, the first model that I came up with is shown below. (Click on the picture to enlarge)



Let me give a quick explanation of the model

  1. Design Co-ordinate: Super-type Entity
  2. Entry, Exit and In-Maze Coordinate: Sub-type entities of DESIGN CO-ORDINATE, containing the co-ordinates of the locations where a player has to take a decision.
  3. Decision: Entity which holds whether to turn LEFT, RIGHT, UP, DOWN or ABORT.
  4. Decision Map: Entity which holds the mapping of a START-COORDINATE, the DECISION TAKEN (move left, right, up, down, or abort) and an END-COORDINATE (the co-ordinate where the player lands after taking the decision).
  5. Player: Entity which holds information about the player of the maze.
  6. Movement: Entity which tracks the movement of the player.

There is one interesting phenomenon in this model. If an intelligent player plays this maze, the model works, because the association between MOVEMENT, PLAYER and DECISION MAP has been modeled as an identifying relationship. This means that if a player tries to navigate the same path twice, the system will raise an error (simulating an INTELLIGENT player, who would never make the same mistake again).

But if a DUMB player were to play this maze, the model wouldn't work, because a DUMB player would make the mistake of traversing a fruitless path again and again. In that case the association should be modeled as a non-identifying relationship.
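
A tiny sketch of the difference, using a made-up MovementLog class rather than an E-R diagram: when the player and the decision map together form the MOVEMENT entity's key (identifying), recording the same traversal twice violates uniqueness; when they don't (non-identifying), repeats are allowed.

    class MovementLog:
        def __init__(self, enforce_unique_traversal: bool):
            # identifying relationship -> (player, decision map) is the key
            self.enforce = enforce_unique_traversal
            self.rows = set() if enforce_unique_traversal else []

        def record(self, player: str, decision_map_id: int):
            key = (player, decision_map_id)
            if self.enforce:
                if key in self.rows:
                    raise ValueError(f"{player} already traversed path {decision_map_id}")
                self.rows.add(key)
            else:
                self.rows.append(key)

    intelligent = MovementLog(enforce_unique_traversal=True)
    intelligent.record("Alice", 7)
    # intelligent.record("Alice", 7)   # would raise: the same path twice is not allowed

    dumb = MovementLog(enforce_unique_traversal=False)
    dumb.record("Bob", 7)
    dumb.record("Bob", 7)              # perfectly legal for the dumb player
    print(len(dumb.rows))              # 2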

How can I model both scenarios in one shot, without introducing redundancy in the entities or associations?

One way to model both scenarios is to track the movements as 2 different MOVEMENT entities and include the constraint only in the INTELLIGENT PLAYER MOVEMENT entity. But this introduces an extra associative entity. That is easy for a toggle situation: Yes or No, DUMB or INTELLIGENT.

But if I were to model differing levels of intelligence, how would I do it in a data model without writing any procedural code? How can E-R data models be efficiently designed for fuzzy logic systems?

I found this an interesting exercise to showcase that E-R data models are still a long way from being a truly self-sufficient tool.

We need a modern-day E. F. Codd.


Wednesday, January 23, 2008

Kalido's Business Information Modeler

Today, I received an update from Kalido on the Business Information Modeler engine. This is what Kalido claims about the product:

"Kalido Business Information Modeler provides a graphical design interface that can be used to develop and refine business requirements for new and existing information. Instead of modeling data and their structures, the Kalido Business Information Modeler allows you to model the actual parts of your business; customers, products, assets, transactions, even people – and define how you want to see information in context. Even better, the Kalido Business Information Modeler can be used to change and update your model directly against the Kalido Dynamic Information Warehouse, allowing you ultimate flexibility in meeting the information needs of your business. The Kalido Business Information Modeler dramatically improves your ability to meet the needs of your business when it requires it – not when how it’s stored determines it".
The product is due in March 2008. I am waiting to experiment with the new features it claims. I will be evaluating the product on the following questions.
  1. Can an in-house data warehouse be easily migrated into Kalido?
  2. Will the business layer completely abstract the data layer?
  3. Is it just a visual aid for creating/maintaining your data model?
  4. Will the data in the warehouse be used by the tool to help the modeler provide real-time feedback on the errors and the inconsistencies of the new model that he plans to implement?
I will be writing more on this interesting product after I get a practical hands-on. Visit www.kalido.com for more details.

Sunday, January 20, 2008

Oracle snaps up BEA systems

Oracle has recently purchased BEA Systems for 8.5 billion USD. One of the motives I think could be behind this purchase is getting hold of BEA's large customer base. This will also help Oracle compete more tightly with IBM in the middleware space.

This move also helps Oracle's customers move to a subscription-based model. Oracle claims the acquisition is expected to accelerate innovation by bringing together two companies with a common vision of a modern service-oriented architecture (SOA) infrastructure, and that it will further increase the value Oracle delivers to its customers and partners.

Oracle has eliminated a strong commercial rival and forayed into the enterprise middleware market, edging towards becoming the market leader.

Friday, January 18, 2008

Styles of MDM framework

MDM can essentially be fitted into 3 styles of framework.

1. Registry-based approach: The MDM contains references to the actual data stores and doesn't hold the data itself. It has pointers to the respective source systems for the attributes it hosts, and data governance and integrity are left to the source systems to handle. It is a quick way to set up an MDM: the registry decides where to pick up the data from at run-time. I haven't seen many companies implement this model.

2. Centralized hub: The master data is integrated from different applications, cleansed, standardized, governed, secured, authorized and published to different subscribers from one central repository, which hosts the entire MDM data set. It takes a long time to set up, but it is one of the most efficient ways of integrating master data.

3. De-centralized regional hub: Similar to the previous style, but the corporate data is maintained in a global MDM hub while regional or business-specific MDM requirements are handled in local/regional hubs. This clearly separates corporate needs from regional ones.
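
A rough sketch of the contrast between the first two styles, with hypothetical systems and attributes: the registry only resolves pointers to the source systems at run-time, while the centralized hub physically stores the cleansed golden record.

    # Hypothetical source systems holding fragments of the same customer
    SOURCE_CRM = {"C-42": {"name": "Acme Ltd", "phone": "555-0100"}}
    SOURCE_ERP = {"C-42": {"credit_limit": 25000}}

    # Style 1: registry - the MDM keeps only "which system owns which attribute"
    REGISTRY = {"name": SOURCE_CRM, "phone": SOURCE_CRM, "credit_limit": SOURCE_ERP}

    def registry_lookup(customer_id: str) -> dict:
        # Assemble the customer view at run-time from the owning systems
        return {attr: system[customer_id][attr] for attr, system in REGISTRY.items()}

    # Style 2: centralized hub - the cleansed, merged record lives in the hub itself
    CENTRAL_HUB = {"C-42": {"name": "Acme Ltd", "phone": "555-0100", "credit_limit": 25000}}

    print(registry_lookup("C-42"))
    print(CENTRAL_HUB["C-42"])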

Choosing which model to go with is one of the key elements for the success of an MDM project.

Thursday, January 10, 2008

MDM - Part 3 - Kalido MDM

This is one of my favorite tools, having worked on it for quite some time now. Let me provide an unbiased opinion on Kalido based on the "key capabilities of MDM" post that I wrote earlier.

The first point, on data governance, can be omitted here, as it is more of a process-oriented practice.

Does Kalido control the flow of good-quality data into the MDM repository? Yes. Kalido provides association rules (1:N, M:N, optional, mandatory), data-type verification, deletion-anomaly checks, data-length verification and custom validation formulas. Is this enough for an MDM tool to host clean master data on its own? Probably not, but Kalido still wins in this area because it covers most of the important validation checkpoints.

Kalido is a truly flexible data modeling tool. It can model time-variant hierarchies, ragged hierarchies, depth-less hierarchies and super-type/sub-type relationships, and having done all this, it is quite easy to change from one model to another. This is because it has quite a generic modeling mechanism, and most companies that are into heavy-duty acquisitions and mergers prefer it. Kalido completely wins here; I have rarely faced a scenario that I wasn't able to model in it. Kalido also provides features for moving models during the migration process.

The MDM component of Kalido isn't a master at integrating with heterogeneous sources. As of now, it can accept only text files and expects the ETL tool to convert the data into CSV/XML format.

You can define "sophisticated" workflows to move a piece of data between states. One can define action items (like email notifications), the events triggering the workflow, and the different states of transition. Editing data and raising an issue or change request are possible with this tool. So Kalido wins again.
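
This is not Kalido's API; the sketch below only shows the general shape of such a state-driven workflow - a table of allowed transitions, with an action (for example, a notification to the data steward) fired on each event.

    # Hypothetical states and events for a master-data approval workflow
    TRANSITIONS = {
        ("draft",            "submit"):  "pending approval",
        ("pending approval", "approve"): "published",
        ("pending approval", "reject"):  "draft",
    }

    def fire(state: str, event: str) -> str:
        new_state = TRANSITIONS.get((state, event))
        if new_state is None:
            raise ValueError(f"event '{event}' not allowed in state '{state}'")
        print(f"notify data steward: {state} --{event}--> {new_state}")   # action item
        return new_state

    state = "draft"
    state = fire(state, "submit")
    state = fire(state, "approve")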

Kalido implements security through Access Control Lists. ACLs dictate which sets of users can access the data (down to the level of an instance of an entity) and what data they can access.

Probably the one area I am not thoroughly convinced about is search and UI. It has a decent hierarchy browser and a neat search feature. Although it has .NET compatibility, certain basic UI customizations (like changing the font of the text when the data belongs to one particular market) are cumbersome.

Kalido truly lacks in the data enrichment area. It currently doesn't have pre-built vanilla models, which might be useful for certain master data such as Product or Customer.

I haven't tested Kalido on a distributed network, hence I can't comment on it.

Overall, Kalido is an effective MDM solution.

Saturday, January 05, 2008

MDM - Part 2 - Key capabilities of an MDM framework

Last year, one of my colleagues was deployed to a leading FMCG company to understand and consolidate their global reference data. She started off by interviewing the data management heads from various countries and, after 6 weeks of tough grind, came up with a very good logical data model. But when she started materializing her E-R model into the tool, she began facing problems. Based on further discussions with colleagues who have worked on MDM implementations, and on my own experience, I have collated a few key points that are essential for the smooth running of an MDM engine.

Note: Broadly, they have been categorized as Must-have (bold red) and Nice-to-have (bold green).

  1. Data Governance and Stewardship: Identifying the right people to own the right data. This team is responsible for setting up the security access, correcting the erroneous data, defining the work-flow and acting on the notifications and submitting a report on the usage of the data.
  2. Data Quality Management: Bad data is as good as not having the data at all. Processes and frameworks that constantly apply business rules to bring sanity to the master data are a must. This is one of the most complex points in the whole MDM cycle. Does the tool possess adequate data validation techniques, or does it rely on the ETL tool?
  3. Flexible Data Modeling Capability: The tool should be as adaptive as the business process. A flexible data model to quickly prototype and develop is the ideal tool for such an implementation.
  4. Integration Engine Maturity: The data integration drivers that ship with the tool play a key role in tool evaluation. Look for a tool with good integration capability. Some tools stop with a flat-file feature; though this might be enough to start developing your repository, there may be added ETL effort if your sources are completely heterogeneous in nature.
  5. Work-flow enabled Authorization Model: How do the authorization and publishing of data happen? Is it through mails or through a sophisticated work-flow engine? From my acquaintance with the tools in the market, much of the MDM analyst's time is occupied in composing mails about the next action items to be taken on the data. This is where a tool with a 'cool' work-flow feature gains the upper hand.
  6. Security & Access control: Can the users of the Indian market control Australian customers? Probably yes, maybe no. Security and access driven capability of the MDM system is a must for an organization trying to consolidate its world-wide master information.
  7. Search & UI Customization: In this search-driven world, (thanks to Google), a tool without search capabilities is a failure written all over it. The UI should be customizable and the framework should have inherent APIs to achieve the same.
  8. Data Enrichment: Some of the tools have the means to integrate with the market research data vendors to enrich their data. A good example could be enriching the customer data for D&B related fields. Though this is not a MUST feature, it certainly is a feature for tool differentiation.
  9. Service Oriented in Nature: SoA utilizes loosely coupled, reusable, and interoperable software services to support business process requirements. Though this is not very specific to the tool, it is more of a framework question - can the MDM solution easily be positioned into an SoA architecture? For example, if the tool has the capability to talk to different sources, integrate the data and present the data as services, then yes, it can marry SoA.
  10. Distributed system: This is probably one of the last items to be evaluated. If your master data runs into terabytes, then this feature of the tool might be worth visiting.
These 10 points sum up the different capabilities/components of an MDM solution. There are a few other points, like cost and platform dependency, which I would leave to the discretion of the organization's policies.

Thursday, January 03, 2008

MDM - Part 1 - An Introduction

In continuance with my recent post on "MDM War", I would like to take you all into this enchanting world of MDM with a brief introduction.



In my own words, MDM (Master Data Management) is the single place where any kind of reference data is maintained globally for an organization. All transactions and business processes look up to the services of MDM for their operations. Some of the important types of master data that an organization would maintain are



  • Product

  • Employee

  • Customer

  • Location

  • Supplier/Vendor

As you can see, these entities can stand alone and are independent of the business processes an organization participates in. MDM allows companies to consolidate master objects that might be residing in silos, then harmonize, enrich and federate one common view of the organization's data to the businesses seamlessly. MDM, contrary to a common misunderstanding, is not THE TOOL that will do this magic. It still relies on people and processes to solve the puzzle; it provides the framework to achieve it without much fuss.


MDM is one application for the whole organization, not one for each business unit, though some of the services might be business-unit specific. For example, the HR department wouldn't be interested in product data, and the sales department wouldn't be keen on employees' salaries.

The supply chain in the picture gives a better example of how the different businesses in an organization would like to view the master data (Click the picture for better clarity)



In a nutshell, the key capabilities of an MDM tool are



  1. Master Data Integration

  2. Master Data Consolidation

  3. Master Data Quality Validation

  4. Master Data Enrichment (optional)

  5. Work flow based Data Maintenance

  6. Master Data Publishing


Any tool which doesn't provide these features fails to provide a complete MDM suite. And one important thing: MDM has nothing to do with data warehousing.

Wednesday, January 02, 2008

MDM War

A market research firm estimates that the market for MDM-related products will peak at $1 billion by the end of 2008. Though MDM is a relatively new zone of investment for most organizations, the immense value behind such a venture has been proactively noticed. Some of the companies which have their own MDM product suites are

  • SAP - SAP Netweaver MDM
  • Kalido - Kalido MDM
  • IBM - IBM MDM (WPC, WCC)
  • Microsoft- Stratature
  • Oracle - Universal Customer Master.
One common aspect of this line of products is that most of them have been acquired. SAP launched its first set of MDM products in 2002 and had to quickly withdraw it because of operational issues; it then acquired A2i in 2004 and repackaged A2i's product as its own. IBM acquired DWL Inc. to get hold of PIM (Product Information Management) and CDI (Customer Data Integration). Microsoft acquired Stratature, and Oracle (after the not-so-good success of its Customer Data Hub) gained inroads into Siebel's UCM post-acquisition.

These acquisitions have also been in line with the product suites the companies already own. Oracle's merger with Siebel clearly puts it on top in the CRM and customer data management space, while SAP can leverage its ERP customer base to bundle in the MDM cake.

In the following weeks, I shall be examining each of these tools in detail to find out which one has the killer technology and experience to be crowned the "MDM Maharaja". Or will they all turn out to be a bunch of Sultans, each aiming merely to take a share of the $1 billion jewel?