Senthil On Data

Tuesday, June 21, 2011

As an architect...

I have been playing the role of a DW Architect for a large financial services organization over the past 6 months. When I reflect on my duties and responsibilities, one aspect stands out strongly, and it is the one that gets missed in the job description.

"Leaves his Unique Selling Proposition trace in the project"

There is often a lot of other crap that organizations look for, like coordination between the business and IT, ownership of data models, providing strategic direction to the IT team, etc.

Do you think Emperor Shah Jahan would have given a big job description to his architect, Ustad Ahmad Lahauri, when he set out to build the Taj Mahal? Shah Jahan himself described the Taj Mahal as:

Should guilty seek asylum here,
Like one pardoned, he becomes free from sin.
Should a sinner make his way to this mansion,
All his past sins are to be washed away.
The sight of this mansion creates sorrowing sighs;
And the sun and the moon shed tears from their eyes.
In this world this edifice has been made;
To display thereby the creator's glory.

If you read the last line, it stresses the need to display the creator's glory. An architect should leave behind his glory after he is long gone from the project. That should be the true job description. It should be the one thing that the people would still talk about after he has resigned. The rest of the responsibilities are just enablers for the ultimate glory.

Personal Analytics - Excel & Qlikview

Every day, we make decisions, whether at work, at home or in transit. How do we make them? When we make decisions at work, we use "saved" data to help us out. But at home, we rely on our own memory. We also rely on reviews on the web. We rely on "Facebook" likes. But somehow, I am still not convinced that this helps us reach a decision that is the closest to the best. So I decided to maintain my own personal analytics database. My first "personal" problem was finding the right health insurance for my dad. So I decided to build my own BI environment to help me make this decision.

I used Excel for feeding & maintaining data. I profiled attributes of all kinds for this.
I used Qlikview for BI analytics.

It is turning out to be a great exercise for me to find which insurance is the right one for my dad.

Tuesday, June 14, 2011

Metadata Management - Scratching your own itch

Meta-data management is always a complex problem, because it's about capturing "data about data". I don't think DW as an industry is anywhere near making sure organizations are cleansed of bad data. We still have bad data, and when it comes to capturing finer details about this "bad data", it's even worse.

My current customer has this huge problem of not knowing how the data elements get mapped between different hops in the data warehouse (Staging, DW, Datamarts, Business Objects Universe and ultimately the "Requirements"). There were lots of discussions about what kind of tools to procure and what profile of meta-data architect to recruit. We were getting nowhere.

We decided to scratch our own itch. Three months back, we started documenting the data lineage in our spare time, in a simple "denormalized" spreadsheet. It took us time. Layer by layer, source by source, we did it. After 3 months, we had a full-blown spreadsheet which captured the complete data lineage and the business rules implemented in the DW layers.
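To make the idea concrete, here is a minimal sketch of what such a denormalized lineage sheet could look like and how it can be queried. The layer names follow the hops listed above, but every table, column and business rule in the sample is hypothetical and not taken from the actual project.

```python
import csv
import io

# Hypothetical lineage sheet: one row per mapping between two hops.
# All element names and rules below are made up for illustration.
LINEAGE_CSV = """target_layer,target_element,source_layer,source_element,business_rule
DW,customer_dim.full_name,Staging,stg_customer.first_name,Concatenate first and last name
DW,customer_dim.full_name,Staging,stg_customer.last_name,Concatenate first and last name
Datamart,sales_mart.customer_name,DW,customer_dim.full_name,Straight move
Universe,Customer Name,Datamart,sales_mart.customer_name,Straight move
"""


def load_lineage(text):
    """Parse the spreadsheet export (CSV text) into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def trace_back(rows, layer, element):
    """Return every upstream mapping that feeds (layer, element), nearest hop first."""
    hops = []
    frontier = [(layer, element)]
    seen = set()
    while frontier:
        current = frontier.pop(0)
        if current in seen:
            continue
        seen.add(current)
        for row in rows:
            if (row["target_layer"], row["target_element"]) == current:
                hops.append(row)
                frontier.append((row["source_layer"], row["source_element"]))
    return hops


if __name__ == "__main__":
    rows = load_lineage(LINEAGE_CSV)
    for hop in trace_back(rows, "Universe", "Customer Name"):
        print(hop["source_layer"], hop["source_element"], "|", hop["business_rule"])
```

Because every row repeats both ends of the mapping, the sheet stays easy to filter by hand, and a few lines of script are enough to walk a report field back to its staging columns.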

When we reflected back, we realized a couple of "eye-openers":

1. Don't invest upfront in meta-data. Ask your existing team to start documenting in the easiest and most flexible manner possible. Probably a spreadsheet. Take it one step at a time.

2. And start small. Get the data first and then think about tools. Check if the data makes your job any easier or the business user's job any more productive. If not, the meta-data program is not for you.

Tuesday, June 07, 2011

Design Documentation - Batons in a relay race

Documentation is always like a "baton" in a relay race. You care for it only when you pass it over to somebody else; until then, you don't even bother about its existence. If you have to make a baton interesting to a running athlete, you need to stuff it with something that is interesting to him; probably some sort of energy drink, which he can consume whenever he becomes exhausted. The beauty of the baton design is that it has to be light enough not to be an extra burden to the athlete, and shaped so that it enables a quicker passover between the athletes.

I had always stayed away from "Documentation", because I never found a usage for it except during audit or knowledge transition. And even in audit, nobody cares for the quality of the document; the auditors just check for the existence of the document.

So to make some sense of this complex phase in an SDLC, I decided to apply the "Baton" analogy to documentation, because just as with a baton, without any documentation I can never say that somebody completed a relay race. To make documentation competitive, I decided to do the following: Make it -

1. Light - so that it's easy to carry.
2. Interesting - so that the consumer opens it and uses it frequently (and)
3. Do its job - so that the transition/passover is easy

Light

Why is iPad 2 thinner and lighter than iPad? Simple. They moved from 2 thicker batteries to 3 thinner batteries. They made the whole shell using carbon fiber. Not that I understand the material composition of carbon fibers to comment on it, but if you reduce the content, it becomes lighter. So essentially, I will try to make my document as light as possible. "Fewer pages" will be my KPI. So I decided to budget a page count for every kind of document and focus on conveying whatever I wanted to convey within that page count. And I have a motherhood rule: "Don't cross 15 pages of content".

Interesting

For anything to be interesting, it should be useful. Information in the document should be useful and should in turn help the consumer/creator of the document. So to make it interesting, I decided to add 3 simple sub-sections to every section that I created - What, Why & How. So, if I have to design Change Data Capture (CDC) in my ETL process (some data warehousing terminology), then I add: What is CDC? Why should I use CDC? And how should I enable CDC?

Do its job

It should do its job. It should enable transition. So, if the successor wants to understand what the predecessor did, the document should be able to convey it. That knowledge shouldn't be lost.

So if a design document addresses the above 3 philosophies, it has met its purpose of existence.

Saturday, December 19, 2009

Can BI Strategy ever become real?

When a super BI consultant recommends a whole stack of To-Do items in the information curve, how real is his recommendation? How real is the BI Strategy? 30%, 50% or 70%? Have you as a customer asked your vendor about it? Ask it. The typical response would be - "Depends".

The problem is not with the BI consultant; the problem is with prediction. BI is a game where the variables are too many - money, business benefit, pain problems, information maturity, tool consolidation, vendor proliferation, data volumes, system integrators, application support, advanced visualization, data conformance, data quality, stewardship, etc. The list just goes on and on. BI Strategy almost turns into a weather forecasting system. So is there no answer to being real? There is. The answer is "Stop doing it. Get Real."

BI comes with a cost. It's not something that you can purchase during a sale. BI is something that every organization needs. It has become ubiquitous. A strategy is just a sales tool to get approval from your governance board. Do you need one? Why do you want to spend on a sales tool to prove that BI is required for your organization? Would you construct a business case for seeking admission for your son or daughter into the IIMs or the MITs of the world? Instead, spend the money on building a 60-day data mart. Make the users use it for a month. After a month, pull the plug. The # of calls you receive to get the system back will tell you the ROI of BI.

Let me know what you think.

Sunday, July 05, 2009

Operational BI - Part 2

Having set the need for an Operational BI in my previous post, I will sketch out the architecture of an Operational BI solution.

The four important blocks to be considered while designing an O-BI system are

Sourcing/Extraction Module
Transformation & Load Module
Data Retention Module
Reporting Module

Sourcing/Extraction Module discusses the extraction strategy from the source. This module covers the change data capture (CDC) design & data transfer mechanism. Choosing the right extraction strategy would spell success or failure for the project. Almost 70% of O-BI projects fail because the wrong sourcing strategy was implemented.
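One common way to implement CDC is a timestamp watermark: each run extracts only the rows modified since the previous run. The sketch below illustrates that single technique; it is not the design this series settles on, and the `updated_at` column and the sample orders are assumptions made for the example.

```python
from datetime import datetime


def extract_changes(source_rows, last_watermark):
    """Timestamp-based CDC: return rows changed after the watermark, plus the new watermark.

    Assumes every source row carries a reliable 'updated_at' timestamp (hypothetical column).
    """
    changed = [row for row in source_rows if row["updated_at"] > last_watermark]
    # If nothing changed, keep the old watermark so the next run starts from the same point.
    new_watermark = max((row["updated_at"] for row in changed), default=last_watermark)
    return changed, new_watermark


if __name__ == "__main__":
    orders = [
        {"order_id": 1, "updated_at": datetime(2009, 7, 5, 9, 0)},
        {"order_id": 2, "updated_at": datetime(2009, 7, 5, 9, 20)},
        {"order_id": 3, "updated_at": datetime(2009, 7, 5, 9, 45)},
    ]
    delta, watermark = extract_changes(orders, datetime(2009, 7, 5, 9, 15))
    print([row["order_id"] for row in delta], watermark)
```

A watermark like this misses hard deletes and depends on the source maintaining its timestamps, which is one reason the choice of extraction strategy deserves attention up front.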

Transformation & Load Module discusses the kind of loading tool-set that would suit an O-BI system. Details about the expected load volumes, the loading patterns and the hand-shaking mechanisms with the source will be discussed.

Data Retention Module discusses the parameters required for estimating the size of the sliding data storage windows.

And finally the Reporting Module discusses the kind of reports that an operational executive would need for making his tactical decisions on an hour-by-hour basis.

These sections will be discussed in detail in my future posts.
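As a rough illustration of the kind of arithmetic involved, the sketch below sizes a sliding window from four parameters. The parameter set (arrival rate, average row size, retention period, overhead allowance) is my assumption for the example, not a list taken from this post, and the numbers are invented.

```python
def sliding_window_bytes(rows_per_hour, avg_row_bytes, retention_hours, overhead_pct=20):
    """Estimate storage for a sliding retention window.

    All four parameters are illustrative assumptions:
    rows_per_hour   - average arrival rate from the source
    avg_row_bytes   - average stored size of one row
    retention_hours - how long data stays in the operational store
    overhead_pct    - allowance for indexes and free space, as a whole percent
    """
    raw_bytes = rows_per_hour * avg_row_bytes * retention_hours
    # Integer arithmetic keeps the estimate exact for whole-number inputs.
    return raw_bytes * (100 + overhead_pct) // 100


if __name__ == "__main__":
    size = sliding_window_bytes(rows_per_hour=50_000, avg_row_bytes=200, retention_hours=72)
    print(f"{size / 1_000_000:.0f} MB")
```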

Saturday, April 18, 2009

Operational BI - Part 1

The genesis of BI has always been the need to seek the BIBLE of decision making. But BI over a decade has transformed itself from a night watchman into more of a 24/7 call-center representative. It has become real-time. What made this change? Why was the mutation to real-time necessary? What are the challenges in data integration? And finally, how can Operational BI (O-BI) be coupled with the Enterprise Analytical Reporting framework? I will be assessing each of these questions in greater detail and arriving at a design pattern for modeling an Operational BI solution.

Let us drive the need for implementation of an operational BI solution with an example.

A store manager at a retail outlet manages various aspects of retailing - visual merchandising, customer experience, resource scheduling, loss prevention, product management (ordering, receiving, pricing, inventory). Let me explain each one of these facets of the retailing business briefly.

Visual Merchandising: Promotion of the sale of goods through visual appeal in the stores (source: Wikipedia).

Customer Experience: Reduced customer wait-time in the check-out counters.

Resource Scheduling: Monitoring the efficiency of the employee schedule for improved load balance of employee work-hours.

Loss prevention: Real-time monitoring of 'shrinkage' because of shoplifting, employee embezzlement, credit card fraud, system errors and many more.

Product Management: Real-time monitoring of product inventory.

Given this background, I will connect all these process areas with a business case that puts a Store Manager in trouble and show how an Operational BI solution can save his day.

Let us assume that the Store Manager has access to a reporting solution which refreshes once a day. He notices that the daily sales have dropped compared to the previous day. He drills further down to investigate the cause of the decline. He finds out that the drop can be traced to one particular hour in the day. A deeper look into the problem highlights an increased average customer wait-time in that hour, causing a poor conversion rate. The wait time is finally attributed to a reduced work-force in that hour because of a longer lunch break taken by the employees (since they turned up very early to work).

This problem could have been easily rectified if the store manager had had access to the data earlier. With real-time access, he would have noticed the dip in sales for that hour immediately and taken corrective action before the sales for that hour were lost. With a decent business case established for a real-time BI system, let's analyse what an operational BI is and how it helps solve the problem.
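The kind of check the store manager needs can be sketched in a few lines: compare each hour's sales against the same hour on the previous day and flag any sharp drop as soon as the hour closes. The 20% threshold and the sales figures are hypothetical, chosen only to mirror the scenario above.

```python
def flag_hourly_dips(today, yesterday, threshold=0.2):
    """Return (hour, fractional_drop) for each hour whose sales fell by at least the threshold.

    'today' and 'yesterday' map hour-of-day to sales; the threshold is an assumed cut-off.
    """
    dips = []
    for hour, sales in sorted(today.items()):
        baseline = yesterday.get(hour)
        if not baseline:
            # No comparable hour (or zero sales) yesterday, so nothing to compare against.
            continue
        drop = (baseline - sales) / baseline
        if drop >= threshold:
            dips.append((hour, round(drop, 2)))
    return dips


if __name__ == "__main__":
    yesterday = {11: 1000, 12: 1200, 13: 1100}
    today = {11: 980, 12: 1150, 13: 660}
    print(flag_hourly_dips(today, yesterday))
```

With a daily refresh, the same comparison only runs the next morning; with hourly or near-real-time data, the dip surfaces while there is still time to reschedule staff.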

The architecture of Operational BI and the challenges associated with it will be posted in the next article.

Tuesday, October 21, 2008

Infobright's column based datawarehousing

Infobright, an open source data warehousing startup, addresses the performance issues that usually come along with a data warehouse by implementing a highly compressed column-oriented store. The data is stored in columns instead of rows. This allows for reduced I/O because of the compression ratios obtained on the columns. Data is stored in 65K blocks, with nodes that hold a metadata store about the relationships between columns.
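To see why a column layout compresses well, consider this toy sketch: once rows are pivoted into columns, repeated values sit next to each other and can be collapsed. Run-length encoding is used here purely as an illustration; it is not a description of Infobright's actual compression, and the sample rows are invented.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists (a column-oriented layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


if __name__ == "__main__":
    rows = [
        {"region": "EU", "product": "Printer", "qty": 2},
        {"region": "EU", "product": "Printer", "qty": 5},
        {"region": "EU", "product": "Toner", "qty": 1},
        {"region": "US", "product": "Toner", "qty": 1},
    ]
    for name, values in to_columns(rows).items():
        print(name, run_length_encode(values))
```

Low-cardinality columns such as region shrink to a handful of pairs, and a query that touches only one column never has to read the others, which is where the reduced I/O comes from.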

Some of the key customers of Infobright are RBC Royal Bank and Xerox. They claim their product would be ideal for data warehouses ranging from 500GB to 30TB. Their compression ratios are close to 40:1 according to their community blogs. The most attractive feature about them was the compatibility with the existing Business Intelligence tools like Business Objects and Pentaho.

I wasn't very convinced by the concurrency they offer. It supports 50-100 users with 5-10 concurrent queries. I will watch the progress of this exciting new player in the already crowded BI market.

Thursday, July 10, 2008

Release Early Release Often

The Release Early Release Often (RERO) technique proposes releasing early and often, instead of one big bang release. This approach is typically followed in tech startups and open source projects. That’s the reason we see many of Google’s products still in beta and their updates getting released once a month or so. We planned to experiment with the strategy on a big Master Data Management (MDM) project. The experiment turned out to be successful. The rest of the essay discusses the details of that experience.

User Thrill

Important features of the application were phased across distinct releases. Some of them were Hierarchy & Workflow management, Security and Exception reporting. And the gap between releases was as short as 2 weeks. That meant the user saw features getting added once every 2 weeks. We captured the user feedback about each release and made sure we addressed it in the very next one. This approach had a two-pronged benefit. The user experienced the application very, very early and we experienced the bugs. By the time we reached the UAT phase, the application was in a near-zero defect zone. We were a bit skeptical whether the user participation would be high, but since the product was there to be played with, it naturally attracted them.

Incremental Application testing

The application was being tested from the day the first beta was released, or rather from the “Go Live” day. Although this created a few negative impressions on the user experience due to a few unpleasant bugs, the users knew that it was in its beta stages and that the next release would carry the patches. In fact, our testing team grew from a 3-member team to a 6-member virtual team (3 of them were business users).

Support framework

To enable such a dynamic release process, the revision control and the code review/release systems should be efficient, because there will be multiple releases instead of one. The integration testing should be solid. And the unit testing before each release should be good enough not to put your users off completely, which would defeat the purpose. Meticulous planning of the releases is also key to success. The development tools that you use should be agile and adaptable enough to accept and implement the user’s feedback for the next release.

Conclusion

The experiment turned out to be a success. This strategy would work for most of your implementations, unless it’s a maintenance project with less than a week’s duration of deliverable.

Wednesday, June 25, 2008

Which MDM approach is right for you?

MDM, in the past 5 years, has come a long way in its maturity model. Most of the MDM implementations fall under 2 different kinds of approaches.

Operational MDM (the tougher of the two)
Analytical MDM

Operational MDM enables synchronization of master entities and their attributes between the transaction processing systems. Why does one need such an MDM? Let's take an example. ABC Corporation is a manufacturing firm. It conducts roadshows and marketing campaigns to advertise its products. The salespeople collect customer information during those roadshows and feed it into their IT systems for further followup. A different set of sales representatives collects feedback from customers on the products sold. They too enter the customer feedback into their IT systems. These are 2 different sets of CRM processes.

What typically happens in a mature company is that a set of batch processes picks up the master data from one system and transfers it to the other. This introduces delay, inconsistency, inaccuracy of data and a lot of manual reconciliation (the same customer name can be entered by 2 different salespeople, or the latest survey from a salesperson can erase previously collected information about the customer). So IT develops custom programs to clean up the data and writes reconciliation programs, but still cannot manage to do all this in real time.

This mess can be reduced or eliminated by deploying an operational MDM. Operational MDM tools solve the synchronization problem using complex match-merge algorithms. Some of the vendors currently in the market are Siperian, IBM, Purisma, Oracle and SAP.
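The match-merge idea can be illustrated with a deliberately simple sketch: decide whether two customer records refer to the same person, then combine them without letting an empty field wipe out a known value. Real MDM tools use far more sophisticated matching; the name-similarity rule, the 0.85 threshold and the record fields below are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case, drop periods and collapse whitespace so trivial variations compare equal."""
    return " ".join(name.lower().replace(".", " ").split())


def is_match(record_a, record_b, threshold=0.85):
    """Treat two records as the same customer when their normalized names are similar enough."""
    ratio = SequenceMatcher(None, normalize(record_a["name"]), normalize(record_b["name"])).ratio()
    return ratio >= threshold


def merge(older, newer):
    """Prefer the newer record's values, but never overwrite a known value with an empty one."""
    merged = dict(older)
    for key, value in newer.items():
        if value:
            merged[key] = value
    return merged


if __name__ == "__main__":
    roadshow = {"name": "John A. Smith", "phone": "555-0100", "feedback": ""}
    survey = {"name": "john a smith", "phone": "", "feedback": "Happy with product"}
    if is_match(roadshow, survey):
        print(merge(roadshow, survey))
```

The merge rule addresses the second problem described above: the later survey adds its feedback without erasing the phone number captured at the roadshow.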

Analytical MDM is the architectural approach to take if the problem revolves around inconsistent reporting for business performance management. In simple terms, inconsistent hierarchies are getting reported out. This calls for a unified reporting view of the master data. The audience for this system would be the downstream data warehousing and business intelligence applications. Some of the MDM vendors selling their expertise in this area are Kalido, Oracle and IBM.

It is essential that an organization builds both these models to address its MDM needs. But which one to choose first depends on which problem is higher on its priority list.
