Data source preparation, an obstacle in report development

To date, reporting tools have developed quite well. We have embraced many emerging tools like QlikView, Tableau, and Spotfire, and we still have classic choices like SAP Visual Intelligence, JasperReports, and Cognos. However, a key obstacle remains in report development: data source computing.

The goal of data source computing is to produce a convenient, usable result set for the report. SQL and stored procedures (SP) are the traditional way to perform data source computing, but they are difficult to design, debug, and migrate. Worse, when the computation spans multiple databases or non-database data sources, the traditional method breaks down entirely, to say nothing of JasperReports and other tools that are restricted to a single data source. To overcome this obstacle, we need other techniques; the typical ones are the data warehouse, the data source computing layer, the semantic layer, table joining, and API programming. I would like to share my views on the pros and cons of each, based on my personal experience.
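To make the obstacle concrete, here is a minimal hypothetical sketch of a cross-source computation that a single database's SQL cannot express: joining a database table with a CSV file in application code. The table, the CSV content, and the column names are all invented for illustration; sqlite3 merely stands in for any RDBMS.

```python
import csv
import io
import sqlite3

# A database table of orders (sqlite3 stands in for any RDBMS).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (cust_id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(1, 120.0), (2, 75.5), (1, 30.0)])

# A non-database source: customer names arriving as CSV text.
csv_text = "cust_id,name\n1,Acme\n2,Globex\n"
names = {int(r["cust_id"]): r["name"]
         for r in csv.DictReader(io.StringIO(csv_text))}

# The "data source computing" step: aggregate in SQL, then join with the
# CSV data in application code to build the result set the report needs.
rows = db.execute(
    "SELECT cust_id, SUM(amount) FROM orders GROUP BY cust_id ORDER BY cust_id")
result = [(names[cid], total) for cid, total in rows]
print(result)  # [('Acme', 150.0), ('Globex', 75.5)]
```

The join itself is trivial here; the point is that it already has to leave SQL and live in application code, which is exactly the gap the five methods below try to fill.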

Data Warehouse. A data warehouse builds a single, consistent data view with ETL, and many proven, mature tools support it, for example Teradata FastLoad, IBM DB2 warehouse offerings, and Oracle Warehouse Builder. The data warehouse has many advantages: cross-database computing, non-database computing, historical data tracking, high performance, and low load on the business database.

The major drawback of the data warehouse is its high cost, which mainly comes from building and maintaining it rather than from product purchase. A data warehouse requires dual experts who are skilled in both IT and business: they must be excellent in Perl, VBScript, Java, SQL, and stored procedures, as well as in the business logic. The builders have to perform complicated data splitting and consolidation, design the normal forms and the incremental-update algorithms, and prioritize task scheduling across multiple levels. Keeping the warehouse efficient and usable requires more than one ETL process. A data warehouse project can easily fail because these costs are hard to control.
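One of those hand-designed ETL chores, the incremental update, can be sketched as follows. This is a hypothetical timestamp-watermark scheme with invented field names; real warehouses would work against tables rather than dicts, but the logic the team must design is the same.

```python
def incremental_load(warehouse, source_rows, last_sync):
    """Upsert only rows modified after the previous sync; return new watermark."""
    max_seen = last_sync
    for row in source_rows:
        if row["modified"] > last_sync:          # skip unchanged rows
            warehouse[row["id"]] = row["value"]  # upsert into the warehouse table
            max_seen = max(max_seen, row["modified"])
    return max_seen

warehouse = {1: "old"}
source = [
    {"id": 1, "value": "new", "modified": 5},
    {"id": 2, "value": "fresh", "modified": 7},
    {"id": 3, "value": "stale", "modified": 2},  # older than watermark, ignored
]
watermark = incremental_load(warehouse, source, last_sync=3)
print(warehouse, watermark)  # {1: 'new', 2: 'fresh'} 7
```

Even this toy version shows why the work needs a dual expert: choosing the watermark column, handling late-arriving rows, and scheduling the load are business decisions as much as IT ones.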

Data source computing layer. An approach that has emerged in recent years, the data source computing layer tries to solve data source computing at the root with a general-purpose method. With scripting languages designed for structured and semi-structured data, it supports data retrieval from heterogeneous databases and non-database sources, and even computations that mix heterogeneous sources. In addition, it offers professional development and debugging tools that reduce the difficulty of complex computation. All in all, this method is more efficient than cumbersome SQL/SP. It is also more portable, because it is by nature independent of any specific database or reporting tool.

esProc and R are typical data source computing layers. esProc can output its result set via JDBC for direct use by the report, but it does not support correlation analysis and other statistical models. Conversely, R supports a rich set of analysis models, but it is awkward to invoke from a report.

This method typically computes fully in memory, so its single-machine performance is poorer than that of a data warehouse. However, both esProc and R handle distributed computing well, since both support big-data computing models such as Hadoop.

Semantic Layer. The semantic layer was introduced as the data warehouse became disappointing. It redefines the physical data model, which used to be quite difficult to understand, so that databases, text files, Excel, and XML are no longer strictly differentiated and report designers can operate on the data in a more user-friendly way. SAP Universe and QlikView are the most characteristic examples.

Both the semantic layer and the data warehouse are designed to build an easy-to-use data source, so their advantages are basically the same. However, building a semantic layer consumes fewer resources and less time, because it requires fewer procedures and is comparatively more agile. Its cost of establishment and maintenance is lower than that of the data warehouse but higher than that of the data source computing layer.

The main drawback of the semantic layer is its limited computing capability. It is built around wizards, complemented by elementary functions such as data source consolidation and computed columns, but it lacks a purpose-built structured-data computing script. The semantic layer can therefore hardly perform data source computing on its own; assistance from a simple data warehouse or ETL process is often necessary. Its fully in-memory computing is also weak in performance. Lastly, the semantic layer is often bundled with a reporting tool, which makes it poorly portable.

Table joining. This is one of the basic functions of the semantic layer. In many reporting tools it is offered separately as a simple way to solve the multi-data-source problem for single-data-source reports. For example, JasperReports and BIRT support only a single data set; if data from Oracle and MSSQL must appear in one report at the same time, you can take advantage of the table-joining functionality.
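What the tool does under the hood can be sketched as follows: query each source separately, then join the result sets in the tool's memory. This is a hypothetical illustration; two in-memory sqlite3 databases stand in for Oracle and MSSQL, and the tables are invented.

```python
import sqlite3

# Two separate databases, standing in for Oracle and MSSQL.
oracle = sqlite3.connect(":memory:")
oracle.execute("CREATE TABLE emp (id INTEGER, name TEXT)")
oracle.executemany("INSERT INTO emp VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

mssql = sqlite3.connect(":memory:")
mssql.execute("CREATE TABLE salary (emp_id INTEGER, pay REAL)")
mssql.executemany("INSERT INTO salary VALUES (?, ?)",
                  [(1, 900.0), (2, 1100.0)])

# Each query runs in its own database; the join happens in the tool's memory.
emps = oracle.execute("SELECT id, name FROM emp ORDER BY id").fetchall()
pays = dict(mssql.execute("SELECT emp_id, pay FROM salary").fetchall())
joined = [(name, pays[eid]) for eid, name in emps]
print(joined)  # [('Ann', 900.0), ('Bob', 1100.0)]
```

Because both result sets must be pulled entirely into memory before joining, this sketch also shows where the approach's performance limits come from.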

Table joining beats the other methods on workload and staffing cost, but it carries comparatively more feature restrictions: only the simplest joins can be performed, and general data source computing is not allowed; support for non-database data sources is insufficient; the bundled reporting tool and database are hard to migrate; and performance is quite poor because the computation runs fully in memory.

API Programming. Here the reporting tool's API is used to compute the data source and ultimately return the result set. Most reporting tools provide such a mechanism, for example SAP's CDS framework and JasperReports' Bean data source.

This method is the most flexible and can, in theory, solve any data source computing problem. Among all these methods, however, it involves the greatest development complexity.

To access non-database data sources, this method generally uses Java or another general-purpose language. Unlike languages designed for data computing, such as SQL, esProc, or R, these languages may require large amounts of code to implement even a simple algorithm, so even a simple computation can be difficult and labor-intensive. In fact, the method's value lies largely in its role as a complement to the semantic layer or table joining.
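The verbosity problem can be sketched with a toy example: a group-by-and-sum that is one line of SQL (`SELECT region, SUM(amount) ... GROUP BY region`) written out by hand, as API-programming data sources often must do. The data and names are invented; in Java, the same loop would be longer still.

```python
rows = [("east", 10), ("west", 5), ("east", 7)]

totals = {}
for region, amount in rows:        # manual grouping
    totals[region] = totals.get(region, 0) + amount

result = sorted(totals.items())    # deterministic order for the report
print(result)  # [('east', 17), ('west', 5)]
```

Multiply this by every filter, join, and aggregation a report needs, and the development-complexity claim above follows directly.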

In conclusion, the five methods above each fit their own scenarios: table joining fits simple join operations between databases; API programming fits simple non-database computations; the data source computing layer fits computations that require high portability or involve a complex data environment; the data warehouse fits LAN environments with high performance demands; and the semantic layer fits business users producing large numbers of simple reports.



About datathinker

A technical consultant on database performance optimization, database storage expansion, and off-database computation. Personal blog: datakeywrod; website: raqsoft.
