Thursday, November 1, 2007

The Data Warehousing Applications of SIF

In SIF (the Schools Interoperability Framework), the concept of interoperability has traditionally been applied to applications sharing data with each other. Each application in a Zone can publish the data for which it is authoritative. Other applications in the Zone subscribe to that data as needed. For example, some student information systems subscribe to food services and transportation data. Almost all SIF-enabled applications can subscribe to the student demographic data supplied by the SIS.

As data warehouse solutions become more prevalent in schools, it is worth discussing how they can interoperate with SIF. One of the most labor-intensive components involved in implementing and supporting a data warehouse is the set of extract/transform/load (ETL) processes. ETL processes are responsible for mining data from operational systems (the “E”), modifying and scrubbing the incoming relational data to fit a dimensional model (the “T”), and then placing that data into the dimensional model or a staging area (the “L”). Whether the data warehouse has been built from scratch using industry-standard tools or is a commercial product, a significant challenge lies in developing intimate knowledge of the operational systems’ native data structures. Over time, operational systems’ schemas change. This can add a potentially significant support burden to the data warehouse, just for the ETL component!
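The three stages can be sketched in a few lines. This is a minimal illustration only; the record fields, transforms, and staging structure are hypothetical stand-ins, not the schema of any real SIS or warehouse product.

```python
# Hypothetical sketch of the three ETL stages. Field names and
# transformation rules are illustrative, not from any real SIS schema.

def extract(operational_rows):
    """'E': pull raw rows from an operational system (a plain list
    stands in here for an ODBC/JDBC result set or file dump)."""
    return list(operational_rows)

def transform(raw_rows):
    """'T': scrub and reshape the incoming relational rows to fit a
    dimensional model, e.g. trim keys and normalize name casing."""
    cleaned = []
    for row in raw_rows:
        cleaned.append({
            "student_key": row["id"].strip(),
            "last_name": row["last"].strip().title(),
            "grade_level": int(row["grade"]),
        })
    return cleaned

def load(staging, rows):
    """'L': place transformed rows into a staging area keyed for the
    dimensional model (a dict stands in for a staging table)."""
    for row in rows:
        staging[row["student_key"]] = row
    return staging

staging_area = {}
raw = [{"id": " S001 ", "last": "  smith", "grade": "05"}]
load(staging_area, transform(extract(raw)))
```

The point is the pipeline shape, not the particular transforms: each stage is a separate, testable step, which is exactly where schema drift in the operational systems causes maintenance pain.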

SIF has the potential to ease the ETL burden for data warehouse initiatives. First, SIF provides a standard data model that most leading vendors in the K-12 space have adopted. In other words, the SIF Agents for those systems conform to a standard definition for data groups like student, staff, assessment, special programs, and finance. The SIF data model can thus be leveraged for ETL staging. Just be aware that SIF does not (yet?) provide any standards for dimensional data models.

Second, SIF provides a standard transport method for mining data out of the operational systems using XML over HTTP/S. Traditional approaches to ETL involve using ODBC, JDBC, file dumps, and other one-off approaches for accessing operational systems’ data. SIF’s Request/Response and Event models provide a much simpler way to load and maintain synchronization of operational data in a data warehousing environment. A “universal subscriber” Agent that stores incoming SIF data in its native XML representation or that “shreds” the XML into a relational staging area can be implemented for this purpose. The major advantage of the “universal subscriber” approach is that it essentially implements one interface to a Zone, versus implementing an interface per operational system as would be necessary with traditional ETL tools.
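A “universal subscriber” that shreds incoming XML might look something like the following sketch. The XML here is a deliberately simplified stand-in for a SIF StudentPersonal object (real SIF payloads carry namespaces and many more elements), and the staging table is hypothetical.

```python
# Sketch of a "universal subscriber" Agent shredding a SIF object into a
# relational staging area. The XML and staging schema are simplified,
# illustrative stand-ins for real SIF payloads and warehouse tables.
import sqlite3
import xml.etree.ElementTree as ET

sif_event = """
<StudentPersonal RefId="A1B2C3">
  <Name>
    <FirstName>Jane</FirstName>
    <LastName>Doe</LastName>
  </Name>
  <GradeLevel>05</GradeLevel>
</StudentPersonal>
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stg_student (ref_id TEXT PRIMARY KEY, "
    "first_name TEXT, last_name TEXT, grade_level TEXT)")

def shred(xml_text, conn):
    """Flatten one incoming SIF object into its staging row. Using
    INSERT OR REPLACE lets SIF Change events update the same row."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "INSERT OR REPLACE INTO stg_student VALUES (?, ?, ?, ?)",
        (root.get("RefId"),
         root.findtext("Name/FirstName"),
         root.findtext("Name/LastName"),
         root.findtext("GradeLevel")))

shred(sif_event, conn)
```

Because every publisher in the Zone speaks the same data model, this one shredding layer replaces the per-system connectors a traditional ETL tool would need.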

Another caveat that must be addressed if you consider using SIF for ETL into a data warehouse is data quality. SIF provides a powerful, open mechanism for moving data. However, one must never forget the GIGO principle (garbage in, garbage out). As applied to SIF, GIGO means that if bad data is entered into one operational system, then bad data will end up in ALL subscribing operational systems AND the data warehouse. How can this challenge be addressed?

I believe it starts with business processes. Organizational standards for data entry must first be defined. The people responsible for data entry must then be trained on those standards. An even better approach is to have them participate in developing the standards in the first place. There must also be systematic checks of data before it is loaded into its final destination, the dimensional model. These data quality checks can be implemented at various points in the process. The previously mentioned “universal subscriber” could be made responsible for some of the necessary validation. The process that loads the staged data into the dimensional model would also likely share in the responsibility of checking data quality.
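A systematic check of staged data before the dimensional load can be as simple as a rule function run against each row. The rules and field names below are illustrative organizational standards, not anything mandated by SIF.

```python
# Sketch of data-quality rules applied to staged rows before they are
# loaded into the dimensional model. The accepted grade codes and
# required fields are hypothetical organizational standards.

VALID_GRADES = {str(g).zfill(2) for g in range(1, 13)} | {"KG", "PK"}

def validate(row):
    """Return a list of problems; an empty list means the row may load."""
    problems = []
    if not row.get("ref_id"):
        problems.append("missing ref_id")
    if not row.get("last_name", "").strip():
        problems.append("missing last_name")
    if row.get("grade_level") not in VALID_GRADES:
        problems.append("unrecognized grade_level")
    return problems

clean = {"ref_id": "A1", "last_name": "Doe", "grade_level": "05"}
dirty = {"ref_id": "", "last_name": "Doe", "grade_level": "5th"}
```

Rows that fail could be quarantined for correction in the source system, which pushes the fix back upstream where GIGO says it belongs.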

SIF can be a great tool for inter-application data sharing. It can also play an important role in data warehousing initiatives by streamlining and simplifying ETL via a standard data model and data transport.
