Chapter 5. Discussion
This study’s hypothesis stated that it is possible to build a code generator which will significantly improve development of data-intensive web-based applications by generating at least 50% of the data access code based on a specification of the application’s data model. To test this hypothesis, I designed an experiment, consisting of the following steps:
- creating an XML-based data definition language to describe the target applications;
- defining rules for generating data access methods based on the application’s data model;
- implementing the code generator;
- using the code generator to generate the data access code for three “real world” applications;
- measuring the results in terms of the amount of generated code, as well as the effectiveness and the efficiency of this approach.
All of these steps were successfully accomplished. In the following sections, I will discuss the major findings from this experiment, as well as some of the lessons I have learned from using this code generation approach.
Major Findings
The approach was tested on three “real world” applications with varying degrees of complexity. The results showing the amount of generated code serve as a good example of the general complexity of data-intensive web-based applications: approximately 1,042 lines of code were generated for each database table. These numbers confirm that there is a need for tools which will simplify development of such applications, and automatic code generation may be one such approach.
Effectiveness of Approach
The results of measuring the effectiveness of this code generation approach proved to be successful. In terms of the number of lines of code, 85% - 99% of the data access code has been generated automatically. In terms of the number of stored procedures, which may be also associated with the primary data access methods, 84% - 98% of the required procedures have been generated automatically. The Witness Identification application required very little customization in terms of data access functionality, so its data layer was almost entirely generated automatically, The Account Reporting application, on the other hand, due to the added complexity of the business logic of a financial system, had numerous customization requirements; therefore, only 84% - 85% of its data layer was generated automatically.
I conclude, that 41% of the Witness Identification application (approximately 12,000 lines of code), 42% of the Account Reporting application (approximately 21,000 lines of code), and 29% of the PRSSA application (approximately 42,000 lines of code) represent unnecessary complexity, which may be considered not very efficient.
Concerns About Efficiency
Despite the hypothesis being supported, the results of measuring the efficiency of this code generation approach raise some big questions. The PRSSA application and the Witness Identification application used only 31% and 35% of the generated primary data access methods. The Account Reporting application with a large amount of manually implemented code used as few as 20% of those methods. I did not calculate these values in terms of lines of code, because it was unnecessary: it is safe to assume that if 80% of the generated stored procedures were not required by the application, approximately 80% of the code was not required either (even though most of the code is encapsulated in private methods, this code supports the public methods, with only a small part of it supporting the entire application).
I conclude, that approximately 12,000 lines of code in the Witness Identification application, 21,000 lines in the Account Reporting application, and 42,000 lines in the PRSSA application represent nothing more but unnecessary complexity. These numbers represent anything but efficiency.
Observed Patterns
A detailed look at the generated methods and their usage revealed several patterns:
1. Methods dealing with a singe data object, such as creating, retrieving, modifying and deleting a record, are almost always used by the application when generated.
2. Only half of the generated methods dealing with data object links are used by any application: a link needs to be created, deleted, or generated for all values of one of the linking data objects.
3. When a collection of data objects is retrieved with paging criteria, retrieving that collection without paging criteria is only necessary as a list (i.e., a minimized set of fields).
However, these patterns may be used for only minor improvements in efficiency. The main problem lies in generating methods for retrieving collections of records based on relationships. There is no apparent pattern in the usage of these methods, and their numbers are overwhelming. In fact, one single data object, if having relationships with a few other data objects, may cause the generator to produce dozens of unnecessary methods (for example, the User object in the Account Reporting application, being related to four other data objects, resulted in 90 data access methods, only 19 of which were used.
Support of Hypothesis
Therefore, I conclude that the study’s hypothesis is supported. The presented approach, indeed, generates more than 50% of the required data access code automatically based on the application’s data model alone. However, while considerably improving development of such applications, this approach complicates the process, to a certain extent, by cluttering the application with unused code.
Lessons Learned
I have learned numerous lessons through conducting this experiment, most of which are either too specific in terms of implementation details, or represent tangent considerations. Nevertheless, I found some of these observations to be intriguing and relevant to the subject of this research.
Simplicity Versus Flexibility
The main thing I have learned from this experiment is that a code generation system, as well as, arguably, almost any software, is a trade-off between a flexible, yet complex system which allows the specification of numerous detailed criteria – and a rigid, yet simple system, which, in comparison, has most of the options hard-coded and generates code which is more standardized. It might appear that a flexible system is more desirable, however, my experiment proved to me that “keeping it simple” is, in fact, a better approach.
Argument for Simplicity. Take, for example, the problem of weak entities, which I briefly mentioned in Chapter 2. Vigna (2002c) suggests that to handle weak entities, there must be an identifying function from one entity to another, specifying the parent and child entities. Deletion of a parent entity implies deletion of all its children. However, I did not provide a way to specify weak entities in my schema. Instead, I added a parameter to the standard “delete” method my generator created which specified whether the system would delete all child entities upon deleting the current entity. In this case a simplified approach made the system more rigid, yet less complicated. In my opinion, in this case such a simplification was justified. Nevertheless, I must note that when the data model becomes more complex, my solution fails when dealing with “recursive weakness” – i.e., several levels of weak entities.
In general, the main drawback of a complex system is that as we move towards more complexity the data definition language’s abstraction level moves towards that of a general purpose programming language. And while it is tempting to provide the ability to specify in the data model as much as possible, we must remember that this is only a model and it is meant to be a simplification of the actual thing – because simplification is the only reason we are using it; alternatively we could simply manually implement the entire application, using an existing general purpose programming language.
Possibilities for Improvement
There are many possibilities for improving the code generation system I designed. In this section, I will briefly elaborate on some of these possibilities.
Better Data Definition Language Syntax. In regards to usability, or the convenience of using the system, the main issue I observed had to deal with the syntax of the data definition language. There were numerous issues that could be improved, yet the main issue was readability. The data model description is viewed from a monitor, therefore is should be as compact as possible. When designing my syntax, I decided to use XML elements instead of attributes to describe record fields, which dramatically increased the size of the specification and made it hard to read: a definition of a class could span multiple screens. The usage of attributes would have significantly improved this. Consider the example in Figure 7.
Dealing with Derived Fields. Some record fields have values which are derived from other fields or are calculated by the database. For example, a column like “total price” in a shopping cart application might be derived from other columns, such as price/tax/shipping, etc… Another example would be a modification field, storing the date and time when the record was last modified. This field can be updated by the database each time the record is updated.

Figure 7. Better Data Definition Language Syntax.
Specifying these kind of fields in the model proved to be a challenge. I came up with several types of readonly fields – such as created and modified – and specified such a field as a readonly field. However, a more general and, therefore, better solution would be to specify a value of a field. If a value is not specified, the field is treated normally. If a value is specified – the field is derived based on the specified value. The value could be an SQL statement (for example, “firstName + ‘ ‘ + lastName” for a full name), or a global function like “getdate()”.
Defining Data Views for Retrieval Methods. Finally, the most important improvement could be changing the way collections of data are specified, as well as the rules for their retrieval.
It has been shown that there are numerous ways in which collections of records can be retrieved. Assuming that paging, filtering and sorting functionality can be applied to each of these collections, their retrieval methods may differ in two respects: (1) the criteria used to retrieve the data, and (2) the set of fields retrieved. Chapter 4 has demonstrated that generating an exhaustive set of data retrieval methods is very inefficient. Handling the requirements in regards to different sets of fields included in the retrieved collections has been handled with the help of the IncludeInList, IncludeWithParentTable and ExcludeFromTable schema attributes, which proved inadequate, as the exceptions to these rules kept multiplying as the requirements became more detailed.
An alternative solution was offered in the WebML framework, which I described in Chapter 2. WebML offered the concept of content units, used to model the “composition” of the application’s data – i.e., how it is presented to the user. After conducting my experiment, I am convinced that the WebML (Ceri et al., 2000) approach is superior to mine: explicitly defining content units for each data object might simplify development by making the code much cleaner.
However, instead of using the content units, proposed in WebML, I would use a simplified concept, which may be described as a data view. Similar to WebML’s data unit, a data view might represent a particular “view” of a collection of data objects. For example, a User data object might have a minimized data view for displaying in drop-down lists, an all-inclusive data view, a data view with or without specific data fields, etc... Such a data view would include a references to a data object and a data retrieval operation, an ordered set of retrieved (or derived) data fields, as well as filtering and paging criteria.
An additional benefit of such an approach would be separation of concerns. In my approach, I have to use the data model to specify both – the data itself and how it is structured, and the details of how that data is displayed. The data view approach would eliminate this problem: views would reference data objects and their attributes from the data model and would specify how they should be displayed for the user. Besides, these specifications could be unique for each view.
Furthermore, assigning such data views to explicitly defined data retrieval methods could, potentially, solve the efficiency problem of the approach used in this research. Therefore, implementing this new solution, in my opinion, should be the first step in further research.