Chapter 3. Methodology
In this chapter, I will describe the target application architecture, the process of defining the specific code to be generated through abstracting common functionality, and the tasks I need to accomplish in order to test my hypothesis. Since my modeling approach is less abstract than the entity-relationship model and is closer to a logical database model (i.e., a model based on tables, records and fields), I will use database terminology: I will refer to data objects, or entities, as “records;” and to their attributes as “fields.”
Target Application Architecture
The architecture of the target application is based on the Model/View/Controller design pattern, discussed in the previous chapter, and consists of three conceptual layers: data, business logic and presentation. The data layer consists of two parts: the data storage component and the application-level data access code, both of which will be described in this section. However, many of the data access requirements are set by the application’s presentation layer – which is why I will briefly describe these requirements.
Data Access Requirements Set by the Presentation Layer
Adding, modifying and displaying a single record is a trivial example of data access requirements set by the presentation layer. However, when dealing with collections of records, such requirements become more complex. Retrieving and displaying collections of records is a requirement for all data objects; it enables a user to browse or search through a set of records and select one or more records for viewing, editing, deleting, or manipulating in other ways. Figure 3 demonstrates a possible implementation, including the following features:
- Sorting. For convenience, records should be sortable by all fields displayed in the list in both, ascending and descending order.
- Filtering. The size of the displayed collection may be reduced by entering filtering, or search criteria.
- Paging. This feature provides the user with the option to view the collection “one page at a time.” This becomes absolutely necessary with large collections: it is not unheard of for a table to contain millions of records – displaying all of them on one page is impossible. The pager on Figure 3 is an example of complex functionality which can be implemented for all data objects by default – as opposed to the approach used in WebML and discussed in the previous chapter.

Figure 3. Displaying a Collection of Records.
The described features are usually combined and used whenever there is a need for displaying a set of records. Another example of this combination is displaying and modifying a collection based on a relationship. This is a very common requirement and it becomes essential even with a small collection of records: consider assigning multiple users to a role, or, visa-versa: multiple roles to a user. In this case, the collection includes an additional field, which indicates the existence of a relationship between each record and some record from another table.
Data Storage Component
Relational database as the data storage component. The code generation system described in this thesis assumes that the application's data storage component is a relational database. Turau (2002) argued that there is a current trend to use plain text as a format for storing knowledge persistently, quoting Hunt and Thomas (2000), who “believe that the best format for storing knowledge persistently is plain text.” Nevertheless, they acknowledge that there are drawbacks to this paradigm; for example, it may be computationally more expensive to interpret and process a plain text file. I agree with this concern. According to Codd (1970), the creator of the relational model, which is the foundation of relational databases, the number of possible criteria by which the data can be manipulated is very large: “In a large data bank each subset of the data has a very large number of possible (and sensible) descriptions, even when we assume (as we do) that there is only a finite set of function subroutines to which the system has access for use in qualifying data for retrieval” (p. 382). A relational database, compared to a text file, is by far superior as a tool for storing and retrieving data, because it allows “the designer to focus on the logical representation of the data and its relationships, rather than on the physical storage details” (Rob & Cronel, p. 58).
Database design. Discussing the design of a database’s logical model – i.e. the tables which represent data objects and relationships between them and store instances of both, data objects and relationships, as records – is beyond the scope of this thesis. However, I will briefly describe the approach I used to represent data objects and relationships as database tables.
There were three abstract data concepts I needed to represent in my database design: (1) a data object, (2) a one-to-many relationship, and (3) a many-to-many relationship:
- A data object is the same as an entity in the Entity-Relationship model and is implemented as a table, which I refer to as a record table. A record table contains records, each representing one data object. For convenience, each record has a unique numeric identifier, which is an automatically-incremented integer field. This identifier is used as the record’s “primary key” – a field uniquely identifying a record.
- A one-to-many relationship describes a situation in which an instance of one data object can be related to zero or more instances of another data object. This type of relationship is implemented with the help of a “foreign key” – which is a record’s primary key, stored in another record – thus, relating one record to the other.
- A many-to-many relationship describes a situation in which one data object can be related to zero or more instances of another data object and visa-versa. For this case, I use a table, commonly known as a link table, in which each record consists of two foreign keys. The primary key of this record is the combination of the two foreign keys – which, as an extra bonus, guarantees that there are no duplicate relationships.
In addition to tables, a database contains stored procedures, which are database-level data access methods. There are numerous reasons for using stored procedures in combination with application-level data access code, which include better performance, ease of modification and application security. Discussing these reasons is beyond the scope of this thesis.
Application-Level Data Access Code
The application-level data access code contains classes which provide a public interface for accessing the database directly or through its stored procedures. To show why each class is necessary in this implementation, consider a simple example of writing a user’s name to a web page. The following steps must be made: (1) connect to database, (2) retrieve the required record and store it in some data structure, (3) disconnect from database, (4) write the name fields of the record to the web page. An optimized implementation of this operation written in c# contains at least 15 lines of code! Obviously, such code should be written only once. Ideally, to display a user’s name on a web page, I would like to write the following code:
User myUser = new User(Page.Request[“UserId”]);
Response.Write(myUser.Name);
The first statement retrieves the identification of the record to be retrieved and instantiates a class which exposes the fields of a user record. The second statement writes the user’s name to the web page. Creating, updating, deleting and displaying sets of records should be just as simple.
This code simplicity can be achieved in multiple ways. My approach is based on abstracting the most common database access functionality into one abstract class, which I will refer to as the DataClass, which is then extended by concrete classes. In addition to the DataClass, there are several abstract classes, which serve as base classes for concrete “data containers” – i.e., wrappers for standard data structures which provide convenience features (i.e., “syntactic sugar”) for manipulating collections of records. These abstract classes include the following:
- IDataTable: an interface for a table of records;
- IDataRow: an interface for a record;
- IDataField: an interface for a record field;
- AbstractDataTable: a data structure holding a collection of records, which implements the IDataTable interface.
By using the described encapsulation approach, we end up with code which is much more complex and difficult to write and maintain, yet we are protected from run-time errors caused by incorrect direct references to the structure of data in the database. In the next section, I will show how all this code can be automatically generated – which eliminates any issues which would have been caused by the added complexity. The code of the described abstract classes is available in Appendix A.
Selecting the Code to be Generated
The application’s data layer contains a set of classes and database objects (tables and stored procedures), which provide data storage and data access functionality. In order to generate these items, we need to identify the parts which can be abstracted and generated based on an application’s data model.
Abstracting Code for Generation
Upon closer examination, it becomes apparent that the items in the data layer – the database objects, such as the tables and the stored procedures, and the concrete application-level classes – are nearly identical for any data object. The only differences are in (a) the names of the item, (b) the parameters, and (c) the set of data access methods. Following is a detailed description of these items. The sample code for some of these objects is available in Appendix B.
Database table for each data object and each data object link. A table is generated by executing a SQL statement. The only parts which need to be specified are the name of the table and the name and data type of each parameter.
Stored procedures for each data object and each data object link. Stored procedures, like tables, are generated by executing SQL statements. Each procedure is unique in terms of its data object name and, in some cases, its parameters. Data object links require an identical set of data access methods for their creation and deletion by various criteria. However, different data objects, require unique data access methods, based on their fields and, most importantly, their relationships with other data objects. This problem is the core of this thesis’s hypothesis, which I will explore in the next chapter.
DataField and DataRow classes for each data object. These classes implement the IDataField and IDataRow interfaces, providing access to a record’s fields and the data associated with a single record field (such as custom sort expressions, displayed titles, etc.). Only the data object name and parameters are different.
DataTable and DataList classes for each data object. These classes extend the AbstractDataTable class and provide access to each record in the table by exposing collections of DataRow classes and DataField classes. A DataList provides a collection of records with a minimized set of field. In addition to the data object name and parameters, data objects differ in terms of the set of fields which are displayed in these classes.
Record class for each data object. This class extends the DataClass and provides access to a single record, exposing its fields and the “update” method. The code differs only in terms of the data object’s name and its attributes.
RecordData class for each data object. This class extends the DataClass and provides access to methods operating on collections of records and methods, which do not require the instantiation of a single record class. It exposes the previously described stored procedures and differs not only in terms of its name, but also in terms of the set of stored procedures it must expose.
RecordLink class for each data object link. This class extends the DataClass and provides access to methods facilitating the creation and deletion of data object links. The code differs only in terms of the data object names which constitute a link.
List of Code to be Generated
In conclusion, I provide a list of specific items to be generated for the data layer of an application.
For each data object:
- a database table;
- a set of stored procedures facilitating data access;
- a DataField class for each data object attribute;
- a DataRow, DataTable, DataList, Record and RecordData class.
For each data object link:
- a database table;
- a set of stored procedures facilitating data access;
- a RecordLink class.
Experiment Design
This section describes the specific tasks I need to accomplish in order to test my hypothesis. These tasks can be divided into (a) implementation tasks, which deal with describing the data model, defining rules for data access methods and constructing the code generator; and (b) testing tasks, which deal with applying the implemented system to generating the code for real applications and measuring the results of using this approach.
Implementation Tasks
Describing the data. The approach I am describing in this study is to generate code based on the data model of the application. The data model has to be expressed in a way, which can be understood by the code generator. The data model must include data objects and their attributes, and provide a way to specify relationships between data objects, as well as additional information which will define the details of data access operations. Designing a data definition language for such a model is the first implementation task.
Deriving data access functionality. The previous section included a list of code items – both database-level and application-level – which need to be generated. However, some of these items mentioned sets of data access operations without enumerating them explicitly – which is the main point of this study’s hypothesis. Therefore, I must come up with rules which define the kinds of operations which will be generated for different data model combinations. Defining these rules is the second implementation task.
Implementing the Code Generator. Building the actual code generator, which will take as input a data model, described in the data definition language I design, and will generate the required code is the third implementation task.
Testing Tasks
Testing approach on real applications. The first step in testing the hypothesis will be checking the feasibility of the developed system on real applications. For this, I will attempt to generate the data layer for three data-intensive web-based applications. Each of these applications represents a distinct set of requirements, so the combination of the three will present a diverse scope of development requirements.
The first application, referred to as Witness Identification, is a web-based tool used in criminology, specifically – for eyewitness identification. A user (a witness) is presented with a sequence of head shots of suspects, selected from a set of several hundred thousand images. The application is relatively simple; its main challenge is manipulation of a very large set of data.
The second application, referred to as Account Reporting, is a more complex system – an online tool which provides university’s constituents with access to various university accounts. The system uses multiple databases and requires elaborate data access functionality to generate complex data reports.
The last application, referred to as PRSSA, is a collection of web sites with a complex content management system, which includes regular web sites, a blog, a career web site and numerous administrative functionality. The challenge presented by this system is the amount of different features it contains.
Measuring results. To determine how successful was my approach, I will conduct the following measurements for each of the described applications:
1. Amount of generated code. The following measurements will demonstrate the amount of the generated code in regards to the scope of each application. For each application, I will measure the following:
- number of data objects, data object links and data object attributes;
- number of generated database tables;
- number of generated database stored procedures and lines of code;
- number of generated classes and lines of code;
- total generated lines of code.
2. Effectiveness of approach. This measurement will demonstrate what part of the application’s data access code was generated. For each application I will measure the following:
- lines of code: total, generated, percentage of generated;
- stored procedures: total, generated, percentage of generated.
3. Efficiency of Approach. This measurement will demonstrate what part of the generated data access code was actually used in the application. For each application I will measure the following:
- stored procedures: generated, generated and used in application, percentage of used.
I consider a stored procedure to be an indicator of a data access operation, as opposed to a public method, because most of the public methods provide wrapper/convenience functionality and, therefore, are not essential for the application’s functionality. Stored procedures, on the other hand, are the actual link between the application code and the data stored in the database; they may be also described as corresponding to the primary data access methods, whereas all other methods are secondary and use the primary methods for accessing the data. By “used” I imply that the stored procedure were accessed from the application’s business or presentation layer, or from another part of the data layer, which, in its turn, was directly or indirectly accessed from the application’s business or presentation layer.