
Chapter 2. Background and Related Work

In this chapter, I will provide an overview of the research problem: I will describe data-intensive web-based applications, as well as some implementation issues caused by recurring functionality requirements in such applications. I will show that automatic code generation is a possible solution to these issues, after which I will explain why it is reasonable to generate only part of the application, and why the data access functionality may be the optimal choice. I will also provide a critical review of several existing code generators and approaches to modeling data access functionality.

Definition of Data-Intensive Web-Based Applications

The subject of this thesis is data-intensive web-based applications. Many definitions have been offered that separate data-intensive applications from other types of web-based applications, or web sites. Fratenali and Paolini (2000) define these applications as characterized by high volumes of data to be published and maintained over time (p. 325). According to Merialdo and Atzeni (2000), data-intensive web sites are large sites based on a back-end database with a fairly complex hypertext structure (p. 50). According to Jacob, Schwarz, Kaiser, and Mitschang (2006a), data-intensive web sites mainly focus on making large amounts of data available on the web (p. 77). Similarly, Ceri, Fratenali, and Matera (2002) suggest that these applications' main purpose is presenting large amounts of data to their users (p. 20). Other researchers offer similar definitions.

While all of these definitions correctly describe this type of application, using such criteria as “large/high volume amounts of data,” or “complex hypertext structure” may be misleading. For example, a web site that displays its content on a single web page can, in fact, be data-intensive – provided that this content is stored in and retrieved from a database – although it does not present large amounts of data to the user. On the other hand, a web site containing thousands of static HTML pages clearly presents large amounts of data to the user; however, I would argue that it cannot be described as data-intensive, since this data is not retrieved from a database or generated dynamically and, therefore, does not require any data-related operations. The amount of data also cannot be used as a qualifier, for it is hard to define a boundary between intensive and not intensive in terms of bytes, characters, or web pages.

Still, these applications clearly stand out from the rest. They are not composed of pre-coded static HTML pages: their content is served from a data repository – which may or may not be a database – with web pages generated dynamically. This causes a significant part of the application’s code to be dedicated to supporting this requirement, which justifies, in my opinion, using the relative amount of data access code as a qualifier for such systems. The three applications used in this study have between 19% and 38% of their code handling data access, although in my experience data access code can represent between 10% and 80% of an application’s code. Considering these numbers, I conclude that for an application to be considered data-intensive, at least 10% of its code should handle data access.

Therefore, for the purposes of this research, I will define data-intensive web-based applications as systems that require comprehensive data access functionality for providing web-based “read” or “read/write” access to data stored in a data repository, such as a database, with at least 10% of the application’s code handling this functionality.
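The quantitative part of this definition can be expressed as a simple ratio. The following Python sketch is only an illustration: the function names and the line counts are hypothetical, and counting lines of code is an assumed way of measuring "the application's code."

```python
# The 10% threshold is the one proposed in the definition above.
DATA_INTENSIVE_THRESHOLD = 0.10


def data_access_share(data_access_lines: int, total_lines: int) -> float:
    """Return the fraction of code lines that handle data access."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return data_access_lines / total_lines


def is_data_intensive(data_access_lines: int, total_lines: int) -> bool:
    """Apply the threshold to the measured share of data access code."""
    return data_access_share(data_access_lines, total_lines) >= DATA_INTENSIVE_THRESHOLD


# Hypothetical line counts, for illustration only.
print(is_data_intensive(1900, 10000))  # share of 0.19
print(is_data_intensive(500, 10000))   # share of 0.05
```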

Automatic Code Generation as a Way to Improve Development

The implementation of data-intensive web-based applications contains numerous recurring functionality requirements, which lead to repetitive coding patterns and, often, duplicate code.

One solution to these issues is to factor out as much common functionality as possible into abstract classes, which can be reused, with concrete classes implementing application-specific details; this approach can potentially minimize the amount of repetitive work and duplicate code. However, experience has shown that every new application has its own unique requirements, which leads to updating the abstract classes with new code each time they are reused. As a result, the abstract code, while catering to “the most common denominator,” very soon becomes unnecessarily complex and ends up supporting numerous functionality requirements, only a small part of which are used by any single application.

A better approach is to automate development by generating the repetitive part of the code. Although generated code contains numerous repetitive elements, it does not lead to any of the problems caused by code duplication: any edits are made to the specification, whereas the code itself is never manually altered.
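To make the idea concrete, the following Python sketch generates repetitive class definitions from a small specification. It is a toy illustration, not the generator developed in this thesis: the specification format, the template, and the entity names are all assumptions chosen for brevity.

```python
from string import Template

# Hypothetical specification: entity names mapped to their attribute names.
SPEC = {
    "Author": ["first_name", "last_name"],
    "Article": ["title", "body"],
}

# Template for one generated class; only the specification is edited by hand.
CLASS_TEMPLATE = Template(
    "class $name:\n"
    "    def __init__(self, $params):\n"
    "$assignments\n"
)


def generate_class(name, attributes):
    """Render the source code of one class from its specification."""
    params = ", ".join(attributes)
    assignments = "\n".join(f"        self.{a} = {a}" for a in attributes)
    return CLASS_TEMPLATE.substitute(name=name, params=params, assignments=assignments)


def generate_module(spec):
    """Render all classes described by the specification."""
    return "\n".join(generate_class(name, attrs) for name, attrs in spec.items())


source = generate_module(SPEC)
print(source)
```

Adding an attribute means editing SPEC and regenerating; the repetitive constructor code is never touched by hand.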

Benefits of Automatic Code Generation

Automatic code generation offers numerous benefits. According to Jacob et al. (2006a), “creating the specifications of an application is much less time consuming than creating the equivalent code” (p. 78). Whitehead, Ge, and Pan (2004) noted that code generation reduces monotonous work: programmers only need to build models and leave the tedious coding to the generator; “this enables developers to focus on more important development efforts like the domain logic and user interface design” (p. 205). These development efforts can be considered more important simply because they cannot be automatically designed by a machine. In addition to this, by using automation, developers avoid the error-prone process of manual refactoring. Whitehead et al. observed: “to add new features you simply change the model and regenerate the code; then make minimum modifications to the presentation layer” (p. 205). Cleaveland (n.d.) offered a detailed overview of the benefits of using a code generator:

  1. Specification Level versus Code Level … Specifications are much easier to read, write, edit, debug, and understand than the code that implements the specification…
  2. Separation of Concerns: all too often software is constructed with different concerns all jumbled together… Generators provide a way for separating [these] concerns.
  3. Multiple Products: In … typical situations, many other files [in addition to the code itself] may also be generated, including [documentation, test scripts, diagrams, and simulation tools.]
  4. Consistency of Information… All too often, fixing a bug or updating software introduces other errors, because a “piece” of information was not updated consistently across the whole product. In a software-generation approach, one simply updates the specification and regenerates the software.
  5. Correctness of Generated Products: Many program generators create thousands of lines of code that are far more reliable than if they were hand crafted.

To sum up, a code generation approach offers the advantage of instantly generating code that is optimized, consistent and thoroughly tested. Most importantly, it increases productivity by enabling developers to focus on the more important areas of the application, which cannot be generated by a machine.

Selecting Code for Automatic Generation

It has been noted that only a small part of application functionality can be generated by a machine (Glass, 1996). To have something generated, we must describe to the machine what exactly we expect it to generate. Describing a piece of code in a way that a machine can reproduce it is, quite obviously, more complicated than just writing it. Therefore, using this approach makes sense only if we would otherwise have to write the same code multiple times. Thus, the first step in automating development is identifying recurring code patterns, which can be done by analyzing how applications of the selected type are decomposed during development.

Conceptually, data-intensive web-based applications can be described as consisting of three layers: data, application (business) logic, and presentation. This decomposition is based on the Model-View-Controller (MVC) design pattern (Java BluePrints, n.d.), which separates the core business model (“model”) from the application logic (“control”) and the presentation (“view”). Each layer has its specific responsibilities. The data layer consists of the data storage component (in most cases – a database) and the code for accessing (i.e. storing, retrieving and modifying) the data. The application logic layer implements the business rules of the application and is responsible for processing the data. The presentation layer is the user interface, which may be a web site, a desktop application, a console application, etc.

The application business logic layer is unique to each application and does not contain any evident recurring patterns, so its code is not a good candidate for specification and automatic generation. The presentation and data layers, on the other hand, contain multiple recurring code patterns – which makes it possible for parts of this code to be abstracted and generated automatically. In order to identify the parts most suitable for automatic generation, I will look at these two layers in more detail, while consulting some of the existing research in this area.

Overview of Modeling Data-Intensive Web-Based Applications

A code generator is “a program that translates a domain specific language or specification into application source code” (Whitehead et al., 2004, p. 209). Therefore, without loss of generality, the process of code generation can be described in terms of (a) modeling the features to be generated, and (b) translating the model into code. The second part is a straightforward process and will be briefly examined in the methodology chapter of this thesis. The concept of modeling, on the other hand, is more involved and presents numerous design and implementation options, which are being actively explored by scholars and developers alike.

Most existing systems model both the data and presentation layers of a web-based application, decomposing the two layers into more specific conceptual parts. WebML, or the Web Modeling Language, introduced by Ceri, Fratenali, and Bongio (2000) and extended by Ceri, Fratenali, and Matera (2002), is one of the more comprehensive solutions. This modeling framework consists of four perspectives: (1) a structural model, which is used to describe the underlying data; (2) a hypertext model, which consists of a composition model, used to specify all the ways in which data might be displayed on the page, and a navigation model, which expresses how pages are linked; (3) a presentation model, which describes the graphic appearance of pages; and (4) a personalization model, which allows the modeling of users and groups and their relations to user- or group-specific data and settings.

The AutoWeb system (Fratenali & Paolini, 2000) takes a very similar approach in implementing the HDM-Lite modeling framework, which allows the description of a web application by a schema in three parts: (1) the structure model (i.e. data model), (2) the navigation model, and (3) the presentation model. Other modeling frameworks take very similar approaches: Bochicchio and Fiore (2004) specify (1) an information design (i.e., the data model), (2) a navigation design, (3) a publishing design (i.e. presentation model) and (4) an operation design (which is a model of the data access requirements). Jacob et al. (2006a) and Jacob, Schwarz, Kaiser, and Mitschang (2006b) describe a content model (or data model), composition and navigation model, and a presentation model. Similar approaches are offered by Merialdo and Atzeni (2000), Milosavljevic et al. (2002), Zhang, Chung, and Chang (2004), Jensen, Tolstrup, and Hansen (2004) and Turau (2002), with only slight differences.

To sum up, existing code generation systems facilitate modeling of data-intensive web-based applications along the following lines: (1) the structure of underlying data, (2) data access operations, (3) web site navigation, and (4) web page presentation. These modeling directions are easily combined into modeling the data and presentation layers, in accordance with the previously described MVC design pattern; with the data layer represented by the data model and the data access operations, and the presentation layer – by the application’s web pages and the linkages between them (or the navigation system). However, in my opinion, modeling the presentation layer presents significant problems, which I will discuss in the next sections.

Argument against Modeling Web Pages

The problem of modeling web pages can be viewed in different contexts. Certainly, there are benefits to it: the development of the application becomes more systematic and, thus, the entire process can be streamlined and subjected to the rigorous approach of software engineering. Nevertheless, consider a basic content management system, which consists of web pages providing the ability to add, edit, delete and view data. Even such basic functionality requires a very fine level of detail to handle user interaction requirements that are unique to each application and cannot be easily described in a model. Modeling these fine details is pointless: the model’s level of abstraction would have to be the same as the code’s, and achieving that would mean reinventing a general purpose programming language, which makes little sense.

A simplification of the user interface through abstraction is another approach. Milosavljevic et al. (2002) propose to abstract the user interface of data-intensive web-based applications to the following three types: (1) row per page, (2) table per page, and (3) parent-child per page. This, in my opinion, is both an over-specification and an oversimplification. For example, a row-per-page is described as a page displaying a single table row (i.e. displaying one data record, which is stored in the table row), enabling the addition of a new row and the updating of an existing row. This is an over-specification: additions and modifications of records can take place in various contexts, and one record per page is only one of them. As a matter of fact, a record-per-page format is best suited for displaying or editing a record, but not for adding or deleting one. At the same time, it is an oversimplification: a record, or a data object, can be represented by a join from multiple tables. The same reasoning can be applied to the table-per-page and parent-child-per-page layouts.

Therefore, I conclude that neither modeling the user interface’s fine details, nor simplifying it through abstraction is a reasonable solution. In my opinion, user interface requirements are unique to almost every application and too important to be dropped.

Argument against Modeling Website Navigation

Modeling the navigation of a web application means describing the linkages between web pages and displaying this system on the web site as a set of navigation menus. I disagree with this approach for two reasons.

Reason #1: two different systems. Web pages and web site navigation are two different systems and should not be confused with one another. A page is not a menu option, although a menu option is always mapped to a page. Why? Because a “real world” web application may consist of hundreds or even thousands of pages – dynamic or not – all of which simply will not fit on the web site’s menu. Each menu option, obviously, should be associated with a specific page. But it is not a 1-to-1 relationship, for there may be plenty of pages mapped to the same menu item. Using web pages to model the web application’s navigation system is feasible only for relatively small applications. Besides, website navigation is a tool for navigating the site’s content – it’s not a catalog.
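The many-to-one relationship between pages and menu options can be sketched as two separate mappings. The Python example below is an illustration under assumed names: the page paths, menu labels, and helper functions are hypothetical.

```python
# Hypothetical site structure: each menu option points to exactly one page...
MENU_ITEM_TARGET = {
    "News": "/news",
    "About": "/about",
}

# ...but many pages may be associated with the same menu option.
PAGE_TO_MENU_ITEM = {
    "/news": "News",
    "/news/archive": "News",
    "/news/article": "News",
    "/about": "About",
}


def active_menu_item(page):
    """Return the menu option to highlight for a page, or None if unmapped."""
    return PAGE_TO_MENU_ITEM.get(page)


def pages_under(menu_item):
    """Return every page associated with a menu option."""
    return sorted(p for p, item in PAGE_TO_MENU_ITEM.items() if item == menu_item)


print(active_menu_item("/news/archive"))
print(pages_under("News"))
```

Keeping the two mappings apart means that pages can be added without touching the menu at all.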

Reason #2: requirement for dynamic navigation. Modeling navigation will render a web site’s structure static. A web site can be viewed as a collection of content presented in the form of web pages. The way these web pages are organized and interlinked does not have to be static. Applications of dynamic navigation range from trivial content management features, enabling the site administrators to change the site’s entire navigation structure (new pages are created, existing pages are moved around, menus are altered) to adaptive websites, which change their structure based on the users’ preferences, or even their navigation history – the possibilities are endless. Therefore, website navigation, in my opinion, should not be modeled, but rather made as dynamic as possible.

Selecting the Data Layer for Generation

Examining existing code generation systems and current research, as well as the actual code of existing data-intensive web-based applications, I found that the data layer contains the most recurring code patterns in such applications: the data access code, as well as data access functionality in general, remains almost the same regardless of the application domain. Therefore, I concluded that the data layer is the optimal part of the application for abstracting common features and automatic code generation.

Modeling Data and Data Access Functionality

In the previous section, I identified the data layer as the optimal part of the application for modeling and automatic generation. In this section, I will discuss how such modeling can be implemented.

The Entity-Relationship Model Approach

The most common approach to model the application’s data is to use the entity-relationship (ER) model, introduced by Chen (1976) and later extended in several ways. The basic idea of the ER model is that “using sets and relations we can model objects of the real world and their inter-relationships” (Vigna, 2002b, p. 35). Consider a basic news website: in this example, “author” and “article” might be the main entities. Each entity has a set of attributes: the author entity might have attributes like “first name” and “last name,” the article entity might have attributes like “title,” “body” and “date.” Relationships model connections between entities. In our example, we may have an “authorship” relationship between an author and an article.
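The news website example can be written down as a small data structure. The following Python sketch is one possible in-memory representation, not a standard ER notation; the class names Entity and Relationship and their fields are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    """An entity with a name and a list of attribute names."""
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    """A named connection between two entities."""
    name: str
    source: Entity
    target: Entity


# The entities, attributes and relationship named in the example above.
author = Entity("author", ["first name", "last name"])
article = Entity("article", ["title", "body", "date"])
authorship = Relationship("authorship", author, article)

print(f"{authorship.name}: {authorship.source.name} -> {authorship.target.name}")
```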

Most of the code generation systems I have examined, such as the ones described by Ceri et al. (2000), Fratenali et al. (2000), Vigna (2002a), Vigna (2002b), Vigna (2002c), Vigna (2003), Whitehead et al. (2004) and others, use the ER model to conceptualize the application’s data model. The benefits of this approach are two-fold: first, this approach does not require any database-specific knowledge – the reification process (i.e. transforming a conceptual schema into a specific logical database schema) generates the database automatically; secondly, a system based on the ER model can interpret the data model in a deeper way, compared to a system where the data is specified as database tables, since it has the knowledge of the underlying conceptual model.
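Reification can be illustrated with a short Python sketch that turns a conceptual schema into SQL table definitions. This is a simplified assumption of how such a step might work, not the algorithm of any system cited above: every attribute is typed as TEXT, every relationship is treated as many-to-many, and the column names (including published_on for the article date) are hypothetical. The generated statements are run against an in-memory SQLite database to confirm that they are valid.

```python
import sqlite3

# Hypothetical conceptual schema: entities with attributes, and one relationship.
ENTITIES = {
    "author": ["first_name", "last_name"],
    "article": ["title", "body", "published_on"],
}
RELATIONSHIPS = {"authorship": ("author", "article")}


def entity_table(name, attributes):
    """Map an entity to a table with a surrogate key and one column per attribute."""
    columns = [f"{name}_id INTEGER PRIMARY KEY"]
    columns += [f"{a} TEXT" for a in attributes]
    return f"CREATE TABLE {name} ({', '.join(columns)});"


def relationship_table(name, left, right):
    """Map a relationship to a link table that references both entities."""
    return (
        f"CREATE TABLE {name} ("
        f"{left}_id INTEGER REFERENCES {left}({left}_id), "
        f"{right}_id INTEGER REFERENCES {right}({right}_id), "
        f"PRIMARY KEY ({left}_id, {right}_id));"
    )


def reify(entities, relationships):
    """Produce the full list of CREATE TABLE statements."""
    statements = [entity_table(n, attrs) for n, attrs in entities.items()]
    statements += [relationship_table(n, l, r) for n, (l, r) in relationships.items()]
    return statements


statements = reify(ENTITIES, RELATIONSHIPS)
connection = sqlite3.connect(":memory:")
for statement in statements:
    connection.execute(statement)
    print(statement)
```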

On the other hand, specifying an application in terms of entities and relationships adds a level of abstraction, which causes additional implementation complexity. Besides, specifying the database structure directly provides the developer with more fine-grained control over the implementation of the data model, which, in my opinion, is crucial in real world applications.

Once the data model is specified, the next step is to describe the data access functionality. There are two options: we can either describe the data functionality explicitly – i.e. specifying each data access method; or we can have the data access methods derived from the data model alone.

Defining Data Access Operations

An approach describing all the data access functionality is offered by Jacob et al. (2006a) and Jacob et al. (2006b). These two papers propose a code generation framework where describing the data model of the application is separated from describing the data-related functionality.

According to Jacob et al. (2006a), most of the currently available modeling environments for these kinds of applications enable the specification of only simple data access functionality. However, the authors argue, web applications not only provide large amounts of data to be displayed on the web, but also require powerful operations that determine the manner of content provision and allow data manipulation (p. 77). Indeed, features such as persistent shopping carts, reporting tools, or flexible data grids that enable the user to browse through thousands of records require more than the standard create/read/update/delete data access functionality. As Jacob et al. (2006a) correctly state, the required operations not only change simple content entities of the web application, but also add and modify relations between these entities. According to Jacob et al. (2006b), today's web applications have to provide at least the functionality to (1) add, alter and delete entities or relationships between them, and (2) filter and sort entities according to some criteria.

To solve this issue, the authors propose an Operation Model – a framework which enables the modeling of data access operations. Figure 1 displays a simplified example of this model.

Figure 1. Operation Model Example.

The example in Figure 1 defines the data access operations for the “user” entity, defined in the content model. The following data access operations are defined: “add,” “modify” and “delete.” Also, whenever a collection of user objects is retrieved, it will be sorted by the “name” attribute in ascending order, providing the option to filter the results by name.

I disagree with two aspects of this example. First of all, in my opinion, it is unnecessary to specify the obvious: every data object will require the add/modify/delete operations. Secondly, when displaying a collection of records, it is helpful to provide the user with the option to sort by multiple fields, in both ascending and descending order. Therefore, the “sortability” criterion could be attached to an attribute of an object in the data model. For example, if we have a “user” object with a set of attributes, some of them, such as “name,” could be marked as “sortable,” while others, such as “biography,” do not need to be sortable.
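Attaching sortability to attributes might look like the following Python sketch. It is an illustration under stated assumptions: the Attribute class, the extra "email" attribute, and the helper functions are hypothetical and are not taken from the Operation Model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """An attribute of a data object, optionally marked as sortable."""
    name: str
    sortable: bool = False


# "name" and "biography" come from the example above; "email" is hypothetical.
USER_ATTRIBUTES = [
    Attribute("name", sortable=True),
    Attribute("email", sortable=True),
    Attribute("biography"),
]


def sort_options(attributes):
    """List every (attribute, direction) pair the user may sort by."""
    return [(a.name, d) for a in attributes if a.sortable for d in ("ASC", "DESC")]


def order_by_clause(attributes, requested):
    """Build an ORDER BY clause, rejecting attributes not marked sortable."""
    allowed = set(sort_options(attributes))
    rejected = [pair for pair in requested if pair not in allowed]
    if rejected:
        raise ValueError(f"not sortable: {rejected}")
    return "ORDER BY " + ", ".join(f"{name} {direction}" for name, direction in requested)


print(order_by_clause(USER_ATTRIBUTES, [("name", "ASC"), ("email", "DESC")]))
```

Because the flag lives on the attribute, sorting by several fields in either direction follows from the data model without listing each sort operation separately.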

Overall, the operations model is a good example of specifying data access functionality; I found the syntax to be intuitive and parts of the authors’ strategy to be applicable to my own code generation framework.

Deriving Data Access Operations

Another approach is to specify only the data model of an application and generate the data access code based on the data model alone. This approach is based on the assumption that the data model itself is sufficient for defining the necessary data access functionality. In my opinion, the following functionality can be derived from the data model for each object: (1) adding, modifying, reading and deleting a record, and (2) reading a collection of records based on some criteria – with a record representing an entity or a relationship.
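As a rough illustration of what deriving such functionality could mean, the Python sketch below produces parameterized SQL statements from nothing more than a table name, a key, and a list of columns. The function names and the flat description of the data model are assumptions; equality-only filtering is a deliberate simplification.

```python
def derive_operations(table, key, columns):
    """Derive parameterized SQL for adding, reading, modifying and deleting a record."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    assignments = ", ".join(f"{c} = ?" for c in columns)
    return {
        "add": f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})",
        "read": f"SELECT {key}, {column_list} FROM {table} WHERE {key} = ?",
        "modify": f"UPDATE {table} SET {assignments} WHERE {key} = ?",
        "delete": f"DELETE FROM {table} WHERE {key} = ?",
    }


def derive_collection_query(table, key, columns, criteria=()):
    """Derive a query for a collection of records filtered by column equality."""
    unknown = [c for c in criteria if c not in columns]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    query = f"SELECT {key}, {', '.join(columns)} FROM {table}"
    if criteria:
        query += " WHERE " + " AND ".join(f"{c} = ?" for c in criteria)
    return query


# Hypothetical table description, for illustration only.
operations = derive_operations("author", "author_id", ["first_name", "last_name"])
for name, sql in operations.items():
    print(f"{name}: {sql}")
print(derive_collection_query("author", "author_id", ["first_name", "last_name"], ["last_name"]))
```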

However, existing code generation systems which use this approach generate only the most basic data access methods. For example, the code generation system described by Milosavljevic et al. (2002) offers only very basic functionality: in addition to the standard create/read/update/delete methods, it offers methods like Get[object]Count, Insert[object, index] and checkConstraints. This situation is easily explained by the potential complexity of the data access functionality of even a trivial data model. Consider the following issues:

  1. When a record is created or modified, some attributes are supplied by the user, while others, such as the record identity, a date and time of the record’s creation and/or modification, any attributes derived from other attributes, are generated by the system.
  2. When a record is deleted, in some cases other records, known as “weak entities” in the ER model, should be automatically deleted with it, for they cannot exist without their parent record (for example, orders and suborders, or books and editions). In many cases, the choice depends on numerous criteria determined by the application’s business logic.
  3. Retrieving a single record is relatively easy; however, retrieving a collection of records may involve selecting what fields to retrieve.
  4. Retrieving a collection of records can be done based on some criteria; deriving these criteria from a data model without explicit specifications is not a trivial task.

These issues raise an important question to consider: what should a data model specify besides the structure of the data (i.e., entities and their attributes)? How detailed should a data model be to provide enough information for the system to be able to automatically derive the necessary data access functionality?
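One way to picture a more detailed data model is to annotate it with the facts behind the first two issues listed above. The Python sketch below is purely hypothetical: the system_generated flag, the owner field, and the helper functions are assumptions used to show what kind of information a generator would need, using the books and editions example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attribute:
    name: str
    system_generated: bool = False  # issue 1: value is produced by the system


@dataclass
class Entity:
    name: str
    attributes: List[Attribute]
    owner: Optional[str] = None  # issue 2: parent entity of a weak entity


def user_supplied(entity):
    """Attributes that an add or modify operation must accept from the user."""
    return [a.name for a in entity.attributes if not a.system_generated]


def cascade_targets(entities, deleted_entity):
    """Weak entities whose records cannot outlive a deleted parent record."""
    return [e.name for e in entities if e.owner == deleted_entity]


book = Entity("book", [Attribute("book_id", True), Attribute("title"), Attribute("created_at", True)])
edition = Entity("edition", [Attribute("edition_id", True), Attribute("year")], owner="book")

print(user_supplied(book))
print(cascade_targets([book, edition], "book"))
```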

Extending the Data Model with Data Access Details

Much of the data access functionality depends on the application’s business logic. Therefore, there are numerous tradeoffs in specifying the details of this functionality: specifying too much will tie the data model to the application’s business logic and presentation layer, whereas specifying too little may oversimplify it.

One example of such a tradeoff is the choice of a data type system. Turau (2002) proposes a system which provides only one data type: a string, with all the conversions delegated to the business logic layer. I disagree with this approach, for it prevents data type-specific requirements from being specified in the data model.

Another example of such a tradeoff is specifying data validation requirements in the data model. Turau (2002) defines validation requirements for each attribute, including ranges of accepted values and multiple validation rules with basic Boolean logic. I disagree with this approach. The data model can and should define rules for structural integrity of the data, such as foreign key and unique value constraints, as well as attribute data types. However, data entry validation rules are based, to a great extent, on the application’s business requirements and, therefore, belong in the business logic layer. It is conceivable that these requirements may change over time (like the set of accepted payment methods in an e-commerce application), in which case hard-coding them into the data model will render the application too rigid.
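The division of responsibilities argued for here can be sketched in Python with an in-memory SQLite database. The schema, table names, and the set of accepted payment methods are hypothetical; the point is only that structural rules live in the data model while the changeable business rule lives in ordinary application code.

```python
import sqlite3

# Structural integrity (types, uniqueness, foreign keys) belongs to the data model.
SCHEMA = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE payment (
    payment_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    method TEXT NOT NULL
);
"""

# Business rule kept outside the data model so that it can change over time.
ACCEPTED_PAYMENT_METHODS = {"card", "bank_transfer"}


def record_payment(connection, customer_id, method):
    """Validate the business rule, then rely on the schema for structural checks."""
    if method not in ACCEPTED_PAYMENT_METHODS:
        raise ValueError(f"payment method not accepted: {method}")
    connection.execute(
        "INSERT INTO payment (customer_id, method) VALUES (?, ?)",
        (customer_id, method),
    )


connection = sqlite3.connect(":memory:")
connection.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
connection.executescript(SCHEMA)
connection.execute("INSERT INTO customer (email) VALUES (?)", ("reader@example.com",))
record_payment(connection, 1, "card")
```

Accepting a new payment method then means changing one set in the business logic layer, with no change to the data model.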

However, one of the main considerations in specifying the data model is choosing a way to describe the requirements for displaying data in the presentation layer. The most comprehensive solution is offered by the WebML language (Ceri et al., 2000) in the form of its composition model. This model specifies content units which make up the application’s web pages. This model is coupled with the presentation layer, which is an approach I am trying to avoid; however, some of those ideas, if decoupled from the presentation layer, are quite interesting.

The composition model consists of several content units, the most relevant of which are demonstrated in Figure 2.

Figure 2. WebML Composition Model Syntax.

A Data Unit shows information about a single object: an instance of an entity. It is defined by an entity and a selection of attributes which are included in the unit. A Multi-Data Unit displays a collection of data units, by repeating a single data unit. An Index Unit presents multiple instances of an entity as a list. A Scroller Unit provides commands to scroll through objects in a list and is used in conjunction with a data unit. A Filter Unit provides a search feature which enables the user to search (i.e. filter) through the data units displayed as a list and, thus, is used in conjunction with a Multi-Data Unit or an Index Unit.

In my opinion, parts of this model could be improved. For example, the Index Unit should specify the displayed attribute: an index displaying a selectable list of users might use the “last name” attribute as the one to be displayed in the list, but the “user Id” as the key for selected records. Another example is the Scroller Unit, which, in my opinion, is not necessary at all: every entity should come with this functionality. Besides, a scroller (or a pager) belongs in the presentation layer and should be attached by default to any collection of data objects. Finally, a Filter Unit belongs in the presentation layer, with its functionality derivable from the data model.
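The suggested Index Unit improvement can be illustrated with a short Python sketch. This is not WebML syntax: the IndexUnit class, its field names, and the sample records are hypothetical and serve only to show a key attribute and a display attribute specified separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexUnit:
    """A list of entity instances with separate key and display attributes."""
    entity: str
    key_attribute: str
    display_attribute: str


def index_entries(unit, records):
    """Return (key, label) pairs for a selectable list of records."""
    return [(r[unit.key_attribute], r[unit.display_attribute]) for r in records]


# Hypothetical user records, for illustration only.
users = [
    {"user_id": 1, "last_name": "Chen"},
    {"user_id": 2, "last_name": "Turau"},
]
user_index = IndexUnit("user", key_attribute="user_id", display_attribute="last_name")

print(index_entries(user_index, users))
```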

Regardless of the mentioned issues, the WebML approach presents a detailed and clean way to specify how data should be displayed on web pages. It offers the ability to specify the fine-grained details of data retrieval operations and, if decoupled from the presentation layer, might have been a viable option for this study’s code generator. Nevertheless, I believe that specifying the data access requirements in the data model as additional attributes should be sufficient for deriving the required data access functionality – which I intend to demonstrate in the experimental part of this thesis.

Summary of Background Material and Related Work

In this chapter, I described data-intensive web-based applications as the subject of the thesis and defined them as systems that require comprehensive data access functionality for providing web-based “read” or “read/write” access to data stored in a data repository, such as a database. I showed that automatic code generation is a solution to numerous implementation issues, which arise as a result of recurring functionality requirements in such applications.

I explained that only part of the application can be automatically generated. To identify this part, I used the Model-View-Controller design pattern and analyzed this type of application by decomposing it into three conceptual layers: data, business logic and presentation. After examining existing code generation systems, current research and the code of existing data-intensive web-based applications, I concluded that the data layer is the optimal part of the application for automatic code generation.

After reviewing the concept of data modeling, I concluded that there are two general approaches to describing data access operations: explicit and implicit. The explicit approach is based on describing every data access method individually. The implicit approach is based on deriving the required data access methods from the data model alone – which is the core of my research hypothesis.

 