1. Introduction

Distributed data depositories are becoming more common on the Internet. The amount, type, and the format of the data stored in such depositories are also getting larger and more varied. Hence, information retrieval from networked information environments is becoming a more complex and important challenge.

Users of search engines (e.g., Yahoo, Alta-Vista, and Info-Seek) on the Internet are faced with new problems in attempting to benefit from new technologies. Many of these users are suffering from the confusion caused by these engines in the search process.

Most of the search engines on the Internet facilitate keyword search options over the titles of homepages in the network. In many of the cases, keywords are entered using text entry fields and then a "submit" button is used to start the search process. The submission operation is followed by a display of results that may or may not contain a useful set of hits corresponding to the query. Most of the search attempts produce a huge (mega) set of irrelevant hits. Sometimes zero hits are received. Possible causes of zero hit queries are spelling errors or lack of data. There is no systematic way for users to know which situation will result prior to launching time consuming queries, if they are dissatisfied with the results, there is no way to get a clear view of which alternative queries might be possible.

Research on information retrieval over networks is gaining momentum as users are frustrated with the unsuccessful queries submitted to existing search engines. Recently, the Human-Computer Interaction Laboratory at the University of Maryland proposed a new method for more efficient querying over the network, namely Query Previewing. The concept of Query Previewing takes advantage of the fact that most queries are intended to retrieve a relatively small portion of the database. When this assumption is valid, most of the data can easily be pruned even before the completion of the query. This will decrease the network workload and increase the possibility of returning highly relevant hits to the queries that are being formulated.

Implementation of a Query Preview Search Engine for the Global Change Master Directory (GCMD) of NASA was recently finished suggesting directions for future research on this topic. The implementations used showed that although Query Previewing is a powerful idea for search purposes, not every data type can be trivially integrated into this new approach.

This report analyzes these data type related problems and explores improvements to the existing implementations. First, in this section, we present Query Previewing ideas in more detail. Later on, we focus on problems with the current implementations and propose solutions. The second section presents the solutions in more detail along with comparisons of these solutions. In the third section, the solutions are analyzed empirically and the last section states our conclusions.

1.1. Background on Query Previews

Dynamic queries introduced by [1,7,10] and studied by [4,5] solve many of the problems presented by queries performed in networked information environments. Dynamic queries provide a visual representation of query formulation tools and results. They enable a rapid, incremental, and reversible control of the query. They also give continuous immediate feedback to the user for guidance in query formulation. Figure 1 shows an example dynamic query interface (DQI). A similar solution for user interfaces to databases was proposed previously by [9].

Figure 1: A sample DQI, Spotfire, courtesy of IVEE, AB. Demonstrations are available from online site www.ivee.com. The widgets on the left, bottom, and right are for defining a query. The large square area in the middle is called the starfield display for showing the set of hits to a specific query. The number of hits is displayed at the bottom part of the interface.

The application of dynamic querying ideas to networked querying environments appears to be useful. However, high system-resource needs make dynamic querying less applicable, at least at first glance, to huge networked information depositories. A solution to this problem might be found in the division of the bigger problem into smaller problems. This idea resulted in the design of Query Previews [2,3]. The goal is to give an overview of the database to the user before the details are presented.

A good preview should enable users to see sufficient detail about the database in order to better understand the data distribution and then make an informed query. Furthermore, the querying process should be divided into steps to reduce the amount of resources needed to form the queries that are submitted. So a multi-phase incremental querying process will hopefully lead to a set of desired results using less resources and time. In a two-phase approach the designer chooses a few of the most discriminating attributes of the database as the first phase (possibly those that form a primary key for that relation of the database). Then, by using the direct manipulation approach, a user interface should be designed using this small number of discriminating attributes. The rest of the attributes should be kept for a second phase that will also include these discriminating attributes.

The first phase when concatenated with the second one should form an interface for querying on the related database. When the querying environment is activated the first interface should show up immediately. The user should make some decisions using this first interface (Query Preview Panel) and then move to the second one (Query Refinement Panel). The second interface should be used to complete the query.

The Query Preview Panel is a powerful tool to define rough ranges on the data set that is being manipulated. The reason for this is it contains the most discriminating attributes in the database so that any range constraint will lead to a small subset of the database. It also does not consume substantial system resources because only a small number of attributes from the database are used at this phase.

To guide users in the query formulation process the preview panel presents aggregate information about the database. Benefits of a similar approach were also studied by [6]. The data distribution on an attribute of the data set is shown before the actual query formulation. The distribution of data over an attribute can be shown as a pie or a bar chart. When users make a selection on any of the attributes of the preview panel the rest of the user interface should be updated accordingly. This is called tight coupling. Therefore, for each action the users take, feedback is given. As the users see the possible size of their query before refining the rough ranges, there is little chance that they will get zero or mega hits to the final query itself. The system load will drop drastically for downloading the necessary hit set due to the limited size of hits to the rough query. Perhaps the most critical advantage of the previewing idea is that the designer of the system needs only the aggregate information about the system at this phase. So whatever the size of the database is we only need the distribution information of the data to form a preview panel (this again decreases the system resource needs). An example Query Preview Panel is given in Figure 2.

 

Figure 2: An example Query Preview Panel developed at the Human-Computer Interaction Laboratory at the University of Maryland. Year and location on earth are examples of discriminating attributes for many scientific data sets obtained from the NASA archives. The bars show the distribution of data.

1.2. Problems with the Current Implementations

The aggregate information needed by the Query Preview Panel is the histogram of data distributions in the database that is being queried. For example, the location, the year, and the parameter attributes of the database were used in Figure 2. The year attribute shows the distribution of data over years, the location attribute shows the distribution of the same data set over continents, and the parameter attribute shows the distribution over some possible parameters.

The internal representation of this histogram can easily be done with an n-dimensional array of integers where n is equal to the number of attributes used in the Query Preview Panel. Each dimension of this array will have the granularity of the related attribute of the Query Preview Panel. For example, the three dimensional array (cube) for the Query Preview Panel of Figure 2 has a granularity of 10 for years, a granularity of 16 for locations, and a granularity of 12 for parameters. Therefore our cube contains 1840 integers in total to show the data distribution over the Query Preview attributes. These integers are calculated using some batch processes that are triggered by the database system administrator (can be run as over night off-line processes).

In most of the Query Preview implementations the size of this array which contains the necessary aggregate information is small with respect to the data set size. For example, the preview in Figure 2 can be represented with only 1840 integers and in most of the current systems this is less than 10 Kbytes of information. The Internet is fast enough to accommodate the transfer of 10 Kbytes very easily over large distances and even in high levels of congestion (especially with respect to the size of the database where in the case of NASA this is generally more than a terabyte of information). Also, most of the data structures presented by [8] use a similar approach to represent the data internally for DQIs.

This internal representation is powerful with many of the data types. On the other hand, some of the data types can not be handled easily with the n-dimensional array approach. For example, the data set used for the application in Figure 2 might be a multi-valued data set (i.e., any record in data may cover more than one year, location, or parameter, e.g., a data set in North America could be taken for both of the years, 1990 and 1991). Then this will end up with the duplication of the same record of the database in our cube (once for each year it covers). Hence, the distribution represented with this cube no longer contains the same information presented with the actual data set itself. The problem is more dramatic if ranges of values are used in a database (e.g., years 1983 to 1991). This will result in large numbers of duplications in the counts of the cube that will be used by the Query Preview Panel.

To accommodate multi-valued and range data types, the n-dimensional array approach must be extended to allow the implementation of Query Preview ideas. With this paper we present two possible solutions to this problem and compare them both theoretically and empirically. The following section gives the general ideas behind each solution.

1.3. The Two New Solutions: Post and Pre-Fetch

The n-dimensional array of integers showing the distribution of data does not contain the duplication information that is necessary for multi-valued attributes. A possible approach might be to calculate the duplication information (again off-line) and send it over the network. But this might increase the array size dramatically. Both of our approaches use the fact that we do not want to send huge chunks of data over the network prior to Query Previewing or make heavy weight calculations during the query formulation process. Hence, we define two new client-server based approaches for this problem. Both of the approaches use the paradigm of pre-calculating the answers to possible queries of a Query Preview Panel. Hence, no online calculations will be necessary during the querying process. As we know the questions to be asked on a panel of finite choices, we can calculate the answers to these queries off-line and store them in our secondary storage until the actual query is made. As the size of these pre-calculated answers can again be huge depending on the granularity of the attributes of the Query Preview Panel we propose to store them on the server side and then send the related answers for a query upon request. An appropriate name for this approach is "post-fetch", as we will not download anything from the network unless it is really necessary. Actual data transfer will only be done when a query, either in the form of a preview or final query, is formed. The selections made will be sent to the server from the client side and the answers will be sent to the client from the server side. Both of these transfers will not exceed 100 bytes due to the fact that the communicating sides exchange only a few integers (the questions are basically the selections made by the user and the answers are the new settings of the bars).

The size of the possible set of pre-calculated answers would be in the order of megabytes because the amount of information stored on the server is exponentially proportional to the granularity of the attributes. Luckily, we only have granularities of ten or less (in general) for an attribute. An immediate extension of this approach can be the case where the same pre-calculated counts are stored on the client side. Hence, the transfer occurs only before the actual preview program starts. This approach is called the "pre-fetch" approach. The size of the pre-calculated counts for all currently provided queries are too large for direct site to site communication so some restrictions should be brought to the preview program before enabling such applications. Most of the preview panels can easily be adjusted to restrict the number of choices for a single attribute. This will end up with reduced number of pre-calculated answers and hence we will have a smaller pre-calculated count file to be transferred over the network. This makes the Query Preview Panel more restrictive while making the application more efficient. The following sections will analyze these approaches in more detail, give the exact client-server architectures, and then compare them in terms of the results obtained from the executions of the test programs (run time and space utilization).