OCSB: On-line Citation Searching and Browsing

Michael Reed
Theen-Theen Tan
James Wanken

CMSC 828S/838S

12/08/97




Introduction and Motivation

The project discussed in this paper was motivated by the large collection of bibliographic information on computer security assembled by the IEEE Technical Committee on Security and Privacy. Although the committee had accumulated this vast collection, it was rarely used because it was stored as a plain text file. The file resembles an entire card catalog printed card after card on one long sheet of paper. There is clearly some order to the file, but attempting to extract any meaningful information from it is troublesome at best. In this form, only a very limited number of searches could be performed: one had to open the file in a text editor and search for a specific keyword, author's name, or conference proceeding. Neither the attribute values available in the file nor the number of entries matching a specific value could be easily found. Thus, any processing done on this file was performed in an ad hoc manner and very rarely produced useful results. This project involves creating a visual interface that will be useful to the committee as a querying and browsing tool.

Our prime motivation for visualizing this data was to create a tool that would be useful to the IEEE committee. Thus, the tool was constructed to allow various queries on the database, to find related papers, and to provide feedback that helps users refine and prune their queries. The tool focuses heavily on the kinds of actions that are useful in document searches. It allows relationships between authors, papers that appear in the same conference, or papers that share keywords to be explored. The tool updates the visualization after each selection, giving the user instant feedback for further refining the query or exploring the document corpus. Exploring complex relationships in the corpus is thus as simple as selecting a few criteria from the display. Even for simple tasks such as these, the tool is a tremendous improvement over what was previously possible. Furthermore, by browsing the appropriate scroll lists, the possible values for any given field can easily be found. This is a great help, as it eliminates queries that return an empty result set because a value specified in the query does not appear in the data file.


Discussion of Previous Work

Current document retrieval presentations can be separated into two main categories. The first category, keyword searches of text, focuses on the retrieval of information from a wide set of documents. Keywords or phrases are entered, and the system finds and displays the relevant documents through a word-based search. Keyword searches have the advantage that they process the entire text and are not limited to the information in a reference citation or a pre-determined set of attributes about the document. Many of these systems attempt to map the text content to a spatial representation that preserves the information characteristics of the document. The second category, search by attributes, focuses on discrete attributes of the documents. These attributes are cataloged, searched, and displayed by the system, but the actual content of the document is not processed. The prime example of this category is the familiar on-line card catalog, VICTOR. VICTOR allows queries by subject, title, and author, but does not examine the contents of a document to find matches to the query. Very few implementations fall under the second category, and hence we focus on the first category in the following discussion.

MVAB, the Multidimensional Visualization and Advanced Browsing project [Wise95], performs word counts on a document's text to create a feature set. Higher order statistics on words and strings are also collected. MVAB offers two distinct visualizations of the results. In the Galaxies visualization, the documents are shown as points in a 2D scatter-plot. In the ThemeScapes visualization, the documents are displayed as 3D landscapes of information constructed from the document corpus. Elevation depicts theme strength, while valleys, peaks, cliffs, and ranges represent detailed interrelationships among the documents. One major drawback of these visualizations, though, is that neither gives any indication of what the axes in the Galaxies or ThemeScapes displays represent. Thus, the user is forced to comprehend complex interrelationships between documents without any clear idea of how the documents are related in the display. A DoD briefing is available on this project. This type of visualization is particularly ill-suited to the reference searching tasks that we would like to perform. MVAB allows only a single view of the document corpus, and manipulation of this visualization is limited. In our tasks we would like to find a particular set of references that fulfill some criteria, and we are less interested in overall relationships in the corpus.

VIBE, Visualization By Example [Olsen93], and a related system, GUIDO [Nuchprayoon94], have the added benefit that the user defines particular points of interest (POIs) to construct the visualizations. Each POI is defined by a number of keywords and a display position chosen by the user. The position of a document in the visualization represents the relationship of the document to the POIs. Unfortunately, documents that have exactly the same relationship to all the POIs are stacked on top of each other in the display, and the user must click repeatedly to rotate through them. Robert Korfhage (under the link Research interests, Information Retrieval) has been involved in other document retrieval systems such as BIRD [Kim94], which provides an interface to divide and merge lists of documents based on their contents. The main drawback of this approach for our problem is that the method is only well defined for keywords. We are also interested in the authors of the papers, the years in which they were published, and the conference or journal in which each paper appeared.

Others have used a multi-dimensional display to visualize documents. David Ebert, who demonstrated SFA [Ebert 1996] to our class, constructed a minimally immersive interactive volumetric visualization. The X, Y, and Z axes represent different themes and a document's position is determined by its n-gram scores on the themes. Additional document features can be encoded with color, glyph topology, glyph transparency, and size.

One project that is similar to the theme of this report is Envision [Heath94], which provides full-text searching and content retrieval capabilities. Envision can also perform queries on specific document attributes, such as the authors and the words in the title. The results of the queries are displayed in a matrix of icons. The axes can be manipulated by the user to display the year, type, size, author, index terms, and relevance factor. All citations and full-text entries are stored as SGML files, which can be converted to HTML documents for retrieval over the web. The Envision matrix display closely resembles a starfield display constructed with discrete variables.

One effective tool using only search by attributes is the Library of Congress collection browser developed at the University of Maryland HCIL [Plaisant97]. The collection browser allows queries by collection topic, format, and date. A graphic display shows each collection with respect to the interval of time that it covers. The filters and display are tightly coupled, and a change or selection in one rapidly updates all of the other lists and the display. At any time, a single collection can be selected for viewing. We drew many useful ideas from this project and incorporated them into our final product.

The list of document retrieval systems continues with the visualization of search engine results in a 3D display [Mukherjea96], Tilebars [Hearst95] which offers a unique way to visualize the occurrence of words in a document, the HCIL Paper Search, and the numerous visualizations offered by the Information Visualizer [Robertson93].


Original Data File

The original data was stored in a plain text file and contained information on approximately 4,000 papers related to computer security issues. Each paper was represented as a reference of the kind one would find in a bibliography. An entry usually listed the authors, title, source information, abstract, and a set of keywords. One of the main problems with this file was that a single format was not strictly adhered to, and the information was often arranged in a different order. Not every type of information was present for every paper. Since the references were collected over a long period of time, many different abbreviations and styles were used. Thus, not only was the layout of the entries inconsistent, but the content itself, such as abbreviations and naming conventions, was not standardized throughout the file.


Data File Format for Visualization

The original data file received from the IEEE Technical Committee on Security and Privacy was judged unsuitable for automatic processing. Thus, the data file had to be converted to a form that allowed easy processing, adhered to a well-defined format, and preserved all of the information in the original data file. Each reference was broken down into the following fields: author, title, source, year, abstract number, keyword, and the full citation as found in the original data file (minus the text of the abstract). Each of these fields is represented by a special tag in the new data file. An entry in the new file is not required to contain all of the tags, and it may contain as many instances of each tag as are necessary to represent the citation. The text of the abstracts was stored in a separate file to facilitate cgi-bin processing. Each abstract was given a unique id number, which the new data file uses in the abstract number field to reference the text of the abstract.
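The report does not give the concrete tag syntax, so the following Python sketch only illustrates the idea of a tagged record format. It assumes a hypothetical layout of one `tag: value` pair per line with a blank line between records; the tag names and the sample entries are invented for illustration and are not taken from the project's data file or its Java source.

```python
from collections import defaultdict

# Hypothetical tag names: the report lists the fields, not the tag spelling.
FIELDS = {"author", "title", "source", "year", "abstract", "keyword", "citation"}


def parse_records(text):
    """Parse blank-line-separated records made of 'tag: value' lines.

    Every tag is optional and may repeat, so each record maps a tag
    to a list of values.
    """
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the record being built, if any.
            if current:
                records.append(dict(current))
                current = defaultdict(list)
            continue
        tag, sep, value = line.partition(":")
        tag = tag.strip().lower()
        if not sep or tag not in FIELDS:
            raise ValueError(f"unrecognized line: {line!r}")
        current[tag].append(value.strip())
    if current:
        records.append(dict(current))
    return records


# Invented sample data in the assumed layout.
SAMPLE = """\
author: A. Author
author: B. Author
title: An Example Title
year: 1990
keyword: verification
abstract: 17

author: C. Author
title: Another Example
keyword: access control
"""

records = parse_records(SAMPLE)
```

Keeping every field as a list, even when it holds one value, means that repeated tags (several authors or keywords) and missing tags need no special handling downstream.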

The original data file was not completely converted to the new data file format because of its large size. A portion of the original data file was converted to show the benefits and applicability of our ideas to this information domain. Approximately 20% of the original data file was converted into the new format. We estimate that an additional 75 hours of work would be needed to finish the conversion. Although this process is not very difficult, it is very tedious and time consuming. Finally, because of the lack of structure in the original file, this conversion process cannot be easily automated.


The Final Product

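The citation browser presents the user with a number of list widgets to form the citation query. Each list widget lets you search by one citation attribute; the sample queries later in this report use the author, keyword, and year lists.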

The actual query (displayed in the Full Citation window) is computed by first generating the logical OR of all selected items in a list widget and then taking the logical AND of all list widgets. The abstract is pulled up automatically if the query results in only one citation, or if the user selects one of the citations in the Full Citation window.
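The query rule above (OR within a list, AND across lists) can be sketched in a few lines of Python. This is an illustration, not the applet's Java code; the data layout, the function names, and the sample citations are assumptions. The sketch also assumes that a list with nothing selected places no constraint on the query, and that an entirely empty selection matches nothing.

```python
def matches(citation, selections):
    """Return True if the citation satisfies every list that has a selection.

    Within one list, any selected value may match (logical OR);
    across lists, all constraints must hold (logical AND).
    """
    for field, chosen in selections.items():
        if not chosen:
            continue  # assumption: an untouched list does not constrain the query
        if not (citation.get(field, set()) & chosen):
            return False
    return True


def run_query(citations, selections):
    """Return the citations shown in the Full Citation window."""
    if not any(selections.values()):
        return []  # assumption: with nothing selected, nothing is matched
    return [c for c in citations if matches(c, selections)]


# Invented sample data: each field holds a set of values.
CITATIONS = [
    {"id": 1, "author": {"Author A"}, "keyword": {"verification"}, "year": {"1990"}},
    {"id": 2, "author": {"Author B"}, "keyword": {"verification", "access control"}, "year": {"1991"}},
    {"id": 3, "author": {"Author A"}, "keyword": {"risk assessment"}, "year": {"1991"}},
    {"id": 4, "author": {"Author C"}, "keyword": {"verification"}, "year": {"1988"}},
]

# 'verification' AND (1990 OR 1991); the author list is left untouched.
selections = {"keyword": {"verification"}, "year": {"1990", "1991"}, "author": set()}
result_ids = [c["id"] for c in run_query(CITATIONS, selections)]
```

Representing each field as a set makes the OR step a single set intersection, which keeps the per-citation test cheap enough to rerun after every selection.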

The list widgets make use of a number of visual feedback mechanisms. By default all items are initially "dimmed" (red portion of the distribution bars). As the user selects items in the list, items will become "undimmed" (blue portion of the distribution bars) in all of the lists. The undimmed items match those citations shown in the Full Citation window. Next to the scrollbars an overview of the list shows where undimmed items appear in the list. At the bottom of the list widgets are three buttons: Set, Sort, and Clear. Set and Clear can be used to select all or none of the items in that list. The Sort button will reorder the list bringing the selected and undimmed entries to the top of the display allowing for easier browsing.
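A minimal Python sketch of the dimming and Sort behavior described above follows. It is not the applet's code: the function names and data are hypothetical, and the ordering of selected items ahead of merely undimmed ones is an assumption, since the report says only that both groups move to the top.

```python
def undimmed_values(matching_citations, field):
    """Collect the values of one field that occur in the matching citations."""
    values = set()
    for citation in matching_citations:
        values |= citation.get(field, set())
    return values


def sort_list(items, selected, undimmed):
    """Order a list as the Sort button might: selected, then undimmed, then dimmed.

    Items keep alphabetical order within each group.
    """
    def rank(item):
        if item in selected:
            return 0
        if item in undimmed:
            return 1
        return 2

    return sorted(items, key=lambda item: (rank(item), item.lower()))


# Invented example: two citations match the current query.
matching = [
    {"author": {"Author A"}, "keyword": {"access control", "risk assessment"}},
    {"author": {"Author A"}, "keyword": {"access control"}},
]
keyword_list = ["access control", "audit", "risk assessment", "verification"]

undimmed = undimmed_values(matching, "keyword")
sorted_keywords = sort_list(keyword_list, selected={"risk assessment"}, undimmed=undimmed)
```

With no selections and nothing undimmed, every item falls into the last group, so the same function returns the plain alphabetical list.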

The abstract can be cut-and-pasted to other applications on the user's computer. Future work will make it possible to cut-and-paste the full citation as well.

In addition to the main applet, we have added a histogram window which shows a graphical representation of the distribution of your query over the entire database. This window, like all other widgets in the display, is updated every time the query is refined.
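One way to compute such a distribution is to count, for each year, how many citations match the query and how many exist in total. The Python sketch below assumes the histogram is binned by year, which the report does not state; the function names, the text rendering, and the data are illustrative only.

```python
from collections import Counter


def year_distribution(all_years, matching_years):
    """Return (year, matched, total) triples in chronological order."""
    total = Counter(all_years)
    matched = Counter(matching_years)
    return [(year, matched[year], total[year]) for year in sorted(total)]


def render(distribution, width=20):
    """Draw one text bar per year: '#' for matching entries, '.' for the rest."""
    largest = max(total for _, _, total in distribution)
    lines = []
    for year, matched, total in distribution:
        bar_len = round(width * total / largest)
        fill = round(bar_len * matched / total)
        lines.append(f"{year} {'#' * fill}{'.' * (bar_len - fill)} {matched}/{total}")
    return lines


# Invented data: publication years of all entries and of the query results.
all_years = [1988, 1989, 1990, 1990, 1991, 1991, 1991, 1991]
matching_years = [1990, 1991, 1991]

distribution = year_distribution(all_years, matching_years)
for line in render(distribution):
    print(line)
```

Recomputing the two counters after each refinement is linear in the number of citations, which matters for keeping the display responsive.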

You can try the citation browser on-line only if you have a JDK 1.1.4 compliant Java browser. IE4.0 does a pretty good job, but for the best results use Sun's appletviewer. NOTE: No current version of Netscape is JDK 1.1.4 compliant, and as such the applet will not run in it. To start the citation browser, click here.

Source Code


Sample Queries

The first sample query shows a well-defined search query. The second query shows how users can use the features of the interface to help them browse the database.

Sample Query 1: Find all entries on 'verification' published in 1990 and beyond

Users see that the highlighted entries are the values they searched on. The distribution bar on the year panel shows that one of the results was published in 1990 and two were published in 1991. The full citations of the results are grouped at the top of the Full Citation window. Users can click on one of the entries to read the abstract of that article.

Figure 1: Sample Query #1

Sample Query 2: Browse the publications by Marshall Abrams.

The bars next to the scrollbars indicate the positions of the related values in an attribute list. Users see from the position bar that besides 'Access Control' there are other keywords related to this author. They can scroll the list to browse through the undimmed values (black), or they can invoke sort to percolate the relevant entries to the top and further refine their queries. In this scenario, the user is interested in Abrams' publications related to 'Risk Assessment'.

Figure 2: Sample Query #2


Discussion

We tried a number of different screen layouts and eventually evolved to what you see today. The suggestion of making the year list non-scrolling came from our design review with Dr. Shneiderman. Likewise, the list selection indicators next to the scrollbars came out of that review.

During the development of our citation browser, we tried a number of different ways to interact with the user and visualize the lists of information. One of our early ideas was to sort the lists automatically as a user selected items. We tried sorting all lists every time a selection took place, sorting only the list in which the selection took place, and sorting all lists except the list in which the selection took place. Everyone who viewed these versions agreed that automatic sorting was disorienting and that a user-triggered sort button would be better.

We are fond of the idea of grouping the pruned list so that users get a compact view of the pruned values without having to scroll through the list. This grouping facilitates information gathering as well as further selection on the pruned list. However, sorting can cause disorientation if the next query is unrelated to the query in which sorting was invoked. The entries are no longer in alphabetical order once sort is invoked, so users have to keep track of (or search to find out) whether an attribute value has been percolated to the top or is still in its alphabetical position in the list. The way to deal with this in the current interface is to clear all selections related to the previous query and invoke a sort to return to the alphabetically sorted list. One improvement would be to provide a function that clears all selections in all attributes. Another issue is that feedback slows down as the number of entries grows.


Possible Extensions

Other visualizations:

The advantage of scatter-plots over scrolling lists is that they eliminate some scrolling and offer a compact overview. We could have included scatter-plots, for example showing years vs. authors or keywords, and used the size of the points to encode the total number of records available in the database. These visualizations would help users with tasks such as identifying trends and outliers in a compact overview. Examples of trends to be identified are the development of topics in the field and the research interests of an author through time. An example of identifying outliers is finding the author with the most publications on a particular topic.
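The data behind such a scatter-plot is a simple aggregation: one point per pair of values, sized by the number of records that share the pair. The Python sketch below is a hypothetical illustration of that aggregation and of the outlier example; this extension was not implemented in the prototype, and the function names and sample records are invented.

```python
from collections import Counter


def scatter_points(citations, x_field, y_field):
    """Count the records for every (x, y) pair; the count would set the point size."""
    counts = Counter()
    for citation in citations:
        for x in citation.get(x_field, ()):
            for y in citation.get(y_field, ()):
                counts[(x, y)] += 1
    return counts


def top_author(citations, keyword):
    """Return (author, count) for the author with the most records on a keyword."""
    counts = Counter()
    for citation in citations:
        if keyword in citation.get("keyword", ()):
            counts.update(citation.get("author", ()))
    return counts.most_common(1)[0] if counts else None


# Invented sample records.
CITATIONS = [
    {"author": ["Author A"], "year": [1990], "keyword": ["verification"]},
    {"author": ["Author A", "Author B"], "year": [1990], "keyword": ["verification"]},
    {"author": ["Author B"], "year": [1991], "keyword": ["access control"]},
    {"author": ["Author A"], "year": [1991], "keyword": ["verification"]},
]

points = scatter_points(CITATIONS, "year", "author")
leader = top_author(CITATIONS, "verification")
```

A paper with several authors contributes one count to each of its (year, author) points, so point sizes reflect records per author rather than distinct papers per year.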

Other features:


Future Work & Conclusion

In our prototype, we have chosen to demonstrate the features that we think are most essential to the tasks involved in querying and browsing the citation database. One of the main goals of an interactive interface is to maintain rapid feedback. When considering possible visualizations or feature extensions, we ought to keep in mind the trade-off between these additions and response time. Work remains to be done to convert the entire database into the tagged format we proposed. We have laid out the issues involved in building a visual query and browsing interface for the citation database. Future work to turn the database into a fully usable product will be carried out by the citation database committee.


References


Last modified: Mon Dec 8 11:53:08 EST 1997 - {reed,ttan,jwanken}@cs.umd.edu