Network Working Group                                          W. Turner
Request for Comments: 1691
Category: Informational                                      August 1994

The Document Architecture for the Cornell Digital Library

Status of this Memo

This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
This memo defines an architecture for the storage and retrieval of
the digital representations for books, journals, photographic images,
etc., which are collected in a large organized digital library.
Two unique features of this architecture are the ability to
reference documents and the ability to create multiple views of a
document.
In 1989, Cornell University and Xerox Corporation, with support from
the Commission on Preservation and Access and later Sun Microsystems,
embarked on a collaborative project to study and to prototype the
application of digital technologies for the preservation of library
material. During this project, Xerox developed the College Library
Access and Storage System (CLASS), and Cornell developed software to
provide network access to the CLASS Digital Library.

Xerox and Cornell University Library staff worked closely together to
define requirements for storing both low- and high-resolution
versions of images, so that the low-resolution images could be used
for browsing over the network and the high-resolution images could be
used for printing. In addition, substantial work was done to define
documents with internal structures that could be navigated. Xerox
developed the software to create and store documents, while Cornell
developed complementary software to allow library users to browse the
documents and request printed copies over the network.

Cornell has defined a document architecture which builds on the
lessons learned in the CLASS project, and is maintaining
library materials in that form.
Turner [Page 1]
RFC 1691 CDL Document Architecture August 1994
Document Architecture
Just as a conventional library contains books rather than pages, so
the electronic library must contain documents rather than images.
During the scanning process, images are automatically linked into
documents by creating document structure files which order the image
files in the same way the binding of a book orders the pages. Thus,
the digital book as currently configured consists of two parts: a set
of individual pages stored as discrete bit map image files, and the
document structure files which "bind" the image files into a
document. In addition, a database entry is made for each
document which permits searching by author and title (i.e.,
bibliographic information). Beyond the order of the pages, the
arrangement of a physical book provides information to readers. The
title page and publication information come first; the table of
contents usually precedes the text; the text is divided into sections
or chapters; if there is an index, it follows the text. The reader
often refers to these components of a book when browsing the
shelves, in order to determine whether to read the book.

The document structure provides direct access to the components of an
electronic document, storing the information that would otherwise be
lost when the book is disbound for scanning.
Document Architecture Requirements
Listed below are the requirements that were initially set down for
the Cornell Digital Library Architecture:
1. The architecture must be open (i.e., published and
available).
2. The architecture should be as simple as possible (to facilitate
product development).
3. The architecture should assume data storage in UNIX file systems.
4. The architecture should allow for standard data usage, such as
FTP and Gopher servers (i.e., pages of a document must exist in a
single directory, and the naming convention used must order them
in the standard collating sequence, such as the series "0001.TIF,
0002.TIF,..., 0411.TIF"). NOTE: a series such as "1.TIF, 2.TIF,...,
10.TIF" would be ordered "1.TIF, 10.TIF, 2.TIF, ..." which is not
acceptable.
5. The architecture should provide for storing the same page
in different formats, for example, when a page of a document is
available at several different resolutions.
6. Low-resolution "thumbnail" images of each page must be stored to
facilitate browsing and sharing of data.
7. The architecture must support distribution of files so that
similar files may be stored together, permitting optimization of
storage use and performance.
8. The architecture must support documents that are composed of
references to all or part of other documents.
9. The architecture must support document components which are
stored on separate servers distributed across the network.
10. The architecture must support not only a hierarchical structure
for each document, but also the ability to define multiple views of
each document.
11. The architecture should accept, rather than dictate, the
structures in which documents will be stored. This will allow
documents created in other ways to be added to the Digital
Library simply by adding database information rather than by
copying or moving files.
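The collating-sequence point in requirement 4 can be checked with a short sketch. The helper name `page_filename` and its defaults are illustrative assumptions; only the file-name patterns come from the memo.

```python
def page_filename(page_number: int, extension: str = "TIF", width: int = 4) -> str:
    """Return a zero-padded page file name such as 0001.TIF (hypothetical helper)."""
    return f"{page_number:0{width}d}.{extension}"


# Zero-padded names keep page order under a plain string sort.
padded = [page_filename(n) for n in (1, 2, 10)]
sorted_padded = sorted(padded)

# Unpadded names do not: "10.TIF" sorts before "2.TIF".
unpadded = [f"{n}.TIF" for n in (1, 2, 10)]
sorted_unpadded = sorted(unpadded)

print(sorted_padded)    # ['0001.TIF', '0002.TIF', '0010.TIF']
print(sorted_unpadded)  # ['1.TIF', '10.TIF', '2.TIF']
```

A directory listing served by FTP or Gopher sorts names as strings, which is why the fixed-width form is the one that preserves page order.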
Document Architecture
A digital library consists of a Digital Library Server,
storage, and a referencing database. A single digital library may
contain one or more collections. Each collection will contain one or
more documents.

The referencing database allows searching for documents by author,
title, and document ID. In the current implementation, the
referencing database is a relational SQL database, and each
collection is represented by a table in the database. It is planned
to migrate to Z39.50 database searching as the preferred method, as
this protocol has been established as the standard for library
applications.

Authorization will be primarily collection-based, although the
architecture will permit authorization checking at any level down to
the individual file. Notification would come only when the user
attempted to open the document or access the particular component.

Each document consists of three components: the logical structure;
the physical references; and the data files.
The logical structure is a logical description of the document.
Conceptually, a document is a tree, with the leaves being the data
files (pages). At a minimum, all documents have a logical structure
which lists the pages in the document and the order in which they
appear. Usually, documents will have a more elaborate structure.
The logical structure relates the logical structure of a document to
the physical references which make up the document.

These physical references map the lowest levels of the document's
logical structure (the leaves of the tree) to the files that contain
the data. Where there are multiple representations of a page, such
as images at various resolutions, these are linked together in the
physical references file.

The data files contain the data making up a document. Any format can
be accommodated: image files, ASCII text, PostScript, etc. However, a
one-to-one correspondence between data files for a given physical
reference is assumed. That is, if there are multiple file types for
a single page, these files should represent exactly the same
information.
Physical References
The Physical References file is the component of the document which
relates logical structures (logical components of documents) to
physical files. Document references, by which a document can be
composed of all or part of other documents possibly residing on
different servers, are handled in the Physical References file.

A document may contain multiple document objects, each of which
contains one or more data objects. When a document contains
physical data (for example, it is created by scanning or importing
images), a Master Document Object is created. When a document
incorporates components of other documents, a Reference Document
Object is created for each of the other documents. The Document
Objects are numbered with internal reference numbers, which are
included in the corresponding Data Object lines.

Data Object lines include the Document Object number, the file
reference number, and the file type. The Document Object number
refers to a Document Object line, from which the library name,
collection name, and document ID can be retrieved. The tuple
<library name>+<collection ID>+<document ID>+<file type>+<file reference>
is guaranteed to locate a file. Each Data Object line refers to a
single file; where multiple file types of a single document page
exist, there will be multiple Data Object lines for that page.
In the file, all Document Object lines will precede all Data Object
lines for a given document. Document Object lines may be
grouped together at the beginning of the file, or may immediately
precede the first Data Object line for the Document Object. Document
Object lines will appear in order by Document Object number. Data
Object lines will appear in order by sequence number, NOT by Document
Object number.
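The ordering rules above can be expressed as a small consistency check. This is a minimal sketch under stated assumptions: the record tuples `("document", number)` and `("data", document_number, sequence)` are a hypothetical in-memory form, not part of the memo, and "in order" is read here as strictly increasing.

```python
def check_line_order(records):
    """Return a list of ordering problems for records given in file order.

    Each record is ("document", doc_no) or ("data", doc_no, sequence);
    this tuple layout is an assumption made for illustration.
    """
    problems = []
    seen_documents = []
    last_sequence = None
    for position, record in enumerate(records, start=1):
        if record[0] == "document":
            doc_no = record[1]
            # Document Object lines appear in order by Document Object number.
            if seen_documents and doc_no < seen_documents[-1]:
                problems.append(f"record {position}: Document Object {doc_no} out of order")
            seen_documents.append(doc_no)
        else:
            _, doc_no, sequence = record
            # A Data Object line must follow its Document Object line.
            if doc_no not in seen_documents:
                problems.append(f"record {position}: Data Object precedes Document Object {doc_no}")
            # Data Object lines appear in order by sequence number.
            if last_sequence is not None and sequence <= last_sequence:
                problems.append(f"record {position}: sequence {sequence} out of order")
            last_sequence = sequence
    return problems


grouped = [("document", 0), ("document", 1), ("data", 0, 1), ("data", 1, 2), ("data", 0, 3)]
interleaved = [("document", 0), ("data", 0, 1), ("document", 1), ("data", 1, 2)]
misplaced = [("data", 0, 1), ("document", 0)]
```

Both the grouped and the interleaved layouts pass, matching the two placements the text allows; the `misplaced` list is flagged because its Data Object comes first.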
The fields in the Physical References file are delimited by vertical
bars.
Document Object Lines
Field  Description
-----  ----------------------  ----------------------------
1      Document Object number  0 => Master Document
                               1-9 => Reference Document
2      Library name            Server
3      Collection name
4      Document ID             8-digit
5      Author
6
7
8
Data Object Lines
Field  Description                Description
-----  -------------------------  ----------------------------
1      Document Object number     Corresponds to Document Object line
2      Sequence number
3      File reference             Reference number used to locate
                                  file in filing system
4      Physical reference number  Equal to Logical Structure
                                  structure number
5      File type                  1 = TIFF 600 dpi
                                  2 = TIFF 100 dpi
                                  3 = ASCII version of
                                      (i.e., OCR output)
                                  4 = ASCII
                                  5 =
                                  6 = TIFF 300 dpi
6
Physical References File Example
+0|CORNELL|OLINLIB|00000001|Boole, Mary Everest||Philosophy Of Algebra||
|0|1|00000002|5|1|| (File ref. #2 = Phys. ref. #5 = 600dpi TIFF image)
|0|2|00000003|5|2|| (File ref. #3 = Phys. ref. #5 = 100dpi TIFF image)
|0|3|00000004|6|1|| (File ref. #4 = Phys. ref. #6 = 600dpi TIFF image)
|0|4|00000005|6|2|| (File ref. #5 = Phys. ref. #6 = 100dpi TIFF image)
Note that in the above, it is guaranteed that file references 2 and 3
are two different versions of the same page, as are file references 4
and 5.
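A minimal parsing sketch makes the grouping in this example concrete. It assumes, from the example only, that a Document Object line starts with "+" and a Data Object line starts with "|", and that the parenthesized annotations are not part of the file; the function and key names are illustrative.

```python
from collections import defaultdict

SAMPLE = """\
+0|CORNELL|OLINLIB|00000001|Boole, Mary Everest||Philosophy Of Algebra||
|0|1|00000002|5|1||
|0|2|00000003|5|2||
|0|3|00000004|6|1||
|0|4|00000005|6|2||
"""


def parse_physical_references(text):
    """Split a Physical References file into Document and Data Objects."""
    documents = {}     # Document Object number -> (library, collection, document ID)
    data_objects = []  # one dict per Data Object line, in file order
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line[1:].split("|")  # drop the leading marker character
        if line.startswith("+"):      # assumed Document Object marker
            documents[int(fields[0])] = (fields[1], fields[2], fields[3])
        elif line.startswith("|"):    # assumed Data Object marker
            data_objects.append({
                "document_object": int(fields[0]),
                "sequence": int(fields[1]),
                "file_reference": fields[2],
                "physical_reference": int(fields[3]),
                "file_type": int(fields[4]),
            })
    return documents, data_objects


def files_by_page(data_objects):
    """Group file references by physical reference (page), keyed by file type."""
    pages = defaultdict(dict)
    for obj in data_objects:
        pages[obj["physical_reference"]][obj["file_type"]] = obj["file_reference"]
    return dict(pages)


documents, data_objects = parse_physical_references(SAMPLE)
pages = files_by_page(data_objects)
print(pages)
# {5: {1: '00000002', 2: '00000003'}, 6: {1: '00000004', 2: '00000005'}}
```

Each key of `pages` is one physical reference, and the inner mapping lists the alternative versions of that page by file type, which is the pairing the note describes.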
Logical Structure
The Logical Structure file is the component of the document
architecture which offers "views" of a document and links images
logically to define documents. The file is actually an unloaded tree;
when a document is "opened", the file is read and the tree is
reconstructed. By convention, all Logical Structure files contain a
logical structure "PAGES" which defines the document by listing the
pages in the order in which they appeared in the original document.
Document Structure Lines
Field  Description
-----  -----------------------  ----------------------------
1      Parent structure number  Structure is a child of...
2      Sequence number
3      Logical Structure name   Label for this structure
4      Structure number         Equal to Physical Reference number
5      Logical Children         # of logical children of
                                this structure
6      Physical Children        # of physical children of
                                this structure
7      References               # of references to this
                                structure within this document
                                (for how many structures is this
                                a substructure?)
Logical Structure File Example
|0|0|ROOT|0|4|0|0| Structure 0, ROOT, has 4 logical children
|0|1|PAGES|1|100|0|1| Str. 1, PAGES, has 100 logical children
|0|2|CONTENTS|2|22|0|1| Str. 2, CONTENTS, has 22 logical children
...has no physical children
...
|1|1|Production note|5|0|2|2| Str. 5 is child of structure 1
...has a label "Production note"
...has no logical children
...has 2 physical children
...is referenced twice in this document
|1|2||6|0|2|1| Str. 6 has no label
|1|3||7|0|2|1| Str. 7 has 2 physical children
|1|4||8|0|2|1| Str. 8 is referenced only once
|1|5||9|0|2|1| Str. 9 is 5th sequential child of str. 1
...
|1|99||103|0|2|2|
|1|100||104|0|2|2|
|2|1|Production note|105|1|0|1| Str. 105 is a child of str. 2
|2|2|Title page|106|1|0|1| Str. 106 has 1 logical child
|2|3|Table of contents|107|2|0|1|
|2|4|Chapter 1. From Arithmetic to Algebra|108|6|0|1|
|2|5|Chapter 2. The Making of Algebras|109|4|0|1|
|2|6|Chapter 3. Simultaneous Problems|110|4|0|1|
|2|7|Chapter 4. Partial Solutions...|111|3|0|1|
|2|8|Chapter 5. Mathematical Certainty...|112|3|0|1|
|2|9|Chapter 6. The First Hebrew Algebra|113|8|0|1|
|2|10|Chapter 7. How to Choose our Hypotheses|114|9|0|1|
|2|11|Chapter 8. The Limits of the Teachers Function|115|5|0|1|
|2|12|Chapter 9. The Use of Sewing Cards|116|4|0|1|
...
|2|20|Chapter 17. From Bondage to Freedom|124|5|0|1|
|2|21|Appendix|125|2|1|1|
|2|22|advertisements|126|4|1|2|
|105|1|Production note|5|0|2|2| Str. 5 is a child of str. 105
|106|1|Title page|11|0|2|2| 2nd reference to str. 11
|107|1|7|15|0|2|2|
|107|2|8|16|0|2|2|
...
|126|4||104|0|2|2|
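Reconstructing the tree from such a file can be sketched as follows. The sample is a hypothetical, reduced variant of the example above with the child counts adjusted to match, and the function names are illustrative. Because one structure can be referenced from several parents, labels are kept per line, not per structure.

```python
from collections import defaultdict

# Hypothetical reduced Logical Structure file (not the full example above).
SAMPLE = """\
|0|0|ROOT|0|2|0|0|
|0|1|PAGES|1|2|0|1|
|0|2|CONTENTS|2|1|0|1|
|1|1|Production note|5|0|2|2|
|1|2||6|0|2|1|
|2|1|Production note|105|1|0|1|
|105|1|Production note|5|0|2|2|
"""


def load_structure(text):
    """Map each parent structure number to its ordered (sequence, label, number) children."""
    children = defaultdict(list)
    for line in text.splitlines():
        if not line.startswith("|"):
            continue
        parent, seq, label, number = line.split("|")[1:5]
        parent, seq, number = int(parent), int(seq), int(number)
        if parent == number:  # the ROOT line names itself as parent
            continue
        children[parent].append((seq, label, number))
    for entries in children.values():
        entries.sort()        # order children by sequence number
    return children


def outline(children, number=0, depth=0):
    """Return an indented outline of the view rooted at a structure."""
    lines = []
    for _seq, label, child in children.get(number, []):
        lines.append("  " * depth + f"{label or '(unlabeled)'} [{child}]")
        lines.extend(outline(children, child, depth + 1))
    return lines


def leaves(children, number):
    """Return the leaf structure numbers (pages) under a structure, in order."""
    kids = children.get(number, [])
    if not kids:
        return [number]
    result = []
    for _seq, _label, child in kids:
        result.extend(leaves(children, child))
    return result


tree = load_structure(SAMPLE)
print("\n".join(outline(tree)))
print(leaves(tree, 1))  # pages under PAGES: [5, 6]
print(leaves(tree, 2))  # pages under CONTENTS: [5]
```

Structure 5 is reached through both PAGES and CONTENTS, which is the sense in which the file offers multiple views over the same pages.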
Implementation
The tuple <library name>+<collection ID>+<document ID>+<file type>+<file
reference> is guaranteed to locate a file. A file locator
program will translate between this tuple and the fully-qualified
path and file name in the underlying file system. While a library
will always have a hierarchical nature corresponding to UNIX file
systems, the order of the hierarchy will be flexible to allow
optimization efforts. Each level of the hierarchy will have an
information file that describes the order of the lower levels of the
hierarchy. The file locator program will read these files as it
navigates the directory structure of the file system when a library,
collection, or document is opened. Two examples follow:
Example 1. Hierarchy is LIBRARY, COLLECTION, DOCUMENT, FILETYPE
/<library name>
   LIBINFO.TXT Description of library
   /<collection name>
      COLINFO.TXT Description of collection
      /<document ID>
         DOCINFO.TXT Description of document
         LOGSTR.000 Logical structure file
         PHYSREF.000 Physical references file
         /<file type>
            00001.
            00002.
            ...
         /<file type>
            00001.
            00002.
            ...
Example 2. Hierarchy is LIBRARY, FILETYPE, COLLECTION, DOCUMENT
/<library name>
   LIBINFO.TXT Description of library
   /<file type>
      /<collection name>
         COLINFO.TXT Description of collection
         /<document ID>
            DOCINFO.TXT Description of document
            LOGSTR.000 Logical structure file
            PHYSREF.000 Physical references file
            00001.
            00002.
            ...
   /<file type>
      /<collection name>
         COLINFO.TXT Description of collection
         /<document ID>
            DOCINFO.TXT Description of document
            LOGSTR.000 Logical structure file
            PHYSREF.000 Physical references file
            00001.
            00002.
            ...
This implementation involves some redundancy, but it permits
copies of a collection to be mounted on different file systems for
performance considerations. In particular, the second scheme will
facilitate storing all low-resolution images on high-speed
disk for fast access, and all high-resolution images on slower, less
expensive storage. This will also facilitate authorizing access to
low-resolution images by other software systems (FTP, Gopher) while
restricting access to high-resolution images.
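The translation from locating tuple to path under the two example hierarchies can be sketched as below. Several things here are assumptions: the directory names in `FILETYPE_DIRS` are hypothetical (the memo does not give them), the level order is hard-coded instead of being read from the information files, and the file reference is used as the file name without an extension.

```python
from pathlib import PurePosixPath

# Hypothetical directory names for file types 1 and 2.
FILETYPE_DIRS = {1: "tiff600", 2: "tiff100"}

# Level order for the two example hierarchies.
ORDERINGS = {
    "example1": ("library", "collection", "document", "filetype"),
    "example2": ("library", "filetype", "collection", "document"),
}


def locate(library, collection, document_id, file_type, file_reference,
           ordering, root="/"):
    """Translate a locating tuple into a path for the chosen hierarchy order."""
    parts = {
        "library": library,
        "collection": collection,
        "document": document_id,
        "filetype": FILETYPE_DIRS[file_type],
    }
    path = PurePosixPath(root)
    for level in ORDERINGS[ordering]:
        path = path / parts[level]
    return path / file_reference


by_document = locate("CORNELL", "OLINLIB", "00000001", 1, "00000002", "example1")
by_filetype = locate("CORNELL", "OLINLIB", "00000001", 1, "00000002", "example2")
print(by_document)  # /CORNELL/OLINLIB/00000001/tiff600/00000002
print(by_filetype)  # /CORNELL/tiff600/OLINLIB/00000001/00000002
```

In the second ordering every file of one type shares a subtree directly under the library, so that subtree can be mounted on its own file system or exported separately.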
Security
Security issues are not discussed in this memo.
References
[1] Turner, W., "Cornell Digital Library Document Architecture,
    Version 1.1 - 3/22/94", Library Technology Department, Cornell
    University.
Author's Address
William Turner
Library Technology Department
502 Olin Library
Cornell University
Ithaca, NY 14853
Phone: 607-255-9098
Fax: 607-255-9346
EMail: wrt1@cornell.