Artemis Project: Automated Web Page Processing

THE ARTEMIS PROJECT
PRIVATE ENTERPRISE ON THE MOON

Web Site Design
Section 9.4.2.

Automated Web Page Processing

Historical Note

This document is getting a bit dated. We retain it in the Artemis Data Book because of its historical interest, and because even now people enjoy reading the tragic tale of John and Gillian Hillclimber.

I wrote the following note some time around 1996. It planted the seed for the ASI Web Management System, which eventually grew up to be the revolutionary WebSite Director and led to the creation of CyberTeams, Inc. as one of the world's leading network software companies.

With a team spread out all over the world, with representatives on every continent, we had to create a revolution in cyber space before we could create a revolution in outer space!

The system I describe below isn't the way WebSite Director really works. The final web management system stores its source data in a full database system. (Technically speaking, it can use mSQL or any ODBC-compliant database.) It maintains the tables of contents with an elegant template system and has its own template-processing language. This gives WSD so much flexibility that it can be used to efficiently maintain and control any site on the World Wide Web. However, WSD still provides most of the functions I requested in these original notes. Some of the additional features will appear in a companion software package still under development at CyberTeams.

Theory

We need to decrease the amount of effort we're putting into maintaining the web pages in the Artemis Data Book. Having a cgi script piece together pages out of master lists of email addresses and common header/footer code would save us enormous amounts of labor -- time which we can spend getting to the moon.

We could have separate text files named "foobar.body", which get incorporated into "foobar.html" files. The indexer would only list the "foobar.html" files it finds in a given directory.

     +--------------------------------------------------------+
     |                         Header                         | added
     |  Tags: html, doctype, keyword comments, head, body     | by
     |  Output: page title, doc section number, doc name      | machine
     |          author, standard decorations                  |
     +--------------------------------------------------------+
     +--------------------------------------------------------+
     |                          Body                          |
     |  Tags: author, maintainer, title, doc name, etc.       | maintained
     |  Output: body of document, in-lined images, internal   | by
     |          navigation links                              | human
     |                                                        |
     |  Note: Info tags need not be in the .html file, only   |
     |        in the .body file                               |
     +--------------------------------------------------------+
     +--------------------------------------------------------+
     |                         Footer                         |
     | Tags: /body, /html                                     | added
     | Output: navbar, author link, maintainer link, mod date | by
     |         copyright declaration, link to local index     | machine
     |                                                        |
     +--------------------------------------------------------+
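
As a rough illustration of the three-part layout above, here is a minimal Python sketch of a builder that wraps a human-maintained body in a machine-generated header and footer. The template text and the names assemble_page and build_html_file are assumptions made up for this example; the note does not specify the actual decorations or the scripting language.

```python
from pathlib import Path

# Hypothetical header and footer templates; the real decorations are not
# specified in this note.
HEADER = "<html>\n<head><title>{title}</title></head>\n<body>\n<h1>{section} {title}</h1>\n"
FOOTER = "<hr>\n<p>Maintained by {maintainer}</p>\n</body>\n</html>\n"


def assemble_page(body_text: str, title: str, section: str, maintainer: str) -> str:
    """Wrap a human-maintained body in a machine-generated header and footer."""
    header = HEADER.format(title=title, section=section)
    footer = FOOTER.format(maintainer=maintainer)
    return header + body_text.rstrip("\n") + "\n" + footer


def build_html_file(body_path: Path, title: str, section: str, maintainer: str) -> Path:
    """Read foobar.body and write foobar.html next to it."""
    html_path = body_path.with_suffix(".html")
    page = assemble_page(body_path.read_text(encoding="utf-8"), title, section, maintainer)
    html_path.write_text(page, encoding="utf-8")
    return html_path
```

Because the header and footer live in one place, changing the site style would mean editing the two templates and rebuilding, not touching any .body file.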

To make this work, we need to establish a standard format for telling the indexer the name of each document. That could be a separate file listing all the foobar.body files and the document name for each, or it could be a standard-format comment field contained in each foobar.body file. I think it would be easiest to maintain the documents if we established a standard form for putting comments in the foobar.body documents.

The standard format for tags in a .body document might be:

     <!-- asidocname="Subject of this essay" -->
     <!-- asidocsection="4.6.5.3.1" -->
     <!-- asiauthor="H. W. Wrotethis" -->
     <!-- asimaintainer="I. M. Aitchtiemell" -->
     <!-- asicopyright="Mark Territory, Inc." -->
     <!-- asiorigdate="mm-dd-yyyy" -->

The text "asi" is added to the tag names because future versions of html might add some keywords, but it's unlikely they'll use "asi" in any of them. These comments need to be in the foobar.body files, but do not need to be in the foobar.html file that the indexing engine assembles.
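
To show how little code the comment convention would need, here is a minimal Python sketch that pulls the asi tags out of a .body file with a regular expression. The function names parse_asi_tags and strip_asi_tags are hypothetical, and the pattern assumes each tag sits in its own comment in exactly the form shown above.

```python
import re

# Matches comments such as <!-- asidocname="Subject of this essay" -->
ASI_TAG = re.compile(r'<!--\s*(asi[a-z]+)\s*=\s*"([^"]*)"\s*-->')


def parse_asi_tags(body_text: str) -> dict[str, str]:
    """Return a dict of asi* tag names to values found in a .body file."""
    return {name: value for name, value in ASI_TAG.findall(body_text)}


def strip_asi_tags(body_text: str) -> str:
    """Remove the asi* comments so they do not appear in the built .html file."""
    return ASI_TAG.sub("", body_text)
```

A tag with a missing closing quote would simply not match, so a real builder would probably want to warn about .body files that lack an asidocname.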

Including the ADB section number allows us to do some interesting things:

The indexer adds the document section number to the titles at the top of the page, and picks up the section title (not to be confused with the name of the document it's processing).
New documents submitted via anonymous ftp could remain in a holding tank until an authorized webmaster's apprentice tells a script to add them to the Artemis Data Book. That script could parse the docsection to determine the correct destination directory, move the file there, add the new filename and asidocname to the list of active documents in that directory, and tell the indexer that it just added a new document.
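
Here is a minimal Python sketch of the "parse the docsection" step. The mapping is purely an assumption for illustration: it treats each level of the section number as one directory level under a made-up root of /adb, with the last component identifying the document itself. The real directory layout of the Data Book is not described in this note.

```python
from pathlib import PurePosixPath


def destination_directory(docsection: str, root: str = "/adb") -> PurePosixPath:
    """Map a section number like "4.6.5.3.1" to a directory like /adb/4/6/5/3.

    Assumption: each level of the section number is one directory level, and
    the last component identifies the document within its directory.
    """
    parts = docsection.strip().rstrip(".").split(".")
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"not a section number: {docsection!r}")
    return PurePosixPath(root, *parts[:-1])
```

Rejecting malformed section numbers up front keeps a typo in a submitted file from moving it out of the holding tank into the wrong place.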

This means the guys maintaining the documents only need to worry about making the actual document content work. If we decide to change the style of headers and footers, they'll change consistently throughout the whole web site.

Also, if the *.body documents contained a 50 word abstract of the document, it would allow us far more flexibility in file lists, search engines, and updates on the Artemis Data Book, like the whatsnew.html file.

It also makes it easy for someone to submit a new document -- just anonymous ftp it to a holding tank, and then the ECTC webbers take over. That gives us the same control the sysops have on GEnie. If the webmaster's apprentice isn't sure, he'll know the section boss for that part of the Data Book, and can easily ask if the document should be added (or updated) by emailing a URL pointing to the file in the holding tank.

To make it really easy to get data into the web, the script might even check to see if the *.body file contains html tags. If not, it could assume it's plain text and automatically surround it with <pre></pre>.
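
The plain-text check could be as simple as the following Python sketch. The function name ensure_html and the tag-detection pattern are assumptions; the heuristic only looks for something shaped like an opening or closing tag, and it assumes any asi comments have already been read and removed from the text.

```python
import html
import re

# Rough heuristic: anything that looks like an opening or closing tag,
# e.g. <p>, </a>, <img src=...>.
HTML_TAG = re.compile(r"</?[a-zA-Z][^<>]*>")


def ensure_html(body_text: str) -> str:
    """Return body_text unchanged if it has HTML tags, else wrap it in <pre>."""
    if HTML_TAG.search(body_text):
        return body_text
    # Escape &, < and > so the plain text displays literally inside <pre>.
    return "<pre>\n" + html.escape(body_text, quote=False) + "</pre>\n"
```

Escaping the text before wrapping it matters: a plain-text essay that happens to contain "2 < 3" should not confuse the browser.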

Server Time

Perhaps instead of having a cgi script that runs every time someone accesses a URL, it would be better to have a script that builds the html file and updates the indexes when things change:

When a document is added or changed, update indexes from that point on up the directory tree.
When a person's address changes, scan the whole site for references; or just rebuild all the pages and indexes.

We want to reduce server-side processing time. There's no need to assemble a page each time it's delivered, if that page will only change once a year. At the cost of a few KB to store duplicate copies of the Nav Bar all over the site, we can eliminate the need for processing every page with a script.
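
One way to build pages only when things change is to compare file modification times, much as make does. The following Python sketch is an assumption about how that could look; the names needs_rebuild and stale_pages are made up for this example.

```python
from pathlib import Path


def needs_rebuild(html_path: Path, source_paths: list[Path]) -> bool:
    """True if the .html file is missing or older than any of its sources."""
    if not html_path.exists():
        return True
    built = html_path.stat().st_mtime
    return any(src.stat().st_mtime > built for src in source_paths)


def stale_pages(site_root: Path) -> list[Path]:
    """List the .body files under site_root whose .html output is out of date."""
    return sorted(
        body for body in site_root.rglob("*.body")
        if needs_rebuild(body.with_suffix(".html"), [body])
    )
```

If a page also pulls in shared data, such as a master list of email addresses, that file would be added to source_paths so a change to it marks the page as stale too.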

The tricky part will be making the pages maintainable. We might want to split it out so that each html document has three parts: header, footer, and body. The body is the part maintained by humans, while the header and footer are maintained by machine.

An Example

In the file, dihydrogen-monoxide-acquisition.body, we have the following code:

     <p><a href="mailto:<asilookup emailaddress="hillclimber.john">">
     <asilookup nickname="hillclimber.john"></a>
     and
     <a href="mailto:<asilookup emailaddress="hillclimber.gillian">">
     <asilookup nickname="hillclimber.gillian"></a>
     went up the hill to fetch a pail of water.</p>

Now, the script that's building dihydrogen-monoxide-acquisition.html knows to watch out for "<asilookup ...>" tags. It also knows that the standard format for files in the personnel directory is:

                        Jack's file              Jill's file
       Item             hillclimber.john.data    hillclimber.gillian.data
       ------------     ---------------------    ------------------------
       filename         hillclimber.john         hillclimber.gillian
       formalname       John Q. Hillclimber      Gillian A. Hillclimber
       nickname         Jack                     Jill
       emailaddress     jack@waterfall.net       gillian@bucket.com
       whateverelse     Fragile skull            Likes to roll down hills

Carrying this theme out to the next step, the personnel directory would contain data files from which the *.html biographies in the /bios directory are assembled.

So, the script can go to the personnel directory, look up the nickname and emailaddress items in files hillclimber.john.data and hillclimber.gillian.data, and find the text that should replace the "asilookup" tags. It outputs file dihydrogen-monoxide-acquisition.html containing the following code:

     <p><a href="mailto:jack@waterfall.net">
     Jack</a>
     and
     <a href="mailto:gillian@bucket.com">
     Jill</a>
     went up the hill to fetch a pail of water.</p>

When this is read by a browser, it comes out with the "mailto" links embedded in the displayed text, thus:

     Jack and Jill went up the hill to fetch a pail of water.
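
The substitution step itself could be sketched in Python as follows. This is only an illustration: the on-disk format of the *.data files is not specified here, so the sketch uses a hypothetical in-memory dictionary keyed by the data file name, and the name expand_lookups is made up.

```python
import re

# Hypothetical in-memory copy of the personnel directory, keyed by filename.
PERSONNEL = {
    "hillclimber.john": {
        "formalname": "John Q. Hillclimber",
        "nickname": "Jack",
        "emailaddress": "jack@waterfall.net",
    },
    "hillclimber.gillian": {
        "formalname": "Gillian A. Hillclimber",
        "nickname": "Jill",
        "emailaddress": "gillian@bucket.com",
    },
}

# Matches <asilookup item="filename">, e.g. <asilookup nickname="hillclimber.john">
ASILOOKUP = re.compile(r'<asilookup\s+([a-z]+)="([^"]+)">')


def expand_lookups(body_text: str, personnel: dict[str, dict[str, str]]) -> str:
    """Replace every asilookup tag with the named item from the personnel data."""
    def replace(match: re.Match) -> str:
        item, filename = match.group(1), match.group(2)
        # A KeyError here flags a typo in the .body file at build time.
        return personnel[filename][item]
    return ASILOOKUP.sub(replace, body_text)
```

Letting an unknown person or item raise an error, instead of silently leaving the tag in place, means a misspelled lookup gets caught when the page is built and not by a reader.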

If we wanted to add a link to Jack's biography and also the full text of Jack's email address, we'd look up the address twice, thusly:

     <p>This page is maintained by
     <a href="/bios/hillclimber.john.html">
     <asilookup formalname="hillclimber.john">
     </a>
     <a href="mailto:<asilookup emailaddress="hillclimber.john">">
     &lt;<asilookup emailaddress="hillclimber.john">&gt;
     </a>.</p>

The page-builder would output:

     <p>This page is maintained by
     <a href="/bios/hillclimber.john.html">
     John Q. Hillclimber
     </a>
     <a href="mailto:jack@waterfall.net">
     &lt;jack@waterfall.net&gt;
     </a>.</p>

So the browser would display:

     This page is maintained by John Q. Hillclimber <jack@waterfall.net>.
                                ------------------- --------------------
                                       |                     |
                                link to Jack's bio     mailto: link

Why This is Important

Now, Jack is a hydrological engineer on disability retirement (sadly, he broke his skull on the job). He has his own ISDN line at home, and wants to move to the moon in a few years. He loves to do this web stuff, so helping with the ASI web site is his main hobby. He's such a ubiquitous presence throughout the web site that he's the designated maintainer of a couple of hundred pages. (He can keep up with that because once they're developed most of the pages don't change very often.) Of course Jack is also actively involved in designing life support systems and long-range planning for the lunar community's water works. (He is particularly interested in personnel safety.)

The trouble is, Jack's email address changed when he had that industrial accident; he had to shift to an ISP instead of using his company address. Then his ISP went bankrupt, so he started his own ISP company (waterfall.net), and his email address changed again.

The good news is that since someone spent a few hours coding up the script that builds the *.html files and handles the "asilookup" tags, we only had to update Jack's email address in one place, and then run the script to rebuild the *.html files. That saved thousands of hours, which otherwise would have been invested in manually updating all those email links in all the files Jack maintains and then finding all the bugs introduced by human error.

A Step Farther

Another time-saving possibility is to establish a standard format for referring to a person and have the html-builder insert all the stuff we need whenever we refer to a person. That would include the person's name, nickname, a link to the personal bio page, and a mailto reference.

For instance, suppose someone wanted to include a reference to a hypothetical person called "Greg Bennett". In this specific application, the output we want is ...

     Director, Office of Space Flight
         Gregory Bennett (Greg) <grb@asi.org>
         ---------------          -----------

The "Gregory Bennett" link points to /bios/bennett.gregory.html. The other one is obviously a mailto link. The nickname, in parentheses, is inserted so folks know what to call Greg in a personal greeting. (That knowledge really enhances communication, like name tags at a convention.)

To make this happen, in the *.body file, the html writer would just use a flag in the "asilookup" tag to tell the html-builder routine to include a full personal reference. So, in the *.body file, the code would look like this:

     <ul>
       <li>Director, Office of Space Flight
       <ul>
          <li><asilookup personal-ref="bennett.gregory">
       </ul>
     </ul>

When it sees that "asilookup" tag, the html-builder goes out and grabs the formal name, nickname, and email address from file /personnel/bennett.gregory and assembles the data in a standard output format for a personal reference.

Note that text formatting -- paragraphs, indentations, text style -- is the responsibility of the person coding up the *.body file. All we're providing here is a standard-format personal reference. The html-builder program doesn't care if it finds the asilookup tag in the middle of a paragraph, in a table, or even hiding in the footer. (There will be lots of these in footers, in small type. The html-builder program itself can use this block of code for standard personal reference.)

We'll probably still want to be able to look up individual items like names, email addresses, and so on, but a single specification that outputs a personal reference in a standard format will make life lots easier as time goes on.

The idea of including a nickname came from lots of companies who use the convention of putting a nickname in parentheses on badges, so we can file the serial numbers off and claim that convention as our own. The html-builder would check to see if nickname = firstname when it grabs the data from the /personnel directory, and insert the nickname in parentheses if they're not the same.
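
A standard personal reference, including the nickname check, could be assembled along these lines. This Python sketch is hypothetical: the function name personal_reference and the dictionary layout are made up, and it assumes the first word of the formal name is the first name.

```python
import html


def personal_reference(filename: str, person: dict[str, str]) -> str:
    """Build the bio link, optional nickname, and mailto link for one person."""
    formal = person["formalname"]
    nickname = person["nickname"]
    email = person["emailaddress"]
    parts = [f'<a href="/bios/{filename}.html">{html.escape(formal)}</a>']
    # Assumption: the first word of the formal name is the first name.
    if nickname != formal.split()[0]:
        parts.append(f"({html.escape(nickname)})")
    # The angle brackets stay outside the mailto link.
    parts.append(f'&lt;<a href="mailto:{email}">{html.escape(email)}</a>&gt;')
    return " ".join(parts)
```

For the Greg Bennett example, calling personal_reference("bennett.gregory", ...) with formal name "Gregory Bennett", nickname "Greg", and address grb@asi.org would produce the bio link, "(Greg)", and the mailto link in that order; someone whose nickname matches the first name would get no parenthesized nickname.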

Strategic thinking

If the Electronic Communications Technical Committee can set this up, it will make maintaining the pages that contain any list of people's names much easier. As we get the organization charts on line, we'll have lots of lists of names in each committee and organizational element.

The personnel directory files don't have to describe people, either. We could include companies, organizations, and mailing lists as well. The important thing is to be able to write one of those *.body files knowing the html builder will look up the most current data.

Copyright © 2007 Artemis Society International, for the contributors. All rights reserved. Updated Wed, Apr 8, 1998
Content provided by Gregory Bennett, maintained by ASI Web Team <asi-web@asi.org>.
Maintained with WebSite Director. - "The Artemis Project" is a trademark of The Lunar Resources Company.