| SearchEngine: 
  Building the Dependency List
  
    
   The word database is constructed by parsing a series of HTML or plain text files. This chapter explains how the dependency list is constructed: how to specify a start file, and how linked files are parsed.

   File filters are also discussed. They prevent non-text files from being parsed, and exclude file types, or one or more HTML documents, from the dependency list.
 File name conventions
   Some confusion can arise when the system path separator character is not the same as the URL path separator character "/". Two simple rules determine which character to use:

  File names and paths
    When specifying a filename or filepath, use the system path separator, for example:
    
-f filename    the root HTML filename (required)
-p filepath    intermediate data filepath
-f \www\rational\application\search\search\index.html
-p \www\rational\application\search\search\data\
 URLs
  When specifying a URL (i.e. a reference to one or more of the HTML 
    documents to be parsed) use the URL path separator, for example: 
    
-xwu url       exclude URL from word list
-xwu /www/rational/application/search/doc/index.html
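 As a small illustration of the two rules, the sketch below rewrites a Windows-style file path in URL form using Python's standard pathlib module. The helper name to_url_path is hypothetical and is not part of the SearchEngine.

```python
from pathlib import PureWindowsPath


def to_url_path(file_path: str) -> str:
    """Return file_path written with the URL separator "/" (hypothetical helper)."""
    # PureWindowsPath treats "\" as a separator; as_posix() emits "/" instead.
    return PureWindowsPath(file_path).as_posix()


# File form (as passed to -f or -p) rewritten in URL form (as passed to -xwu).
print(to_url_path(r"\www\rational\application\search\doc\index.html"))
```

 The conversion is purely textual; it does not check that the file exists.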
 The start point
   The -f option is the only required SearchEngine option. It gives the SearchEngine the name of the HTML file from which to start constructing the dependency list. This file is parsed for words, and all of its links, such as <a href="filename.html">, are then checked. Any link that is not specifically excluded (see "Removing documents from the list" below), and that resides on local storage, is added to the dependency list. The SearchEngine repeats this process, parsing the next file on the pending dependency list, until that list is exhausted.
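 The process described above can be sketched in a few lines of Python. This is not the SearchEngine's own code: the names LinkCollector, build_dependency_list, read_document and is_excluded are hypothetical, and an in-memory dictionary stands in for files on local storage.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags in one document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_dependency_list(start, read_document, is_excluded=lambda url: False):
    """Return the documents reachable from start, in the order they are parsed.

    read_document(url) returns the HTML text, or None for a missing file.
    Both callbacks are assumptions made for this sketch.
    """
    pending = deque([start])   # the pending dependency list
    seen = {start}
    dependencies = []
    while pending:             # repeat until the pending list is exhausted
        url = pending.popleft()
        html = read_document(url)
        if html is None:
            continue           # missing link: nothing to parse
        dependencies.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)
            if urlparse(target).scheme:
                continue       # http:, mailto: and similar are not local
            if target in seen or is_excluded(target):
                continue
            seen.add(target)
            pending.append(target)
    return dependencies


# A tiny in-memory "site" standing in for local storage.
site = {
    "/doc/index.html": '<a href="overview.html">o</a> <a href="http://example.com/">x</a>',
    "/doc/overview.html": '<a href="index.html">back</a> <a href="filters.html">f</a>',
    "/doc/filters.html": "<p>no links</p>",
}
print(build_dependency_list("/doc/index.html", site.get))
```

 The seen set keeps documents that link to each other from being parsed twice.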
 In order to reduce recompilation time, the SearchEngine stores the parsed information on local storage, so that only HTML files which have been modified since the last compilation need to be parsed again.

 These intermediary data files are stored in a data directory beneath the current working directory. For example, if the current working directory is:
  
/www/rational/application/search/doc/
 and the file to parse is:  
  
/www/html/index.html
 then the intermediary data file is stored in:  
  
/www/rational/application/search/doc/data/www/html/index.html.data
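 The mapping in this example can be written out as a short Python sketch. The function names are hypothetical, and the timestamp comparison in needs_reparse is only an assumption about how "modified since the last compilation" might be decided; the manual does not say how the SearchEngine detects changes.

```python
from pathlib import PurePosixPath
from typing import Optional


def data_file_for(working_dir: str, source_file: str, data_dir: str = "data") -> str:
    """Mirror source_file beneath working_dir/data_dir and append ".data"."""
    # Drop the leading "/" so the source path nests under the data directory.
    relative = source_file.lstrip("/")
    return str(PurePosixPath(working_dir) / data_dir / (relative + ".data"))


def needs_reparse(source_mtime: float, data_mtime: Optional[float]) -> bool:
    """Assumed rule: parse again if there is no data file or the source is newer."""
    return data_mtime is None or source_mtime > data_mtime


print(data_file_for("/www/rational/application/search/doc/", "/www/html/index.html"))
```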
 You may override this default data directory by specifying your own using the -p option.

 Removing documents from the list
   In some circumstances, linked documents must be excluded from the dependency list. There are many reasons for doing so: the link might not point to a text document (for example, a reference to a .zip or .tar file), or it might point to a programming language file or an applet page.

   Other examples are links to other manuals or groups of HTML documents which have their own separate search databases. How easy or difficult this is to achieve depends on how the HTML documents are structured. If they are all lumped together in one directory, then each document must be fully specified; if they are in different directories, then it is enough to specify the directory and use a filter.
  
  Removing a specific document from the dependency list
   To remove a specific document from the dependency list, use the -xu option, and specify the document's URL path and filename components, for example:
  
-xu /www/rational/application/search/doc/index.html
  
  Removing multiple documents from the dependency list
   To remove multiple documents from the dependency list, use the -xu option with a filter that uses the wildcard character '*'. For example:
  
  
-xu */index.html
 In this example, all URLs ending with /index.html will be excluded from the dependency list.

 Another, more dangerous, example of filtering is:
  
-xu /www/extawt/*
 In the above example, all URLs beginning with /www/extawt/ will be excluded from the dependency list.

 Finally, an even more dangerous example of filtering is:
  
-xu */extawt/*
 In this example, all URLs containing /extawt/ will be excluded from the dependency list.

 No other combinations of the wildcard character '*' are valid. A filter definition of */extawt/*remove.* will result in a (probably useless) filter that ignores all URLs containing /extawt/*remove., and not the probable intention of ignoring all URLs containing both /extawt/ and remove.

 The wildcard character '*' can appear at the start of the URL, at the end of the URL, or both; anywhere else it is treated as an ordinary character.

 Warning: Filters can damage your database...
   The last two examples of filtering above are dangerous in that they can filter out more than originally intended. To understand what can go wrong, we'll look at an example for a fictitious sports site. We assume that the site documents are all stored in a root directory called /sport, with the various sports in sub-directories of this, for example /sport/sailing, /sport/skiing, etc., and that there are links not only from each sport to the root directory, but also between sports.

   We decide that we want to build a general database, using /sport and all sub-directories not related to a specific sport, and several specialized databases, one for each specific sport.

   In the first case, we could remove all the specific sport sub-directories from the dependency list by using a filter for each one:
    
-xu /sport/sailing/*
-xu /sport/skiing/*
-xu /sport/surfing/*
...
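 To make the wildcard rules concrete, here is a minimal Python sketch of how a single -xu filter could be evaluated, applied to the three filters listed above. The function matches_filter is hypothetical; it only encodes the rules stated in this chapter (a leading '*', a trailing '*', or both, with any other '*' taken literally) and is not the SearchEngine's implementation.

```python
def matches_filter(url: str, pattern: str) -> bool:
    """Apply one -xu style filter to a URL (assumed semantics)."""
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*") and len(pattern) > 1
    # Strip only the outermost wildcards; any "*" left inside is literal.
    core = pattern[1:] if leading else pattern
    core = core[:-1] if trailing else core
    if leading and trailing:
        return core in url            # */extawt/*   -> contains
    if leading:
        return url.endswith(core)     # */index.html -> ends with
    if trailing:
        return url.startswith(core)   # /sport/skiing/* -> begins with
    return url == pattern             # no wildcard  -> exact document


general_filters = ["/sport/sailing/*", "/sport/skiing/*", "/sport/surfing/*"]


def excluded(url: str) -> bool:
    return any(matches_filter(url, f) for f in general_filters)


print(excluded("/sport/index.html"))
print(excluded("/sport/sailing/AmericasCup.html"))
```

 With these filters a document in the root /sport directory is kept, while every document under a listed sport is excluded.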
 However, when we come to creating the dependency list for the /sport/sailing sub-directory, we want to remove all documents in the root /sport directory from the dependency list, because they are already covered by the general database. At first you might think that the following does just that:
    
    
-xu /sport/*
-xu /sport/skiing/*
-xu /sport/surfing/*
...
 with the first line removing all documents in the root directory.

 When you compile the list, starting with, say, /sport/sailing/index.html, the resulting dependency list contains only that document. So where are all those other /sport/sailing documents? The answer is that the first filter removed them from the list. Remember, /sport/* means: remove all documents beginning with /sport/, so a document such as /sport/sailing/AmericasCup.html, which does begin with /sport/, will be removed from the dependency list.

 How, then, can the /sport directory be removed without having to specify each document in that directory separately? You use a directory exclusion filter:
    
-xu /sport//
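 One way to read a filter of this form is sketched below in Python, on the assumption that a trailing "//" matches only documents that sit directly in the named directory. The function in_directory_only is hypothetical and is not taken from the SearchEngine.

```python
def in_directory_only(url: str, directory_filter: str) -> bool:
    """True if url names a document directly inside the filtered directory.

    directory_filter is written as on the command line, e.g. "/sport//".
    """
    if not directory_filter.endswith("//"):
        raise ValueError("a directory exclusion filter ends with '//'")
    prefix = directory_filter[:-1]           # "/sport//" -> "/sport/"
    if not url.startswith(prefix):
        return False
    remainder = url[len(prefix):]
    # A further "/" means the document lives in a sub-directory.
    return remainder != "" and "/" not in remainder


print(in_directory_only("/sport/index.html", "/sport//"))
print(in_directory_only("/sport/sailing/AmericasCup.html", "/sport//"))
```

 Unlike the prefix filter /sport/*, this test leaves documents in /sport/sailing untouched.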
 By placing two URL path separator characters "//" at the end of the URL, we tell the SearchEngine to remove all documents in the /sport directory from the dependency list, but not documents in sub-directories of /sport.

 Similarly, if you want to create a database of a sub-topic, say /sport/sailing/windsurfing, you will have to add a directory exclusion filter for each level, /sport and /sport/sailing:
   
    
-xu /sport//
-xu /sport/sailing//

 Generating the list
   In addition to the dependency list, the SearchEngine also accumulates lists of non-text documents (such as images or sound files), links to other sites, and missing links.

   Though the SearchEngine is not strictly designed for this purpose, such lists are useful for checking a series of HTML documents for broken links. The -lu filename option causes the SearchEngine to write these lists to the specified file.
 As an example, the list generated for this manual is presented below:  
  
Document dependency list for file://C:/www/rational/application/search/doc/index.html
file://C:/www/rational/application/search/doc/index.html
file://C:/www/rational/application/search/doc/applet.html
file://C:/www/rational/application/search/doc/database.html
file://C:/www/rational/application/search/doc/dependency.html
file://C:/www/rational/application/search/doc/filters.html
file://C:/www/rational/application/search/doc/installing.html
file://C:/www/rational/application/search/doc/overview.html
Data dependency list for file://C:/www/rational/application/search/doc/index.html
file://C:/www/images/green-ball.gif
file://C:/www/images/background.gif
External links list for file://C:/www/rational/application/search/doc/index.html
http://www.javasoft.com
mailto:bill@microsoft.com 
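 A file written by -lu can be post-processed with a few lines of Python. The sketch below assumes the layout shown in the listing above, where each section starts with a header line containing " list for " and is followed by one entry per line. The function parse_link_report is hypothetical, and the sample text is shortened from the listing.

```python
def parse_link_report(text: str) -> dict:
    """Group the entries of a -lu report under their section headers."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " list for " in line:
            # Header line, e.g. "External links list for file://..."
            current = line.split(" list for ", 1)[0]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


sample = """\
Document dependency list for file://C:/www/rational/application/search/doc/index.html
file://C:/www/rational/application/search/doc/index.html
file://C:/www/rational/application/search/doc/applet.html
Data dependency list for file://C:/www/rational/application/search/doc/index.html
file://C:/www/images/green-ball.gif
External links list for file://C:/www/rational/application/search/doc/index.html
http://www.javasoft.com
"""

report = parse_link_report(sample)
for name, entries in report.items():
    print(name, len(entries))
```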
Copyright 
© 1987 - 2001 Rational Software Corporation
 |