Using URL aliases
=================

mnoGoSearch supports aliases, which let the indexer fetch a site's content from a different location. For example, if you index a local web server, the indexer can take pages directly from disk without involving the web server in the indexing process. Another example is building a search engine for a primary site while indexing from its mirror.

There are several ways of using aliases.

Alias indexer.conf command
--------------------------

Format of the "Alias" indexer.conf command:

    Alias <master URL> <mirror URL>

For example, suppose you wish to index http://search.mnogo.ru/ using the nearest German mirror, http://www.gstammw.de/mirrors/mnoGoSearch/. Add these lines to your indexer.conf:

    Server http://search.mnogo.ru/
    Alias  http://search.mnogo.ru/ http://www.gstammw.de/mirrors/mnoGoSearch/

search.cgi will display URLs from the master site http://search.mnogo.ru/, but indexer will take the corresponding page from the mirror site http://www.gstammw.de/mirrors/mnoGoSearch/.

Another example: you want to index everything in the udm.net domain, and one of the servers, for example http://home.udm.net/, is stored on the local machine in the /home/httpd/htdocs/ directory. These commands will be useful:

    Realm http://*.udm.net/
    Alias http://home.udm.net/ file:/home/httpd/htdocs/

Indexer will take home.udm.net from the local disk and index the other sites using HTTP.

Different aliases for server parts
----------------------------------

Aliases are searched in the order of their appearance in indexer.conf, so you can create different aliases for a server and its parts:

    # First, create an alias for, say, the /stat/ directory, which
    # is not under the common location:
    Alias http://home.udm.net/stat/ file:/usr/local/stat/htdocs/

    # Then create an alias for the rest of the server:
    Alias http://home.udm.net/ file:/usr/local/apache/htdocs/

Note that if you change the order of these commands, the alias for the /stat/ directory will never be found, because the alias for the whole server is found first.
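The effect of alias order can be illustrated with a small Python model. This is only a sketch, not mnoGoSearch code: the resolve() function and the ALIASES list are hypothetical names, and the sketch assumes that an alias applies when the URL starts with the alias's master URL.

```python
# Hypothetical model of first-match alias lookup; this is not mnoGoSearch code.
# Assumption: an alias applies when the URL starts with its master URL.
ALIASES = [
    ("http://home.udm.net/stat/", "file:/usr/local/stat/htdocs/"),
    ("http://home.udm.net/", "file:/usr/local/apache/htdocs/"),
]


def resolve(url, aliases):
    """Return the location for url using the first alias whose prefix matches."""
    for prefix, target in aliases:
        if url.startswith(prefix):
            return target + url[len(prefix):]
    return url  # no alias applies: fetch the URL as it is


# Specific alias listed first: /stat/ is taken from its own directory.
print(resolve("http://home.udm.net/stat/today.html", ALIASES))
# -> file:/usr/local/stat/htdocs/today.html

# Order reversed: the general alias matches first and hides the /stat/ alias.
print(resolve("http://home.udm.net/stat/today.html", list(reversed(ALIASES))))
# -> file:/usr/local/apache/htdocs/stat/today.html
```

With the general alias listed first, the /stat/ entry is never reached, which is why the more specific alias should come first in indexer.conf.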
Using alias in Server command
-----------------------------

You may specify the location used by indexer as an optional argument to the Server command:

    Server http://home.udm.net/ file:/home/httpd/htdocs/

Using alias in Realm command
----------------------------

Aliases in the Realm command are a very powerful feature based on regular expressions. The idea is similar to how the PHP preg_replace() function works. Aliases in the Realm command work only with the "regex" match type and do not work with the "string" match type.

Use this syntax to write Realm aliases:

    Realm regex <URL_pattern> <alias_pattern>

Indexer searches the URL for a match to URL_pattern and builds a URL alias using alias_pattern. alias_pattern may contain references of the form $n, where n is a number in the range 0-9. Every such reference is replaced by the text captured by the n'th parenthesized subpattern. $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern.

Example: your company hosts several hundred users with their domains in the form www.username.yourname.com. Every user's site is stored on disk in "htdocs" under the user's home directory: /home/username/htdocs/.

You may write this command into indexer.conf (note that the dot '.' character has a special meaning in regular expressions and must be escaped with a '\' sign when it is used in its literal meaning):

    Realm regex (http://www\.)(.*)(\.yourname\.com/)(.*) file:/home/$2/htdocs/$4

Imagine indexer processes the page "http://www.john.yourname.com/news/index.html". It will build patterns from $0 to $4:

    $0 = 'http://www.john.yourname.com/news/index.html' (whole pattern match)
    $1 = 'http://www.'      subpattern matches '(http://www\.)'
    $2 = 'john'             subpattern matches '(.*)'
    $3 = '.yourname.com/'   subpattern matches '(\.yourname\.com/)'
    $4 = 'news/index.html'  subpattern matches '(.*)'

Then indexer will compose the alias using the $2 and $4 patterns:

    file:/home/john/htdocs/news/index.html

and will use the result as the document location to fetch it.

Using AliasProg command
-----------------------

You may also specify the "AliasProg" command for aliasing purposes. AliasProg is useful for major web hosting companies which want to index their web space taking documents directly from disk, without involving the web server in the indexing process. The document layout may be too complex to describe with an alias in the Realm command. AliasProg is an external program that takes a URL and writes one string, the appropriate alias, to stdout. Use $1 to pass the URL on the command line.

For example, this AliasProg command uses the 'replace' command from the MySQL distribution and replaces the URL substring "http://www.apache.org/" with "file:/usr/local/apache/htdocs/":

    AliasProg "echo $1 | /usr/local/mysql/bin/mysql/replace http://www.apache.org/ file:/usr/local/apache/htdocs/"

You may also write your own, arbitrarily complex program to process URLs.
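As an illustration of a custom AliasProg program, here is a minimal Python sketch. It is hypothetical, not part of mnoGoSearch: the alias_for() function and the script layout are made up for this example, and it simply reuses the www.username.yourname.com mapping from the Realm example. Returning the URL unchanged when nothing matches is an assumption of the sketch, not documented indexer behaviour.

```python
#!/usr/bin/env python3
# Hypothetical AliasProg helper; the function name and the mapping are assumptions.
import re
import sys

# Same pattern as in the Realm example:
# http://www.<user>.yourname.com/<path> -> file:/home/<user>/htdocs/<path>
PATTERN = re.compile(r"(http://www\.)(.*)(\.yourname\.com/)(.*)")


def alias_for(url):
    """Return the file: alias for url, or url unchanged when it does not match."""
    match = PATTERN.fullmatch(url)
    if match is None:
        return url  # assumption: pass unknown URLs through untouched
    # group(2) is the user name, group(4) is the path below the site root.
    return "file:/home/{}/htdocs/{}".format(match.group(2), match.group(4))


if __name__ == "__main__":
    # The URL arrives as the single command-line argument ($1 in AliasProg).
    if len(sys.argv) == 2:
        print(alias_for(sys.argv[1]))
```

Assuming the script were saved under a path of your choice, for example /usr/local/bin/urlalias.py (a made-up location), it could be wired in as:

    AliasProg "/usr/local/bin/urlalias.py $1"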