A simple In-Page Javascript Search Engine

During development of an intranet bibliographic search and display application we wrote a very simple, yet effective, specialised search engine to retrieve bibliographic records stored in a single HTML page. The bibliographic "database" is stored as plain css formatted html and is generated automatically from a spreadsheet. But this is not the subject of this blog entry, although the structure of the bibliographic records in the HTML page is important and is detailed below.

This blog entry presents a complete solution for a search engine written in Javascript/css inside a single html page, with no server-side components. It extracts its information from an html element that stores the keywords for a record. To emphasise, this is not a general purpose search engine, but a specialised one that acts on data that have a given structure.

We start by defining the data structure the search engine will act on, followed by a description of the visual elements of the search bar, then we present the Javascript code of the search engine itself and finally we give a working example.

The example is here (right-click then Save Link As... to download it). This is an HTML file including code and data.

Data Structure

Bibliographic records generally include a keywords field. This field is a list of words that are significant for the bibliographic record and is used to locate the record while searching for words. In our case the keywords field also includes the author's details. Keywords lists are generally generated automatically by some kind of keyword extraction and ranking code (for example in python, see topia.termextract).

Let's say that a bibliographic record is made of the following fields:

Title
Author's name
Date
Link to paper
Keywords list

That record is converted into the following html structure:

<ol>
<div class="theData" style="display:none;">
    <li>Title - <span class="author">Author's details</span> - <a target="_blank" href="link_to_paper">here</a> (date)</li>
    <span class="BCdico" style="display:none;">list of keywords in the paper separated with a space and all lowercase</span>
</div>
</ol>

First, the enclosing <div> element is of class "theData". This class is not styled with css but is used for referencing in the Javascript code. By default the <div> element is not displayed in the browser (style="display:none;"). Then an <li> element is used to store the title, author's name, link and date of the record. Finally, the last <span> stores the keywords list. Note its class name and that it is not displayed in the browser either. It stays hidden when the record is shown to the user.

The <ol> element encloses all the bibliographic elements.
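The page says the records are generated automatically from a spreadsheet but does not show how. As a minimal sketch of that step, assuming each spreadsheet row has already been read into a plain object, one record could be turned into the structure above as follows. The function names and the field names (title, author, link, date, keywords) are hypothetical and not part of the original page.

```javascript
// Hypothetical helpers, not part of the original page.
// Escape characters that have a special meaning in HTML.
function escapeHtml(s) {
    return String(s)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

// Build the html for one record, following the structure shown above.
function recordToHtml(rec) {
    // Keywords are stored lowercase and separated with single spaces
    var keywords = rec.keywords.join(' ').toLowerCase();
    return '<div class="theData" style="display:none;">\n' +
        '    <li>' + escapeHtml(rec.title) +
        ' - <span class="author">' + escapeHtml(rec.author) + '</span>' +
        ' - <a target="_blank" href="' + escapeHtml(rec.link) + '">here</a>' +
        ' (' + escapeHtml(rec.date) + ')</li>\n' +
        '    <span class="BCdico" style="display:none;">' + escapeHtml(keywords) + '</span>\n' +
        '</div>';
}

// Example record with made-up values
var sampleHtml = recordToHtml({
    title: 'A Sample Title',
    author: 'A. Author',
    link: 'paper.pdf',
    date: '2019',
    keywords: ['Search', 'Engine', 'javascript']
});
console.log(sampleHtml);
```

The generated blocks would then be concatenated and placed inside the enclosing <ol> element.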

Styling of the record is done with the following css rules:

<style>
ol li { border-left:1px solid blue; border-bottom:1px solid darkgrey; padding-bottom:3px; padding-left:5px; }
strong { color:black; font-weight:bolder; }
.BCdico { display:none; }
.author { color:#000066; font-weight:bold; }
</style>

We have just defined the structure of a bibliographic record stored in the HTML page. Next let's see how the search bar is created.

The search bar

In order to search our data we need a visual interface to enter search words and options: the search bar.

Search capabilities include:

"Or" (default): returns an element if at least one search word is found in its keywords list.
"And": returns an element only if all the search words are found in its keywords list.
"Whole words": by default, a search word matching any part of a keyword returns the element. Enabling this option requires the search word to match an entire keyword.
A search word prefixed with a + must be present in the keywords list for that element to be returned.
An element whose keywords list contains a search word prefixed with a - is not returned, even if other search words are in its keywords list.

Limitations: no approximate words, no ranking.
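To make these rules concrete, here is a small DOM-free sketch that decides whether one keywords string satisfies a query. It is only one reading of the rules listed above, written with plain array methods. The function recordMatches and its options object are hypothetical, and the page's own doSearch() function, presented further down, uses regular expressions and differs in some details.

```javascript
// Hypothetical helper, not part of the original page.
// keywords: space-separated lowercase keywords of one record
// query:    the user's search string
// options:  { and: boolean, wholeWords: boolean }
function recordMatches(keywords, query, options) {
    var words = keywords.split(' ');

    // A term "hits" if it equals a keyword (whole words) or is part of one
    function hit(term) {
        return words.some(function (w) {
            return options.wholeWords ? w === term : w.indexOf(term) > -1;
        });
    }

    var required = [], excluded = [], plain = [];
    query.toLowerCase().split(' ').forEach(function (t) {
        if (t === '' || t === '+' || t === '-') { return; }   // skip empty words and bare prefixes
        if (t.charAt(0) === '+') { required.push(t.slice(1)); }
        else if (t.charAt(0) === '-') { excluded.push(t.slice(1)); }
        else { plain.push(t); }
    });

    if (excluded.some(hit)) { return false; }       // a "-" word was found
    if (!required.every(hit)) { return false; }     // a "+" word is missing

    var positive = plain.concat(required);
    if (positive.length === 0) { return false; }
    return options.and ? positive.every(hit) : positive.some(hit);
}

var kw = 'javascript search engine bibliography';
console.log(recordMatches(kw, 'java python', { and: false, wholeWords: false }));   // true
console.log(recordMatches(kw, 'java python', { and: true, wholeWords: false }));    // false
console.log(recordMatches(kw, 'java', { and: false, wholeWords: true }));           // false
console.log(recordMatches(kw, 'search -engine', { and: false, wholeWords: false })); // false
```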

The html code that creates the search bar is:

<!-- The search bar -->
<hr />
 <form action="/">
  <table border="0" cellpadding="0" cellspacing="5" bgcolor="#e9e9e9" style="background-color: #e9e9e9; padding-bottom: 3px; border-bottom: 1px solid darkgrey; width: 100%;">
  <tr>
   <td align="right" valign="middle"><input id="theSearchString" type="text" placeholder="Search.." name="searchS"></td>
   <td align="left" valign="middle"><button id="buttonStr" type="button" onClick='doSearch("")'><svg width="18" height="18"><circle r="5" cx="6" cy="6" id="C1" style="stroke:#000000;stroke-width:2;fill:none;" /><line id="L1" x1="16" y1="16" x2="9" y2="9" style="stroke:#000000;stroke-width:3;fill:none;" /></svg></button></td>
   <td align="center">
    <input type="radio" id="ro" name="ao" value="ro" checked /><label for="ro">Or</label>
    <input type="radio" id="ra" name="ao" value="ra" /><label for="ra">And</label><br />
    <input type="checkbox" id="ww" name="ww" /><label for="ww">Whole words</label>
  </td>
 </tr>
 </table>
</form>
<p style="padding-left:16px">Use the search bar to find what you are looking for!!  -  Type * in the search bar to display all entries.<br />
You can use the <strong>+</strong> or <strong>-</strong> prefixes to require word(s) to be present or to exclude them, respectively, from the search.</p>
<hr />
<!-- end search bar -->

As usual in html, the search bar is made from a <form> element and its <input>, <label> and <button> child elements. We use a <table> element to lay out the display instead of more cumbersome <div> elements.

We use an input element of type "text" for the search box, two input elements of type "radio" to select either the "And" or the "Or" option (but not both), and an input element of type "checkbox" to select or deselect the "Whole words" option.
<label> elements are used to display some text to the right of the <input> elements. Note the for= keyword: it defines the <input> element the <label> refers to. for= points to the id of the <input> element.
To initiate a search the user must click the search button, drawn as a magnifying glass in svg format, to the right of the search box. A <button> element is used for that.

Below the search bar, some help is displayed to present the other options of the search engine.

The search bar is styled with css as defined below:

<style>
table { width:960px; padding-top:0px; padding-bottom:10px; padding-left:40px; padding-right:20px; background-color:#FFF; }
input[type=text] { padding:8px; margin-top:4px; font-size:15px; border:none; }
button { background:#ddd; font-size:18px; border:none; cursor:pointer; margin-top:4px; padding-top:8px; padding-bottom:8px; }
button:hover { background: #CCC; }
#theSearchString { margin-bottom:0px; width: 600px; }
</style>

The <table> element is given a fixed width (960px), a background colour (white = #FFF in hexadecimal) and some padding space inside it.
<input> elements of type "text" (input[type=text] in css ) are given some padding space and no border.
<button> elements are given some padding and margin spaces, a background colour and a cursor icon when the mouse is over them. To be pleasing, the background colour of the <button> elements changes when the mouse hovers over them (button:hover rule).
To target a single element, the search box is styled through its id (#theSearchString; in css, # refers to the id of an html element).

The search bar is now defined. Next we are going to detail how the code works.

Javascript search engine code

The Javascript code is articulated around the following 5 functions. The main function is doSearch(); the other functions are helpers that make the code more readable.

<script language="javascript">
// The search code
function hideResults() { ... }       // Hide previous results
function myReplace(t) { ... }        // Replace words by others
function toSpace(t) { ... }          // Replace , : | ; and tab by spaces
function bgetOrder(text, d) { ... }  // Check for + and - options
function doSearch() { ... }          // Search for words. All matching words return an entry ('or'-type search)
</script>

The following code snippets are commented to detail the logic used.

The helper functions

hideResults(), when called, resets the visibility of every record to hidden. The html page is returned to an empty state.

function hideResults() {
// Hide previous results
    var theData = document.getElementsByClassName('theData');  // Get all DOM elements with class="theData" into an array-like collection
    for (var i=0; i<theData.length; i++) {                     // Loop through each element and set its visibility to hidden
        theData[i].style.display = "none";
    }
}

Next, the myReplace() function replaces words by others, so that user's search words are altered to match words in the keywords list. In our case, we replace "add-on" with "addon", "how to" with "howto", etc.

function myReplace(t) {
// Replace words by others in the user's search string
    var aro = ['ori1', 'ori2' /* , ... */];   // list of words to be changed. E.g. add-on
    var arr = ['new1', 'new2' /* , ... */];   // list of replacement words.   E.g. addon
    for (var i=0; i<aro.length; i++) {     // Loop through the list of words to be replaced
        var rg = new RegExp(aro[i],'g');   // and replace them with the substitute words
        t = t.replace(rg, arr[i]);         // using a regular expression
    }
    return t;                              // return the new search string
}
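As an illustration with concrete values, the sketch below uses the two pairs mentioned above ("add-on" to "addon" and "how to" to "howto"). It is a hypothetical variant, not the page's code: the name replaceLiterals and the pairs table are assumptions, and it replaces text with split/join instead of a regular expression.

```javascript
// Hypothetical variant of myReplace(): each entry is [original, replacement].
var replacements = [
    ['add-on', 'addon'],
    ['how to', 'howto']
];

function replaceLiterals(t) {
    for (var i = 0; i < replacements.length; i++) {
        // split/join replaces every occurrence and treats the text literally
        t = t.split(replacements[i][0]).join(replacements[i][1]);
    }
    return t;
}

console.log(replaceLiterals('how to install an add-on'));  // howto install an addon
```

The design choice here is that split/join never interprets characters such as ".", "+" or "(" as regex operators, which new RegExp(aro[i], 'g') would do if such characters appeared in the list of words to be changed.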

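toSpace() converts the ',', ':', '|', ';' and tab (\t) characters into spaces ' '. In the pattern string below, \\| stands for a literal | character, while the unescaped | separates the alternatives.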

function toSpace(t) {
// Replace , : | ; and tab by spaces
    var rg = new RegExp(',|:|\\||;|\\t','g');  // Use a regular expression to replace the characters with spaces
    return t.replace(rg,' ');                  // return the new string
}
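As written, the pattern in toSpace() matches a comma, a colon, a vertical bar, a semicolon or a tab. The same replacement can be sketched with a regex literal and a character class, which avoids the double escaping needed inside a string. The name toSpaceLiteral is hypothetical.

```javascript
// Hypothetical equivalent of toSpace(), using a character class.
function toSpaceLiteral(t) {
    return t.replace(/[,:|;\t]/g, ' ');
}

console.log(toSpaceLiteral('search,engine;javascript|css:html\tdom'));
// search engine javascript css html dom
```

Each separator becomes one space, so two separators in a row (for example ", ") leave two consecutive spaces in the result.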

bgetOrder() creates a list of words that are compulsory (when d = '+') or must be rejected (when d = '-') and returns that list and an altered search string.

A search word prefixed with a + must be present in the keywords list for that element to be returned.
An element whose keywords list contains a search word prefixed with a - is not returned, even if other search words are in its keywords list.

function bgetOrder(text, d) {
// Check for + and - options
    var lst = [];                                       // define the array storing the tagged words
    var ar = text.split(' ');                           // break the text variable into an array (ar), using space as separator
    for (var i = 0; i < ar.length; i++) {               // Loop through each element of the ar array
        var t = ar[i];                                  // store current array element into variable t
        if ((t.trim() != '') && (t.slice(0,1) == d)) {  // if +/- is at the beginning of the word
            lst.push(t.slice(1));                       // store the word into the lst array
        }
    }
    while (text.indexOf(d) > -1) {                      // remove every +/- character from text (also inside words)
        text = text.replace(d, '');
    }
    return [lst, text];                                 // return an array made of lst and the (modified) original text
}
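Note that the while loop in bgetOrder() removes every occurrence of the prefix character, so with d = '-' a hyphen inside a word is removed too. If hyphenated words need to survive, a variant that strips only the leading character could look like the sketch below. The name splitTagged is hypothetical, and this is an alternative under that assumption, not the page's code.

```javascript
// Hypothetical variant of bgetOrder(): strips only a leading prefix character.
function splitTagged(text, d) {
    var tagged = [];   // words that carried the prefix
    var kept = [];     // every word, without its prefix
    text.split(' ').forEach(function (w) {
        if (w === '' || w === d) { return; }   // skip empty words and a bare prefix
        if (w.charAt(0) === d) {
            w = w.slice(1);                     // strip only the leading prefix
            tagged.push(w);
        }
        kept.push(w);
    });
    return [tagged, kept.join(' ')];
}

console.log(JSON.stringify(splitTagged('engine -add-on search', '-')));
// [["add-on"],"engine add-on search"]
```

Like bgetOrder(), it returns an array made of the tagged words and the modified search string, so it can be used with the same destructuring assignment.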

The main function

The main function is articulated around the following steps:

1. Get the user's search string;
2. Get the keywords list from the DOM (Document Object Model);
3. Get the state of the options;
4. Clear the page (hide all previous results);
5. If the search string is empty, exit the function;
6. If the user wants to return all records (search string = *), loop through all records in the html file and display them;
7. Otherwise:
  a. Format the search string using the helper functions;
  b. Check if the + or - options are present and build the relevant arrays if they are;
  c. Check the And, Or and Whole words options and define the adequate search criteria;
  d. Define the regular expression (regex) from step c above;
  e. For each record, use the regex to find if the search words are present in its keywords list:
    - If present, check the + and - conditions, keeping a flag set to true only if they are met;
    - If the flag is true, display the record;
    - Otherwise keep the record hidden.

We use the power of regular expressions (regex) to define the search criteria for each combination of options (and, or, whole words).
For example, suppose the search string is: findword1 findword2. The following table shows the regex generated for each option combination:

Option	Default	Whole Words
Or	(findword1|findword2)	(\bfindword1\b|\bfindword2\b)
And	(?=[\s\S]*findword1)(?=[\s\S]*findword2)	(?=[\s\S]*\bfindword1\b)(?=[\s\S]*\bfindword2\b)

where:

| is the or operator
\b stands for a word boundary
(?=[\s\S]*...)(?=[\s\S]*...) is a sequence of lookaheads that defines an and condition
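The construction of these patterns can be sketched in isolation. The helper buildRegex below is hypothetical: it mirrors the option combinations described above and is meant for experimenting with them outside the page, not as a replacement for the page's own code.

```javascript
// Hypothetical helper: build the regex for a space-separated search string.
function buildRegex(text, and, wholeWords) {
    var b = wholeWords ? '\\b' : '';                     // optional word boundary
    var words = text.split(' ').filter(function (w) { return w !== ''; });
    var parts = words.map(function (w) {
        // 'and': one lookahead per word; 'or': the bare (or bounded) word
        return and ? '(?=[\\s\\S]*' + b + w + b + ')' : b + w + b;
    });
    return new RegExp(and ? parts.join('') : '(' + parts.join('|') + ')');
}

console.log(buildRegex('findword1 findword2', false, false).source);
// (findword1|findword2)
console.log(buildRegex('findword1 findword2', true, true).source);
// (?=[\s\S]*\bfindword1\b)(?=[\s\S]*\bfindword2\b)

var keywords = 'javascript search engine';
console.log(buildRegex('java engine', false, false).test(keywords));    // true
console.log(buildRegex('java', false, true).test(keywords));            // false
console.log(buildRegex('search python', true, false).test(keywords));   // false
```

With Whole Words enabled, "java" no longer matches "javascript" because there is no word boundary after "java" in that keyword.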

The code of doSearch() is as follows:

function doSearch() {
// Search for words. All matching words return an entry ('or'-type search)
//
// for 'and'-type:
//   prefix = "(?=[\s\S]*"  or "(?=[\s\S]*\b" to match exact words
//   suffix = ")"           or "\b)"
// then
//   text = prefix + text.replace(/ /g, suffix+prefix) + suffix;
// finally
//   var rg = new RegExp(text);
//

// step 1
    var text = (document.getElementById('theSearchString').value.toLowerCase());
// step 2
    var theDico = document.getElementsByClassName('BCdico');
// step 3
    var and = document.getElementById("ra").checked;
    var allWord = document.getElementById("ww").checked;
    // Following 2 variables are used to beautify the results...
    var o = 0;                           // results are presented in rows
    var cc = ['#e5ffe5', '#efffef'];     // with alternating background colours
    //
// step 4
    // Hide previous results
    hideResults();
    //
    text = text.trim();
    //
// step 5
    if (text == '') {
        return 0;
    } else if (text == '*') {
// step 6
        for (var i=0; i<theDico.length; i++) {
            theDico[i].parentElement.style.display = "block";
            theDico[i].parentElement.style.backgroundColor = cc[o];
            o = (o+1) % 2;
        }
    } else {
// step 7
        // Show results, if any
// step 7a
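        text = myReplace(text);
        text = toSpace(text);
        text = text.replace(/ +/g, ' ').trim();   // collapse repeated spaces so that no empty alternative ends up in the regex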
        var dicP = []; 
        var dicM = [];
// step 7b
        if ((' ' + text).indexOf(' +') > -1) {   // the added leading space also catches a prefix on the first word
            [dicP, text] = bgetOrder(text,'+');
        }
        if ((' ' + text).indexOf(' -') > -1) {
            [dicM, text] = bgetOrder(text,'-');
        }
// step 7c
        if (and) {
            var prefix = "(?=[\\s\\S]*";
            var suffix = ")";
            if (allWord) {
                prefix += "\\b";
                suffix = "\\b" + suffix;
            }
            text = prefix + text.replace(/ /g, suffix+prefix) + suffix;
            var rg = new RegExp(text);
        } else {
            if (allWord) {
                text = "\\b" + text.replace(/ /g,"\\b|\\b") + "\\b";
                var rg = new RegExp("("+text+")");
            } else {
                var rg = new RegExp("("+text.replace(/ /g,"|")+")");
            }
        }
// step 7e
        for (var i=0; i<theDico.length; i++) {
            var txt = theDico[i].innerHTML;
            if (txt.search(rg) > -1) {
                var b = true;
// step 7e1
                if (dicM.length > 0) {
                    // do not show text containing unwanted word(s)
                    for (var j=0; j<dicM.length; j++) {
                        if (txt.indexOf(dicM[j]) > -1) {
                            b = false;
                            break;
                        }
                    }
                }
                if (dicP.length > 0) {
                    // do not show text that does not contain wanted word(s)
                    for (var j=0; j<dicP.length; j++) {
                        if (txt.indexOf(dicP[j]) == -1) {
                            b = false;
                            break;
                        }
                    }
                }
// step 7e2
                if (b) {
                    theDico[i].parentElement.style.display = "block";
                    theDico[i].parentElement.style.backgroundColor = cc[o];
                    o = (o+1) % 2;
                }
            }
        }
    }
}
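One caveat: doSearch() passes the search words to new RegExp() unchanged, so characters that a regex treats as operators can change the meaning of the search or make the constructor throw a SyntaxError (searching for c++ is one example). A possible guard, sketched here with a hypothetical helper named escapeRegExp that is not part of the original page, is to escape each word before the pattern is assembled:

```javascript
// Hypothetical helper: escape the characters that are regex operators.
function escapeRegExp(word) {
    return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var raw = 'c++';
var threw = false;
try {
    new RegExp('(' + raw + ')');        // "c++" is not a valid pattern
} catch (e) {
    threw = true;
}
console.log(threw);                      // true

var safe = new RegExp('(' + escapeRegExp(raw) + ')');
console.log(safe.test('c++ programming guide'));   // true
```

In doSearch() such escaping would have to happen after the + and - prefixes have been processed, since + is itself one of the escaped characters.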

Example

The example is built from the "Sample Annotated Bibliography" found here (Ashford University). The bibliographic records have been converted into the html structure shown at the beginning of this page, and keywords have been automatically generated for each bibliographic entry using the Python module topia.termextract.

The example is here.

A note on performance

The search engine is used on a page containing several thousand records. Page loading time is quite fast on the intranet, and search results are presented nearly instantly when the +/- options are not used. Using these options slows the search engine down a bit, but reasonable speed is still achieved on those thousands of records.

Upscaling to many tens of thousands of records has not been tested. The assumption is that a decrease in performance is inevitable due to the non-optimal code used!

Anyways, we hope this little bit of code is useful for someone out there...

Licence

The Javascript code and css/html structure (together, the code) are free software: you can redistribute the code and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

The code is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The GNU General Public License is available here.

Published date: 11 Mar 2019.