Complexity of Indexing Web-Pages


While the crawler creates the domain-specific Web-page repository, we extract the Ontology term details, such as the “Term Relevance Value”, from each Web-page’s content and store them against the page’s P_ID. Because of this, we can find the Dominating (D) and Sub-dominating (SD) Ontology terms of a Web-page by executing a simple SQL query, which we treat as a constant-time operation. Suppose we maintain “k” Ontology terms in our analysis and assume that all of them are kept in sorted order. Using the binary search algorithm, we can then find the attachment positions of D and the SDs. Each binary search takes at most about log2 k comparisons, so finding the attachment positions of one D and four SDs requires about 5 log2 k comparisons. We also assume that attaching a Web-page to its primary and secondary attachments takes constant time.
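The lookup described above can be sketched in Python as follows. This is a minimal illustration, not the prototype’s code: the function names, the example terms, and the choice to sort the terms alphabetically are all assumptions made for the example.

```python
from bisect import bisect_left


def find_position(sorted_terms, term):
    """Binary-search a sorted list of Ontology terms.

    Returns the index of `term`, or -1 if it is absent.
    Takes O(log2 k) comparisons for k terms.
    """
    i = bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return i
    return -1


def locate_attachments(sorted_terms, dominating, sub_dominating):
    """Locate one D and up to four SD terms: five binary searches in total."""
    positions = {"D": find_position(sorted_terms, dominating)}
    for n, term in enumerate(sub_dominating, start=1):
        positions[f"SD{n}"] = find_position(sorted_terms, term)
    return positions


# Hypothetical Ontology terms, sorted alphabetically (an assumption).
terms = sorted(["camera", "laptop", "mobile", "printer",
                "router", "tablet", "television", "watch"])

print(locate_attachments(terms, "laptop",
                         ["camera", "mobile", "printer", "tablet"]))
```

Each call to `find_position` is one binary search, so locating one D and four SDs costs five searches, which matches the 5 log2 k estimate.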

User Interface

In our proposed search engine, Web-searchers can customize their search results by selecting all the inputs. We use dropdown lists for selecting the dominating and sub-dominating Ontology terms. A Web-searcher can obtain relevant search results from our proposed search engine without prior domain knowledge, because all the Ontology terms are already available in the dropdown lists. After providing all the inputs, i.e. search tokens, relevance range and number of search results, the Web-searcher clicks the Search button to get the search results. In the user interface of our prototype, “*” denotes mandatory fields.

The “Number of Search Results” field lets the Web-searcher limit the number of results produced. For example, suppose 100 search results are produced for the user-given search tokens and relevance range, but the user wants only 20; the user then enters 20 in the “Number of Search Results” field. Displaying 20 result links takes less time than displaying 100. In the user interface, the maximum and minimum relevance values are set dynamically based on the underlying data or query.

Web-Page Retrieval Mechanism Based on the User Input

Web-page retrieval from the search engine’s resources is a central function of a Web search engine. We retrieve the resultant Web-page list from our stored data on the basis of the user-given dominating and sub-dominating Ontology terms, relevance range, etc. Most existing search engines parse the search string and then retrieve Web-pages based on the parsed tokens. In our prototype, users do not type a search string; instead, they select the search tokens directly from the drop-down lists.

As a result, the prototype avoids the search string parsing time and reduces the miss-hit ratio caused by a user’s inadequate domain knowledge. As discussed in Sect. 3.4, a user can select only one dominating and four sub-dominating Ontology terms at a time. Our prototype uses the following formula to produce the resultant Web-page list within the user-given relevance range; the list is the sum of:

  • 50% of “x” from the primary attachment list of the dominating Ontology term
  • 20% of “x” from the secondary attachment list of the first sub-dominating Ontology term
  • 15% of “x” from the secondary attachment list of the second sub-dominating Ontology term
  • 10% of “x” from the secondary attachment list of the third sub-dominating Ontology term
  • 5% of “x” from the secondary attachment list of the fourth sub-dominating Ontology term

Here, “x” denotes the “Number of Search Results” entered in the user interface.
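The allocation formula above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the (URL, relevance) tuple layout, and the rounding rule (largest remainder, ties going to the earlier list) are hypothetical, since the text does not say how fractional quotas, short lists or duplicate pages are handled.

```python
# Percentage shares for D, SD1, SD2, SD3 and SD4, taken from the formula.
SHARES = [50, 20, 15, 10, 5]


def quotas(x):
    """Split x results into five per-list quotas that sum to x.

    Assumption: fractional quotas are rounded with the largest-remainder
    method, with ties going to the earlier (higher-priority) list.
    """
    base = [x * s // 100 for s in SHARES]
    remainders = [x * s % 100 for s in SHARES]
    leftover = x - sum(base)
    order = sorted(range(len(SHARES)), key=lambda i: (-remainders[i], i))
    for i in order[:leftover]:
        base[i] += 1
    return base


def build_result_list(attachment_lists, x, min_rel, max_rel):
    """Take each list's quota of pages whose relevance lies in the range.

    `attachment_lists` holds five lists of (url, relevance) tuples: the
    primary list of D followed by the secondary lists of SD1..SD4.
    """
    results = []
    for pages, quota in zip(attachment_lists, quotas(x)):
        in_range = [p for p in pages if min_rel <= p[1] <= max_rel]
        results.extend(in_range[:quota])
    return results


print(quotas(20))   # quotas for x = 20
print(quotas(100))  # quotas for x = 100
```

For x = 20 the quotas are 10, 4, 3, 2 and 1, so no rounding is needed; the rounding rule only matters when x is not a multiple of 20.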


Experiment Procedure

The performance of our system depends on various parameters, which need to be set before running the system. Parameters such as the domain relevance limit, weight value assignment, Ontology terms, domain-specific Web-page repository, etc., are treated as inputs in our analysis. These input parameters were chosen by tuning our system through experiments. We created the domain-specific Web-page repository by supplying 50 seed URLs as input to our domain-specific Web search crawler.

Time Complexity to Produce Resultant Web-Page List

We consider “k” Ontology terms, kept in sorted order according to their weight value. To find the primary attachment link of the user-given dominating Ontology term, our prototype requires at most O(log2 k) time by applying binary search (see Fig. 2). Similarly, finding the secondary attachment links of the four user-given sub-dominating Ontology terms requires 4 O(log2 k) time. In the second level, our prototype reaches the Web-pages from the primary and secondary attachments in constant time, because no iteration is required. The overall time complexity of retrieving the resultant Web-page list is therefore 5 O(log2 k) + 5c = O(log2 k), where “c” is the constant time required to reach the Web-pages from a primary or secondary attachment.

Experimental Result

It is very difficult to compare our search results with those of existing search engines. In most cases, existing search engines do not hold domain-specific concepts. A fair comparison of two systems requires both to be on the same footing, i.e. the same resources, environment, system platform and search query. In the few cases where an existing search engine offers an advanced search option to Web-searchers, its domains did not match ours. However, we have produced some data to measure the performance of our proposed prototype. To produce the experimental results, we compared the performance of the system before and after applying the Web-page indexing mechanism.

All the Web-pages are indexed according to their dominating and sub-dominating Ontology terms. In our prototype, users do not type a search string; instead, they select the search tokens directly from the dropdown lists. As a result, the prototype avoids the search string parsing time and reduces the miss-hit ratio caused by a user’s inadequate domain knowledge. The prototype is also designed to scale to additional domains: to support a new domain, we need to include the new domain Ontology and its other details, such as the weight table, syntable, etc. A single domain does not contain a huge number of Ontology terms. Hence, the number of indexes should be smaller than in a general search engine. As a result, we can reach the Web-pages quickly while also reducing the index storage cost. Further, our experimental analysis suggests that the Web-page indexing mechanism produces faster results for the user-selected dominating and sub-dominating Ontology terms. In the next post, we present the detailed architecture of an Ontology-based domain-specific Web search engine for commonly used products using RDF.

