1.1 Introduction: The World Wide Web (WWW) [1] is a huge source of interlinked documents and forms a very useful information resource. The success of the WWW is largely due to its decentralized design [2], in which information is hosted by many servers and a document can point to other documents regardless of their geographic location. Information retrieval [3, 4] is the technique of searching for information about a subject across an enormous number of resources so as to meet the user's information need. Information retrieval can be precisely defined as:

“Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)” [4].

The WWW has revolutionized the availability of data, but because of its current structure [5] it is getting difficult to access relevant information in such a large collection. The Web has grown enormously, and with this volume of available information it is becoming hard to locate what is useful [2, 6]; retrieving relevant information from the WWW has become an unprecedentedly difficult task.

For such a large collection, search engines [7] have emerged as an important tool for finding relevant information. A user searches by submitting a query, usually in the form of keywords, and receives the required information in response. A search engine is thus an information retrieval system that matches the query keywords against its collection and returns a set of web pages ranked by relevance.

1.2 Search Engines:

A search engine is a tool used to retrieve information stored on the WWW. A typical search engine has the following main components:

1.2.1 Crawling: Crawling is the first stage, in which documents are downloaded from the web based on URLs received from the URL frontier queue [8]. Fetched web pages are parsed so that their links can be extracted. After passing a series of tests for duplicate content and duplicate-URL elimination, the extracted links are sent back to the URL frontier queue so that the pages they point to can be fetched in turn.
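To make this fetch-parse-extract cycle concrete, the following is a minimal sketch of a crawler loop in Python. The seed URLs are hypothetical, a plain set stands in for real duplicate-URL elimination, the requests and BeautifulSoup libraries are assumed to be available, and production concerns such as politeness delays, robots.txt, and content fingerprinting are omitted.

    # Minimal sketch of the crawl loop described above (illustrative only).
    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)        # the URL frontier queue
        seen = set(seed_urls)              # simple duplicate-URL elimination
        pages = {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                response = requests.get(url, timeout=5)
            except requests.RequestException:
                continue                   # skip unreachable pages
            pages[url] = response.text     # the downloaded document
            soup = BeautifulSoup(response.text, "html.parser")  # parse the page
            for anchor in soup.find_all("a", href=True):        # extract links
                link = urljoin(url, anchor["href"])
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)  # schedule the new URL for fetching
        return pages

    # crawl(["https://example.com"]) would fetch up to 100 reachable pages.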

1.2.2 Indexing: The crawled web pages are then indexed by the indexer module. The major steps in index construction are tokenization and linguistic pre-processing such as handling hyphenation, stop-word removal, stemming, lemmatization, and normalization [4]. The resulting terms are sorted and maintained as posting lists, which record for each term the documents it occurs in and its frequency in each. Different types of indexes are constructed depending on the type of content: a text index, a structure index, and a utility index [7].
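As a rough illustration of posting-list construction, the sketch below assumes whitespace tokenization, lowercase normalization, and a tiny hand-picked stop-word list; stemming and lemmatization are omitted for brevity, and the sample documents are invented.

    # Sketch of inverted-index (posting-list) construction as described above.
    from collections import defaultdict

    STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is"}  # simplified assumption

    def build_index(docs):
        """docs: doc_id -> text. Returns term -> {doc_id: term frequency}."""
        index = defaultdict(lambda: defaultdict(int))
        for doc_id, text in docs.items():
            for token in text.lower().split():   # tokenization + normalization
                if token not in STOP_WORDS:      # stop-word removal
                    index[token][doc_id] += 1    # posting: document and frequency
        return index

    docs = {1: "the jaguar is a big cat", 2: "the jaguar car"}
    index = build_index(docs)
    # index["jaguar"] -> {1: 1, 2: 1}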

1.2.3 Searching: Query terms entered by the user are compared against the index: the terms of the query are matched with the terms in the index structure, and the documents containing the matching terms are returned to the user as results.
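Continuing the toy index above, query processing can be sketched as a boolean-AND intersection of posting lists; this is an illustrative simplification, not the matching strategy of any particular engine.

    # Boolean-AND retrieval over the toy index built above (illustrative only).
    def search(index, query):
        terms = [t for t in query.lower().split() if t in index]
        if not terms:
            return set()                 # no query term occurs in the index
        result = set(index[terms[0]])
        for term in terms[1:]:
            result &= set(index[term])   # intersect the posting lists
        return result

    # search(index, "jaguar car") -> {2}: only document 2 contains both terms.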

1.2.4 Ranking: The web pages returned after matching a query are ranked based on various factors. The most widely used ranking algorithms are PageRank and Hyperlink-Induced Topic Search (HITS).
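As a sketch of link-based ranking, the following implements the power-iteration form of PageRank over a small made-up link graph. The damping factor of 0.85 is the value commonly quoted for PageRank; dangling pages are ignored for simplicity.

    # Power-iteration PageRank over a tiny, invented link graph (illustrative only).
    def pagerank(links, damping=0.85, iterations=50):
        """links: page -> list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if outlinks:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share   # pass rank along each outlink
            rank = new_rank
        return rank

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    # pagerank(links): C, linked to by both A and B, accumulates the highest rank.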

Search engines such as Google and Yahoo match the query keywords against web pages containing those keywords, producing a result set that mixes relevant and irrelevant pages. Retrieving the relevant information from everything available is therefore an important research issue for search engines.

1.3 Limitations of Traditional Search Engines:

Major search engines such as Google and Yahoo work on keyword-based matching [9]. It is left to the user to extract the relevant information from a large set of results, which proves to be a very tedious task. Search engines based on keyword matching have several associated problems [10, 11, 12, 13, 14], as listed below:

1. High recall, low precision:

The returned results often exhibit high recall but low precision: the engine retrieves most of the relevant pages from its repository (high recall), but buries them among many results that are not actually relevant (low precision). Even when the main relevant pages are retrieved, they are of little use if large numbers of mildly relevant or irrelevant documents are retrieved alongside them, as the short computation below illustrates.
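A minimal computation of precision and recall over a hypothetical result set (all document IDs and counts are invented):

    # Precision and recall for a hypothetical result set (numbers invented).
    relevant = {1, 2, 3, 4}                      # documents truly relevant to the need
    retrieved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}  # documents the engine returned

    hits = relevant & retrieved
    precision = len(hits) / len(retrieved)  # 4/10 = 0.4: many returned pages irrelevant
    recall = len(hits) / len(relevant)      # 4/4  = 1.0: every relevant page was found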

2. Low or no recall:

It often happens that users get no relevant answer to a request, or that important, relevant pages are simply not retrieved.

3. Lack of machine understandability:

Machines are unable to understand the provided information because there is no universal format for it [15]. The information resides in free-format HTML web pages, which are well suited for direct human consumption but not for automated information exchange, retrieval, and processing by software agents (machines). Current web content is mostly represented in HTML, which is primarily a presentation language and hence does little for machine interpretability.

4. Poor content aggregation:

For a given query the results are many separate documents or web pages; the user must manually aggregate the partial information from each of them to obtain the complete answer.

5. No semantics:

Results are based simply on keyword matches within documents; there is no concept-based matching of the query against the documents. The results may therefore be irrelevant to the semantics of the user's query.

6. Difficulty in handling queries with ambiguous terms:

Current search engines match the query keywords against the keywords present in documents. For example, the query “jaguar” has two different meanings, the car and the animal, so it produces results for both kinds of documents, leading to low precision. Conversely, the queries “holiday” and “vacation” refer to the same concept, yet entered separately they produce different result sets, as the toy demonstration below shows.
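A toy demonstration of both problems, using invented documents: pure keyword matching conflates the two senses of “jaguar” while splitting the synonyms “holiday” and “vacation”.

    # Toy illustration: keyword matching neither separates senses nor merges synonyms.
    docs = {
        1: "the jaguar is a large cat native to the americas",
        2: "the new jaguar model has a powerful engine",
        3: "bank holiday travel offers",
        4: "cheap family vacation packages",
    }

    def keyword_match(query, docs):
        return {d for d, text in docs.items() if query.lower() in text}

    keyword_match("jaguar", docs)    # -> {1, 2}: car and animal mixed, low precision
    keyword_match("holiday", docs)   # -> {3}
    keyword_match("vacation", docs)  # -> {4}: same intent, disjoint results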

The limitations above show that merely matching keywords does not make for effective searching; it produces many imprecise results. Efficient searching requires the machine to understand the semantics of the information. This notion of machine understandability is what can move the WWW from the syntactic web [16] to the Semantic Web [16, 17].

