Information Retrieval Assignment Sample

Table of Contents

Introduction
Data Set collection of Covid-19
Running instruction of Elastic Search
Indexing

Introduction

Information retrieval (IR) is the process of searching for and obtaining recorded data and relevant information from a database or file. It works by locating the data and delivering it to the user, often over a network. Google Search is a well-known example of an information retrieval system. The main aim of IR is to find the documents that match the user's requirement (Falcão et al. 2022). IR models are commonly grouped into classical, non-classical, and alternative models. Classical models are simple and easy to apply because they rest on well-understood mathematics; the Boolean model, one of the oldest IR models, belongs to this group and is based on Boolean algebra and set theory (Nurrokhim et al. 2019). Non-classical models take the opposite approach and rely on principles other than Boolean operations, similarity, and probability. Alternative models extend the classical models with specific techniques borrowed from other fields. The main advantages of an information retrieval system are that it saves time by returning search results quickly and that it is easy to use.

Here, Elasticsearch is used to retrieve the Covid-19 data. It is free to use, and its search feature handles textual, numerical, structured, and unstructured data. It is widely used because of its speed and reliability.

Data Set collection of Covid-19

The Covid-19 data was collected from Kaggle, a public platform that hosts data sets free of charge. The data set is provided to the global research community (Priyadarshini et al. 2021) so that recent advances in natural language processing and other AI techniques can be applied to generate new ideas for the fight against Covid-19. It comprises about 500,000 articles and almost 200,000 full texts.

Running instruction of Elastic Search

Indexing

Indexing is the process of organizing emails, files, records, and other information so that they can be found. Once an index exists, looking up information is easier and faster. Generally, the properties of each file are indexed, including file names and file paths (Mizzaro and Scagnetto 2019). In a database, the indexed data is stored in well-organized rows and columns. Here, indexing helped to find the data related to Covid-19 cases. The indexing process begins with natural language processing (NLP) of the text.
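
The idea behind a search index can be illustrated with a small inverted index, which maps each term to the documents that contain it. This is only a minimal sketch in plain Python, not how Elasticsearch is implemented internally; the function name `build_inverted_index` and the sample records are hypothetical and are not taken from the Covid-19 data set.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace split; a real analyzer does far more.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample records used only for illustration.
sample_docs = {
    1: "covid vaccine trial results",
    2: "covid transmission in schools",
    3: "influenza vaccine safety",
}

inverted_index = build_inverted_index(sample_docs)
print(inverted_index["covid"])
print(inverted_index["vaccine"])
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is why an index makes searching faster.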

Tokenization

Tokenization is the process of splitting a string into individual words or sentences, which are called tokens. It is a foundational step in natural language processing (NLP). It is done so that the meaning of a given text can be interpreted by analyzing the words it contains. When a large amount of data is present, tokens make searching more efficient. Tokenization can be applied not only to English but also to other languages (Mizzaro and Scagnetto 2019). Words are split at spaces and punctuation.
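
Splitting on spaces and punctuation can be sketched with a regular expression. The pattern below is an assumption chosen for illustration (it keeps hyphenated and apostrophe forms such as "covid-19" together) and is not the tokenizer that Elasticsearch uses; the name `tokenize` is hypothetical.

```python
import re

# One or more word characters, optionally joined by hyphens or apostrophes.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*")


def tokenize(text):
    """Split text into lowercase tokens at spaces and punctuation."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


tokens = tokenize("Covid-19 spreads quickly; it's airborne.")
print(tokens)
```

Because `\w` matches Unicode word characters in Python 3, the same pattern also works for other languages that separate words with spaces.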

Normalization

Normalization is the process of reorganizing data in a database. It is done so that the data is stored across more than one table rather than in a single one. Its benefits include reduced redundant data, higher database security, quicker and better execution, better overall database organization, more flexible data design, and consistency within the database.

Selecting Keywords

After tokenization and normalization, keywords are selected. These are the important words and phrases that are searched for in the document during further analysis. Keywords are extracted in the following steps. First, the data set is loaded and a text field is chosen for analysis. Then a stop-word list is created. After that, the text is pre-processed into clean, normalized data (Billen et al. 2019). Then the most frequently used keywords are extracted. Finally, the list of TF-IDF terms is extracted. As part of the extraction process, n-grams are created.
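
The keyword-selection steps can be sketched in plain Python. This is a minimal illustration under stated assumptions: the stop-word list, the sample texts, and the helper names (`preprocess`, `ngrams`, `tf_idf`) are hypothetical, and the TF-IDF weighting shown (relative term frequency times the natural log of N over document frequency) is one common variant, not necessarily the one used in the project.

```python
import math
from collections import Counter

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "for", "to"}


def preprocess(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Return one dict per document mapping each term to its TF-IDF score."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample texts used only for illustration.
docs = [
    "the spread of covid in schools",
    "covid vaccine trial",
    "safety of the vaccine trial",
]

most_common = Counter(t for d in docs for t in preprocess(d)).most_common(3)
bigrams = ngrams(preprocess(docs[1]), 2)
scores = tf_idf(docs)
```

In this sample, "spread" scores higher than "covid" in the first text because "covid" also appears in another text and is therefore less distinctive.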

Morphological Analysis or stemming

Morphological analysis, or stemming, reduces a keyword to its base word by removing the prefixes and suffixes attached to it, leaving the root word, known as the “lemma”. The method of transforming a word into its lemma is known as “lemmatization”.
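
The difference between the two techniques can be sketched as follows. This is a toy example: the suffix list and the lemma table are hypothetical and deliberately tiny, and the stemmer is a naive suffix stripper, not the Porter algorithm or any stemmer shipped with Elasticsearch.

```python
# Suffixes are checked in this order, so "es" is tried before "s".
SUFFIXES = ("ing", "ed", "es", "s")

# Tiny illustrative lemma table; real lemmatizers use full dictionaries.
LEMMAS = {"mice": "mouse", "was": "be", "better": "good"}


def simple_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


print(simple_stem("searching"), simple_stem("indexed"), lemmatize("mice"))
```

Stemming works by rule and can produce non-words, while lemmatization relies on a vocabulary and returns a dictionary form.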

Searching

Searching is the last NLP step applied to a text document; here it is carried out with Elasticsearch. Search engines are the best-known way of finding required information. There are two types of search: keyword search and semantic search. Keyword search uses the “lexical method” and looks for the exact “query words”. Semantic search, on the other hand, simplifies query building by using “latent semantic indexing”.
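
Keyword (lexical) search can be sketched as counting how many query words appear in each document. This is a simplified illustration, not the relevance scoring that Elasticsearch applies; the function name `keyword_search` and the sample records are hypothetical.

```python
def keyword_search(docs, query):
    """Return (doc_id, matched_terms) pairs, best match first.

    A document matches when it shares at least one exact word with the query.
    """
    query_terms = set(query.lower().split())
    results = []
    for doc_id, text in docs.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap:
            results.append((doc_id, overlap))
    # Sort by number of matched terms (descending), then by id for stable output.
    return sorted(results, key=lambda item: (-item[1], item[0]))


# Hypothetical sample records used only for illustration.
search_docs = {
    1: "covid vaccine trial results",
    2: "covid transmission in schools",
    3: "influenza vaccine safety",
}

hits = keyword_search(search_docs, "covid vaccine")
print(hits)
```

A purely lexical search like this misses documents that use different words for the same idea, which is the gap that semantic search is meant to close.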

Implementation

Precision Recall
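
Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal sketch of both measures follows; the document ids are hypothetical and are not results from this project.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved and relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 documents actually relevant.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision, recall)
```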

Conclusion

The project was completed successfully using an open-source data set that is freely available on the Kaggle platform. The data set was prepared from a selected sample of articles, and four columns were chosen: “title”, “abstract”, “journalist”, and “author”.

After Kibana and Elasticsearch were installed, the tasks performed included cleaning and processing the data through steps such as tokenization, stemming, lemmatization, and identifying phrases and keywords.

Reference

Banawan, K. and Ulukus, S. 2018, Noisy Private Information Retrieval: On Separability of Channel Coding and Information Retrieval, Cornell University Library, arXiv.org, Ithaca.

Barakhnin, V.B., Duisenbayeva, A.N., Kozhemyakina, O.Y., Yergaliyev, Y.N. and Muhamedyev, R.I. 2018, "The automatic processing of the texts in natural language. Some bibliometric indicators of the current state of this research area", Journal of Physics: Conference Series, vol. 1117, no. 1.

Cedeno-Moreno, D. and Vargas-Lombardo, M. 2018, "An Ontology-Based Knowledge Methodology in the Medical Domain in the Latin America: the Study Case of Republic of Panama", Acta Informatica Medica, vol. 26, no. 2, pp. 98-101.

Fourie, I. 2019, "Naresh Agarwal. Synthesis Lectures on Information Concepts, Retrieval, and Services. San Rafael, CA: Morgan and Claypool, 2018. 200 pp. $14.99 (e-book). (ISBN 9781681730820)", Journal of the Association for Information Science and Technology, vol. 70, no. 3, pp. 301-303.

G-A Nys, J-P Kasprzyk, Hallot, P. and Billen, R. 2019, A Semantic Retrieval System In Remote Sensing Web Platforms, Copernicus GmbH, Gottingen.

Haroon, M. 2018, "Comparative Analysis of Stemming Algorithms for Web Text Mining", International Journal of Modern Education and Computer Science, vol. 11, no. 9, pp. 20.

Jackson, R., Kartoglu, I., Stringer, C., Gorrell, G., Roberts, A., Song, X., Wu, H., Agrawal, A., Lui, K., Groza, T., Lewsley, D., Northwood, D., Folarin, A., Stewart, R. and Dobson, R. 2018, "CogStack - experiences of deploying integrated information retrieval and extraction services in a large National Health Service Foundation Trust hospital", BMC Medical Informatics and Decision Making, vol. 18.

Jin, N. 2022, "Natural Language Processing Technology Used in Artificial Intelligence Scene of Law for Human Behavior", Wireless Communications and Mobile Computing (Online), vol. 2022.

Kumar, G., Basri, S., Abdullahi, A.I., Sunder, A.K., Capretz, L.F. and Abdullateef, O.B. 2021, "Data Harmonization for Heterogeneous Datasets: A Systematic Literature Review", Applied Sciences, vol. 11, no. 17, pp. 8275.

Lara-Clares, A., Lastra-Díaz, J.J. and Garcia-Serrano, A. 2021, "Protocol for a reproducible experimental survey on biomedical sentence similarity", PLoS One, vol. 16, no. 3.

Liu, Y., Liu, Q., Han, C., Zhang, X. and Wang, X. 2019, "The implementation of natural language processing to extract index lesions from breast magnetic resonance imaging reports", BMC Medical Informatics and Decision Making, vol. 19, pp. 1-10.

Locatelli, M., Seghezzi, E., Pellegrini, L., Tagliabue, L.C. and Giuseppe Martino, D.G. 2021, "Exploring Natural Language Processing in Construction and Integration with Building Information Modeling: A Scientometric Analysis", Buildings, vol. 11, no. 12, pp. 583.

Luander Cipriano de Jesus Falcão, Lopes, B. and Renato, R.S. 2022, "Absorção das tarefas de processamento de Linguagem Natural (NLP) pela Ciência da Informação (CI): uma revisão da literatura para tangibilização do uso de NLP pela CI", Em Questão, vol. 28, no. 1, pp. 13-34.

Mizzaro, S. and Scagnetto, I. 2019, "Mobile Search Behaviors: An In-Depth Analysis Based on Contexts, APPs, and Devices. Dan Wu and Shaobo Liang. Synthesis Lectures on Information Concepts, Retrieval, and Services. San Rafael, CA: Morgan and Claypool, 2018. 159 pp. $51.96 (e-book). (ISBN 9781681733005)", Journal of the Association for Information Science and Technology, vol. 70, no. 11, pp. 1290-1292.

Nurrokhim, M.F., Riza, L.S. and Rasim 2019, "Generating mind map from an article using machine learning", Journal of Physics: Conference Series, vol. 1280, no. 3.

Priyadarshini, R., Anuratha, K., Rajendran, N., Jeyanthi, S. and Sujeetha, S. 2021, "LeDoCl : A Semantic Model for Legal Documents Classification using Ensemble Methods", Turkish Journal of Computer and Mathematics Education, vol. 12, no. 9, pp. 1899-1908.