The paper “NILINKER: Attention-based approach to NIL Entity Linking” has been published in the Journal of Biomedical Informatics (h-index 112, Scimago Q1 in Computer Science Applications and Q1 in Health Informatics). The authors are LASIGE’s PhD student Pedro Ruas and LASIGE’s integrated researcher Francisco M. Couto. The paper is available here.
The work proposes a new framework for handling NIL entities in the biomedical domain, which supports training and applying an attention-based neural network. The resulting models can extend existing incomplete repositories with updated entries.
The goal of named entity linking is to automatically associate entities recognized in text with entries in target repositories (knowledge bases and graphs, ontologies, vocabularies, etc.) that describe their meaning. However, existing approaches perform poorly on NIL entities, that is, entities for which the repository has no appropriate entry. This is particularly evident in domains where the knowledge expressed in text evolves constantly. One striking example is the scientific literature on COVID-19: before 2020, COVID-19-related entities were nonexistent in the literature; from 2020 onwards, mentions of such entities in scientific articles increased suddenly. Repositories that store structured knowledge cannot keep up to date when the volume of text is large, because updating them requires costly manual effort. Existing named entity linking models do not help with this task because they can only handle entities that have a corresponding entry in the target knowledge base.
To bridge this gap, the paper proposes the new designation “NIL entity linking” for the task of partially linking NIL entities to available repository entries. For instance, if a given repository does not include the entry “COVID-19”, the entity “COVID-19” can be associated with related entries, such as “infectious disease” or “pneumonia”, and even be included in the structure of the repository.
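To make the distinction concrete, the following is a minimal Python sketch that contrasts conventional linking with partial linking of a NIL entity. It is only an illustration under stated assumptions, not the attention-based model from the paper: the repository `TOY_KB`, its identifiers, the helper names `link_exact` and `link_nil`, and the word-overlap scoring are all hypothetical stand-ins chosen for brevity.

```python
from typing import Dict, List, Optional

# Toy repository with hypothetical identifiers; not taken from any real knowledge base.
TOY_KB: Dict[str, str] = {
    "KB:001": "infectious disease",
    "KB:002": "pneumonia",
    "KB:003": "diabetes mellitus",
}


def link_exact(mention: str, kb: Dict[str, str]) -> Optional[str]:
    """Conventional linking: return the matching entry, or None for a NIL entity."""
    for entry_id, name in kb.items():
        if name.lower() == mention.lower():
            return entry_id
    return None


def link_nil(context: str, kb: Dict[str, str], top_k: int = 2) -> List[str]:
    """Toy partial linking: rank entries by the share of their name words
    that occur in the sentence around the mention (a stand-in for a learned model)."""
    cleaned = context.lower().replace(".", " ").replace(",", " ")
    context_words = set(cleaned.split())
    scored = []
    for entry_id, name in kb.items():
        name_words = name.lower().split()
        overlap = sum(word in context_words for word in name_words)
        if overlap > 0:
            scored.append((overlap / len(name_words), entry_id))
    # Highest score first; ties broken by identifier for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [entry_id for _, entry_id in scored[:top_k]]


context = "COVID-19 is an infectious disease that can cause pneumonia."
print(link_exact("COVID-19", TOY_KB))  # None: the toy repository has no such entry
print(link_nil(context, TOY_KB))       # ['KB:001', 'KB:002']
```

In this sketch the mention “COVID-19” has no entry of its own, so conventional linking returns nothing, while the partial linker still associates it with the related entries for “infectious disease” and “pneumonia”.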
The work also introduces a new dataset, EvaNIL, for training and evaluating models that perform this task. In addition, six models trained on EvaNIL have been made publicly available. These models can link biomedical NIL entities, such as chemicals, diseases, phenotypes, biological processes, and anatomical parts, to entries of several knowledge bases (CTD-Chemical, MEDIC, Human Phenotype Ontology, CTD-Anatomy, Gene Ontology-Biological Process).