Our researcher was selected to present at an ACM conference

One of our members, Sitong Chen, had a paper accepted and published by the ACM (Association for Computing Machinery). His project draws on NLP and text mining. The paper is titled “Active High-Recall Information Retrieval from Domain-Specific Text Corpora based on Query Documents” [1]. The system was used to classify essays and assign each one a label based on its abstract section.

Abstract

In this paper, we propose a high recall active document retrieval system for a class of applications involving query documents, as opposed to key terms, and domain-specific document corpora. The output of the model is a list of documents retrieved based on the domain expert feedback collected during training. A modified version of Bag of Words (BoW) representation and a semantic ranking module, based on Google n-grams, are used in the model. The core of the system is a binary document classification model which is trained through a continuous active learning strategy. In general, finding or constructing training data for this type of problem is very difficult due to either confidentiality of the data, or the need for domain expert time to label data. Our experimental results on the retrieval of Call For Papers based on a manuscript demonstrate the efficacy of the system to address this application and its performance compared to other candidate models.
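
For readers unfamiliar with the idea, the loop below is a minimal sketch of continuous active learning around a binary document classifier. It is not the paper's implementation: it assumes a plain bag-of-words representation (not the modified BoW or the Google n-gram ranking module), a tiny hand-written logistic regression, and a hypothetical `oracle` function standing in for the domain expert. All function names and the sample documents are invented for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Plain bag-of-words token counts (an assumption, not the paper's modified BoW)."""
    return Counter(text.lower().split())


def score(weights, features):
    """Logistic relevance score in (0, 1) for a sparse feature vector."""
    z = sum(weights.get(tok, 0.0) * cnt for tok, cnt in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled, epochs=50, lr=0.5):
    """Fit a small logistic-regression classifier with stochastic gradient steps."""
    weights = {}
    for _ in range(epochs):
        for features, label in labeled:
            err = label - score(weights, features)
            for tok, cnt in features.items():
                weights[tok] = weights.get(tok, 0.0) + lr * err * cnt
    return weights


def continuous_active_learning(query_doc, corpus, oracle, rounds=3, batch=2):
    """Train, ask the oracle about the top-ranked unlabeled documents, retrain, repeat."""
    labeled = [(bow(query_doc), 1)]      # the query document seeds the positive class
    unlabeled = dict(enumerate(corpus))  # corpus index -> text not yet reviewed
    relevant = []
    for _ in range(rounds):
        if not unlabeled:
            break
        weights = train(labeled)
        ranked = sorted(unlabeled,
                        key=lambda i: score(weights, bow(unlabeled[i])),
                        reverse=True)
        for i in ranked[:batch]:
            text = unlabeled.pop(i)
            label = oracle(text)         # simulated expert feedback: 1 or 0
            labeled.append((bow(text), label))
            if label:
                relevant.append(i)
    return relevant


# Hypothetical toy data: a manuscript-like query and a small mixed corpus.
corpus = [
    "call for papers information retrieval text mining workshop",
    "cafeteria lunch menu for monday",
    "call for papers active learning document retrieval",
    "parking permit renewal notice",
    "football match schedule and tickets",
]
query = "active learning for document retrieval from text corpora"
found = continuous_active_learning(query, corpus,
                                   oracle=lambda t: int("call for papers" in t))
print(found)  # indices of corpus documents the simulated expert marked relevant
```

The point of the sketch is the feedback loop: each round spends the expert's limited labeling time on the documents the current classifier considers most likely to be relevant, which is one common way to pursue high recall when labeled training data is scarce.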

Reference

[1] Chen, S. (2018, August 22). Active High-Recall Information Retrieval from Domain-Specific Text Corpora based on Query Documents. Retrieved from https://dl.acm.org/citation.cfm?id=3209532