KDIR 2020 Abstracts


Full Papers
Paper Nr: 6
Title:

Conversation Management in Task-oriented Turkish Dialogue Agents with Dialogue Act Classification

Authors:

O. Fatih Kilic, Enes B. Dundar, Yusufcan Manav, Tolga Cekic and Onur Deniz

Abstract: We study the problem of dialogue act classification for use in the conversation management of goal-oriented dialogue systems. Online chat behavior in human-machine dialogue systems differs from human-human spoken conversations. To this end, we develop 9 dialogue act classes by observing real-life human conversations from a banking-domain Turkish dialogue agent. We then propose a dialogue policy based on these classes to correctly direct users to their goals in a chatbot-human support hybrid dialogue system. To train a dialogue act classifier, we annotate a corpus of human-machine dialogues consisting of 426 conversations and 5020 sentences. Using the annotated corpus, we train a self-attentive bi-directional LSTM dialogue act classifier, which achieves a weighted F1-score of 0.90 on sentence-level classification. We deploy the trained model in the conversation manager to maintain the designed dialogue policy.

Paper Nr: 7
Title:

Saudi Stock Market Sentiment Analysis using Twitter Data

Authors:

Amal Alazba, Nora Alturayeif, Nouf Alturaief and Zainab Alhathloul

Abstract: Sentiment analysis in the finance domain is widely applied by investors and researchers, but most existing work targets English text. In this work, we present a framework to analyze and visualize the sentiments of Arabic tweets related to the Saudi stock market using machine learning methods. For training and prediction, the Twitter API was used to collect off-line data, and Apache Kafka was used to stream tweets in real time. Experiments were conducted using five machine learning classifiers with different feature extraction methods, including word embeddings (word2vec) and traditional BoW methods. The highest accuracy for the sentiment classification of Arabic tweets was 79.08%, achieved with the SVM classifier combined with TF-IDF feature extraction. Finally, the predicted sentiments of the tweets, obtained with the best-performing classifier, were visualized with several techniques. We developed a website to visualize the off-line and streaming tweets in various ways: by sentiment, by stock sector, and by frequent terms.
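As an illustration of the TF-IDF feature extraction step that the best pipeline above relies on, here is a minimal pure-Python sketch (not the authors' implementation, which pairs TF-IDF with an SVM):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # document frequency: number of docs containing each term
    df = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: (c / len(d)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Toy tweets as token lists; terms in every document get weight 0
docs = [["market", "up"], ["market", "down"], ["up", "up", "strong"]]
w = tfidf(docs)
```

The resulting per-document weight dictionaries would then be fed as sparse feature vectors to a classifier such as an SVM.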

Paper Nr: 12
Title:

Improving Word Association Measures in Repetitive Corpora with Context Similarity Weighting

Authors:

Aleksi Sahala and Krister Lindén

Abstract: Although word association measures are useful for deciphering the semantic nuances of long-extinct languages, they are very sensitive to excessively formulaic narrative patterns and to the full or partial duplication caused by different copies, edits, or fragments of historical texts. This problem is apparent in the corpora of ancient Mesopotamian languages such as Sumerian and Akkadian. When word associations are measured, vocabulary from repetitive passages tends to dominate the top ranks and conceal more interesting and descriptive use of the language. We propose an algorithmic way to reduce the impact of repetitiveness by weighting the co-occurrence probabilities by a factor based on their contextual similarity. We demonstrate that the proposed approach not only effectively reduces the impact of distortion in repetitive corpora, but also slightly improves the performance of several PMI-based association measures in word relatedness tasks on non-repetitive corpora. Additionally, we propose a normalization for PMI2, a commonly used association measure, and show that the normalized variant can outperform the base measure in both repetitive and non-repetitive corpora.
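For intuition, plain PMI and the commonly used normalized PMI (NPMI) can be sketched from co-occurrence probabilities as below; the paper's PMI2 normalization and context-similarity weighting are its own contributions and are not reproduced here:

```python
import math

def pmi(pxy, px, py):
    """Pointwise mutual information of a co-occurring word pair."""
    return math.log(pxy / (px * py))

def npmi(pxy, px, py):
    """PMI normalized to [-1, 1]: a pair that always co-occurs scores 1."""
    return pmi(pxy, px, py) / -math.log(pxy)

npmi(0.1, 0.1, 0.1)  # → 1.0, perfect association
```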

Paper Nr: 14
Title:

The Max-Cut Decision Tree: Improving on the Accuracy and Running Time of Decision Trees

Authors:

Jonathan Bodine and Dorit S. Hochbaum

Abstract: Decision trees are a widely used method for classification, both alone and as the building blocks of multiple different ensemble learning methods. The Max-Cut decision tree introduces novel modifications to a standard baseline classification decision tree, namely CART with the Gini criterion. One modification is an alternative splitting metric, Maximum Cut, which is based on maximizing the distance between all pairs of observations that belong to separate classes and fall on separate sides of the threshold value. The other modification is to select the decision feature from a linear combination of the input features constructed using Principal Component Analysis (PCA) locally at each node. Our experiments show that this node-based, localized PCA with the novel splitting modification can dramatically improve classification, while also significantly decreasing computational time compared to the baseline decision tree. Moreover, our results are most significant when evaluated on data sets with higher dimensionality or more classes. On the example data set CIFAR-100, the modifications enabled a 49% improvement in accuracy relative to CART Gini, while reducing CPU time by 94% for comparable implementations. These modifications dramatically advance the capabilities of decision trees for difficult classification tasks.
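Read literally from the abstract, the Maximum Cut splitting metric for a single feature can be sketched as follows. This is an illustrative, brute-force reading, not the authors' exact formulation or an efficient implementation:

```python
def max_cut_score(x, y, t):
    """Sum of |x_i - x_j| over pairs of observations that belong to
    different classes and fall on opposite sides of threshold t."""
    left = [(xi, yi) for xi, yi in zip(x, y) if xi <= t]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi > t]
    return sum(abs(a - b)
               for a, ya in left for b, yb in right if ya != yb)

def best_threshold(x, y):
    """Scan candidate thresholds at observed feature values."""
    return max(sorted(set(x))[:-1], key=lambda t: max_cut_score(x, y, t))
```

In the Max-Cut tree, the feature itself would additionally be a PCA combination computed locally at the node rather than a raw input feature.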

Paper Nr: 18
Title:

Large-scale Retrieval of Bayesian Machine Learning Models for Time Series Data via Gaussian Processes

Authors:

Fabian Berns and Christian Beecks

Abstract: Gaussian Process Models (GPMs) are widely regarded as a prominent tool for learning statistical data models that enable time series interpolation, regression, and classification. These models are frequently instantiated by a Gaussian Process with a zero-mean function and a radial basis covariance function. While these default instantiations yield acceptable analytical quality in terms of model accuracy, GPM retrieval algorithms automatically search for an application-specific model that fits a particular dataset. State-of-the-art methods for the automatic retrieval of GPMs search the space of possible models in a rather intricate way and thus incur super-quadratic computation time for model selection and evaluation. Since these properties only allow processing small datasets with low statistical versatility, we propose the Timeseries Automatic GPM Retrieval (TAGR) algorithm for the efficient retrieval of large-scale GPMs. The resulting model is composed of independent statistical representations for non-overlapping segments of the given data and reduces computation time by orders of magnitude. Our performance analysis indicates that our proposal outperforms state-of-the-art algorithms for automatic GPM retrieval with respect to efficiency, scalability, and accuracy.

Paper Nr: 22
Title:

Bottom-up Discovery of Context-aware Quality Constraints for Heterogeneous Knowledge Graphs

Authors:

Xander Wilcke, Maurice de Kleijn, Victor de Boer, Henk Scholten and Frank van Harmelen

Abstract: As knowledge graphs are increasingly adopted, the question of how to maintain the validity and accuracy of our knowledge becomes ever more relevant. We introduce context-aware constraints as a means to help preserve knowledge integrity. Context-aware constraints offer more fine-grained control over the domain onto which we impose restrictions. We also introduce a bottom-up anytime algorithm to discover context-aware constraints directly from heterogeneous knowledge graphs, i.e., graphs made up of entities and literals of various (data) types which are linked using various relations. Our method is embarrassingly parallel and can exploit prior knowledge in the form of schemas to reduce computation time. We demonstrate our method on three different datasets and evaluate its effectiveness by letting experts on knowledge validation and management assess candidate constraints in a real-world knowledge validation use case. Our results show that, overall, context-aware constraints are useful to an extent for knowledge validation tasks, and that the majority of the generated constraints are well balanced with respect to complexity.

Paper Nr: 26
Title:

A Recurrent Neural Network and Differential Equation based Spatiotemporal Infectious Disease Model with Application to COVID-19

Authors:

Zhijian Li, Yunling Zheng, Jack Xin and Guofa Zhou

Abstract: The outbreaks of Coronavirus Disease 2019 (COVID-19) have impacted the world significantly. Modeling the trend of infection and real-time forecasting of cases can help decision making and control of the disease spread. However, data-driven methods such as recurrent neural networks (RNN) can perform poorly due to the limited number of daily samples in time. In this work, we develop an integrated spatiotemporal model based on the epidemic differential equations (SIR) and RNNs. The former, after simplification and discretization, is a compact model of the temporal infection trend of a region, while the latter models the effect of nearest neighboring regions and thereby captures latent spatial information. We trained and tested our model on COVID-19 data from Italy, and show that it outperforms existing temporal models (fully connected NN, SIR, ARIMA) in 1-day, 3-day, and 1-week ahead forecasting, especially in the regime of limited training data.
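The temporal component rests on the classical SIR equations; a textbook Euler discretization with unit time step (not the authors' exact simplification) looks like this:

```python
def sir_step(s, i, r, beta, gamma):
    """One discrete step of the SIR model.
    beta: transmission rate, gamma: recovery rate."""
    n = s + i + r                    # total population is conserved
    new_inf = beta * s * i / n       # newly infected this step
    new_rec = gamma * i              # newly recovered this step
    return s - new_inf, i + new_inf - new_rec, r + new_rec

# One step for a population of 1000 with 10 initial infections
s, i, r = sir_step(990, 10, 0, beta=0.3, gamma=0.1)
```

In the paper's hybrid model, the output of such a compact temporal model is combined with an RNN over neighboring regions.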

Paper Nr: 27
Title:

CoExDBSCAN: Density-based Clustering with Constrained Expansion

Authors:

Benjamin Ertl, Jörg Meyer, Matthias Schneider and Achim Streit

Abstract: Full-space clustering methods suffer from the curse of dimensionality; for example, points tend to become equidistant from one another as the dimensionality increases. Subspace clustering and correlation clustering algorithms overcome these issues, but still face challenges when data points have complex relations or clusters overlap. In these cases, clustering with constraints can improve the clustering results by including a priori knowledge in the clustering process. This article proposes a new clustering algorithm, CoExDBSCAN, density-based clustering with constrained expansion, which combines traditional density-based clustering with techniques from subspace, correlation, and constrained clustering. The proposed algorithm uses DBSCAN to find density-connected clusters in a defined subspace of features and restricts the expansion of clusters according to a priori constraints. We provide verification and runtime analysis of the algorithm on a synthetic dataset, and an experimental evaluation on a climatology dataset of satellite observations. The experimental evaluation demonstrates that our algorithm is especially suited to spatio-temporal data, where one subspace of features defines the spatial extent of the data and another the correlations between features.
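The core idea of constraining cluster expansion can be sketched as a DBSCAN variant with a constraint hook. This is a simplified illustration with a hypothetical `ok(cluster, point)` predicate standing in for the a priori constraints, not the CoExDBSCAN algorithm itself:

```python
def dbscan_constrained(points, eps, min_pts, subspace, ok):
    """DBSCAN over the given feature subspace; a cluster only expands
    through points accepted by the constraint predicate `ok`."""
    def dist(a, b):
        return sum((points[a][k] - points[b][k]) ** 2 for k in subspace) ** 0.5
    def neighbors(i):
        return [j for j in range(len(points)) if j != i and dist(i, j) <= eps]
    labels = [None] * len(points)        # None = unvisited, -1 = noise
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) + 1 < min_pts:     # not a core point
            labels[i] = -1
            continue
        labels[i] = cid
        cluster = [i]
        while seeds:
            j = seeds.pop()
            if labels[j] not in (None, -1):
                continue
            if not ok(cluster, j):       # a priori expansion constraint
                continue
            labels[j] = cid
            cluster.append(j)
            nb = neighbors(j)
            if len(nb) + 1 >= min_pts:   # j is core: keep expanding
                seeds.extend(k for k in nb if labels[k] in (None, -1))
        cid += 1
    return labels
```

With `ok` always returning True this degenerates to plain DBSCAN on the chosen subspace; a real constraint could, for instance, require correlation in a second subspace.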

Paper Nr: 39
Title:

Geographic Feature Engineering with Points-of-Interest from OpenStreetMap

Authors:

Adelson de Araujo, João Marcos do Valle and Nélio Cacho

Abstract: Although geographic patterns have been considered in statistical modelling for many years, newly volunteered geographic information is opening opportunities for estimating city variables from the urban characteristics of places. Studies have shown the effectiveness of using Points-of-Interest (PoI) data in various predictive application domains involving geographic data science, e.g. crime hot spots, air quality, and land usage analysis. However, the data sources mentioned in these studies are hard to find, and the best practices for extracting useful covariates from them are unclear. In this study, we propose Geohunter, a reproducible geographic feature engineering procedure that relies on OpenStreetMap, with a software interface to commonly used tools for geographic data analysis. We also analysed two feature engineering procedures, the quadrat method and kernel density estimation (KDE), conducting a qualitative and quantitative evaluation to suggest which better translates the geographic patterns of the city. Further, we provide illustrative examples of Geohunter applications.

Paper Nr: 44
Title:

Amharic Document Representation for Adhoc Retrieval

Authors:

Tilahun Yeshambel, Josiane Mothe and Yaregal Assabie

Abstract: Amharic is the official language of the government of Ethiopia, a country with an estimated population of over 110 million. Like other Semitic languages, Amharic is characterized by complex morphology, with thousands of words generated from a single root form through inflection and derivation. This has made the development of tools for Amharic natural language processing a non-trivial task, and Amharic adhoc retrieval faces difficulties due to the complex morphological structure of the language. In this paper, the impact of morphological features on the representation of Amharic documents and queries for adhoc retrieval is investigated. We analyze the effects of stem-based and root-based approaches on Amharic adhoc retrieval effectiveness. Various experiments are conducted on a TREC-like Amharic information retrieval test collection using a standard evaluation framework and measures. The findings show that a root-based approach outperforms the conventional stem-based approach that prevails in many other languages.

Short Papers
Paper Nr: 4
Title:

A Feature Space Transformation to Intrusion Detection Systems

Authors:

Roberto Saia, Salvatore Carta, Diego R. Recupero and Gianni Fenu

Abstract: Anomaly-based Intrusion Detection Systems (IDSs) represent one of the most efficient methods for countering intrusion attempts against the ever-growing number of network-based services. Despite the central role they play, their effectiveness in a real-world context is jeopardized by a series of problems, mainly the difficulty of correctly classifying attacks whose characteristics are very similar to normal network activity, and the difficulty of countering novel forms of attack (zero-days). This paper faces these problems by adopting a Twofold Feature Space Transformation (TFST) approach aimed at gaining a better characterization of network events and a reduction of their potential patterns. The idea behind this approach is based on: (i) the addition of meta-information, improving the event characterization; (ii) the discretization of the new feature space in order to join together patterns that lead back to the same events, reducing the number of false alarms. A validation process performed on a real-world dataset indicates that the proposed approach is able to outperform canonical state-of-the-art solutions, improving their intrusion detection capability.

Paper Nr: 5
Title:

Estimating Personalization using Topical User Profile

Authors:

Sara Abri, Rayan Abri and Salih Cetin

Abstract: Exploring the effect of personalization on different queries can improve ranking results, and a mechanism is needed to estimate the potential for personalization of a query. Previous estimators of this potential, such as click entropy and topic entropy, are based on previously clicked documents for the query or on the query history, and thus suffer from the unavailability of prior click data for new/unseen queries or queries without history. To alleviate this problem, we provide a solution that works regardless of query history. In this paper, we present a new metric, based on the topic distribution of user documents in a topical user profile, to estimate the potential for personalization of any query. Using the proposed metric, we achieve better performance for queries with history and address the cold-start problem of queries without history. To improve personalized search, we provide a personalization ranking model that combines personalized and non-personalized topic models, with the proposed metric used to estimate personalization. The results reveal that the personalization ranking model using the proposed metric improves Mean Reciprocal Rank and Normalized Discounted Cumulative Gain by 5% and 4%, respectively.
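Click entropy, one of the baseline estimators mentioned above, is simply the entropy of the click distribution over result documents for a query; a minimal sketch:

```python
import math

def click_entropy(clicks):
    """Entropy (bits) of the click distribution over result documents.
    Low entropy (clicks concentrated on one document) suggests little
    benefit from personalization; high entropy suggests more."""
    total = sum(clicks.values())
    return -sum((c / total) * math.log2(c / total)
                for c in clicks.values())

click_entropy({"doc1": 5, "doc2": 5})  # → 1.0 bit: ambiguous query
```

The paper's contribution replaces such click-based statistics with a metric computed from topic distributions in the user profile, so it needs no click history.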

Paper Nr: 8
Title:

Ontology-based Methods for Classifying Scientific Datasets into Research Domains: Much Harder than Expected

Authors:

Xu Wang, Frank Van Harmelen and Zhisheng Huang

Abstract: Scientific datasets are increasingly stored, published, and re-used online. This has prompted major search engines to start services dedicated to finding research datasets online. However, to date such services are limited to keyword search and provide little or no semantic guidance. Determining the scientific domain of a given dataset is a crucial part of dataset recommendation and search: "Which research domain does this dataset belong to?". In this paper we investigate and compare a number of novel ontology-based methods to answer that question, using the distance between a domain ontology and a dataset as an estimator for the domain(s) into which the dataset should be classified. We also define a simple keyword-based classifier based on the Normalized Google Distance, and we evaluate all classifiers on a hand-constructed gold standard. Our two main findings are that the seemingly simple task of determining the domain(s) of a dataset is much harder than expected (even when performed under highly simplified circumstances), and that, again surprisingly, the use of ontologies seems to be of little help in this task, with the simple keyword-based classifier outperforming every ontology-based classifier. We constructed a gold-standard benchmark for our experiments, which we make available online for others to use.
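The Normalized Google Distance used by the keyword-based classifier has a standard closed form over search hit counts; as a sketch:

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance between two terms.
    fx, fy: hit counts of each term; fxy: joint hit count;
    n: total number of indexed pages. 0 means the terms always co-occur."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

ngd(100, 100, 100, 10_000)  # → 0.0: the terms are inseparable
```

A keyword classifier can then assign a dataset to the domain whose label terms minimize the average NGD to the dataset's keywords.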

Paper Nr: 11
Title:

Diverse Group Formation based on Multiple Demographic Features

Authors:

Mohammed Alqahtani, Susan Gauch, Omar Salman, Mohammed Ibrahim and Reem Al-Saffar

Abstract: The goal of group formation is to build a team to accomplish a specific task. Algorithms are employed to improve the effectiveness of the team so formed and the efficiency of the group selection process. However, there is concern that team formation algorithms could be biased against minorities due to the algorithms themselves or the data on which they are trained. Hence, it is essential to build fair team formation systems that incorporate demographic information into the process of building the group. Although there has been extensive work on modeling individuals' expertise for expert recommendation and/or team formation, there has been relatively little prior work on modeling demographics and incorporating demographics into the group formation process. We propose a novel method to represent experts' demographic profiles based on multidimensional demographic features. Moreover, we introduce two diversity ranking algorithms that form a group by considering demographic features along with the minimum required skills. Unlike many ranking algorithms that consider a single Boolean demographic feature (e.g., gender or race), our diversity ranking algorithms consider multiple multivalued demographic attributes simultaneously. We evaluate our proposed algorithms using a real dataset based on members of a computer science program committee. The results show that our algorithms form a program committee that is more diverse, with an acceptable loss in utility.

Paper Nr: 13
Title:

Session Similarity based Approach for Alleviating Cold-start Session Problem in e-Commerce for Top-N Recommendations

Authors:

Ramazan Esmeli, Mohamed Bader-El-Den and Hassana Abdullahi

Abstract: The cold-start problem is one of the main challenges for recommender systems. Many methods have been developed for traditional recommender systems to alleviate the drawbacks of cold-start users and items. However, to the best of our knowledge, the cold-start session problem in session-based recommender systems still needs to be investigated. In this paper, we propose a session similarity-based method to alleviate the drawback of cold-start sessions in the e-commerce domain, in which a session contains no interacted items that could help identify the user's preferences. In the proposed method, product recommendations are given based on the most similar sessions, found using session features such as session start time, location, etc. Computational experiments on two real-world datasets show that when the proposed method is applied, the performance of the recommender system improves significantly in terms of recall and precision compared to random recommendations for cold-start sessions.
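The similarity search over session features can be sketched with cosine similarity. The feature encoding and session layout below are illustrative assumptions, not the authors' design:

```python
import math

def cosine(u, v):
    """Cosine similarity between two numeric feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def recommend(cold_features, past_sessions, top_n=3):
    """Recommend the items of the past session most similar to the
    cold-start session's context features (start time, location, ...)."""
    best = max(past_sessions,
               key=lambda s: cosine(cold_features, s["features"]))
    return best["items"][:top_n]

past = [{"features": [1, 0], "items": ["a", "b"]},
        {"features": [0, 1], "items": ["c"]}]
recommend([0.9, 0.1], past)  # → ["a", "b"]
```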

Paper Nr: 19
Title:

FLE: A Fuzzy Logic Algorithm for Classification of Emotions in Literary Corpora

Authors:

Luis-Gil Moreno-Jiménez, Juan-Manuel Torres-Moreno, Hanifa Boucheneb and Roseli S. Wedemann

Abstract: This paper presents an algorithm based on fuzzy logic, called the Fuzzy Logic Emotions (FLE) classifier, devised to identify emotions in corpora of literary texts. The algorithm evaluates a sentence to determine the class(es) of emotions to which it belongs. For this purpose, it considers three types of linguistic variables (verb, noun and adjective) with associated linguistic values used to qualify the emotion they express. A numerical value is computed for each of these terms within a sentence, based on its term frequency and inverse document frequency (TF-IDF). We have tested our FLE classifier with an evaluation protocol, using a literary corpus in Spanish specially structured for work on the automatic detection of emotions in text. We present encouraging performance results favoring our FLE classifier when compared to other well-known algorithms from the literature for the detection of emotions in text.

Paper Nr: 21
Title:

Towards Strength-sensitive Social Profiling in Ego Networks

Authors:

Asma Chader, Hamid Haddadou, Leila Hamdad and Walid-Khaled Hidouci

Abstract: In online social networks, incomplete or noisy data are common, increasingly raising the need for more accurate methods, especially in user attribute profiling. This work explores the influence of social tie strength in such settings, based on the intuition that the stronger a relationship is, the more likely its members are to share the same attribute values. A strength-sensitive community-based social profiling process, named SCoBSP, is introduced, and the above hypothesis is tested on real-world co-authorship networks from the DBLP computer science bibliography. Experimental results demonstrate the ability of SCoBSP to infer attributes accurately, achieving an improvement of 9.18% in F-measure over the strength-agnostic process.

Paper Nr: 23
Title:

Weak Ties Are Surprising Everywhere

Authors:

Iaakov Exman, Asaf Yosef and Omer Ganon

Abstract: Weak ties between people have been known to be surprisingly effective for achieving practical goals, such as getting a job. However, weak ties were often assumed to correlate with topological distance in virtual social networks. The unexpected novelty of this paper is that weak ties are surprisingly everywhere, independently of topological distance. This is shown by modelling luck, with reference to a target task, as a composition of a surprise function expressing weak ties and a target relevance function expressing strong ties between people. The model enables an automatic luck-generation software tool that supports target tasks mainly through the surprise function. The main result is obtained by superposing the luck model upon network topological maps of a customer's relationships to its followers in any chosen social network. The result is validated by surprise Keyword Clouds of customer followers and Keyword Frequencies for diverse followers. Results are illustrated by a variety of graphs calculated for specific customers.

Paper Nr: 28
Title:

Extracting Body Text from Academic PDF Documents for Text Mining

Authors:

Changfeng Yu, Cheng Zhang and Jie Wang

Abstract: Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining applications for deeper semantic understandings. The objective is to extract complete sentences in the body text into a txt file with the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents would often mix body and nonbody texts. We devise and implement a system called PDFBoT to detect multiple-column layouts using a line-sweeping technique, remove nonbody text using computed text features and syntactic tagging in backward traversal, and align the remaining text back to sentences and paragraphs. We show that PDFBoT is highly accurate with average F1 scores of, respectively, 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text on tables, figures, and charts over a corpus of PDF documents randomly selected from arXiv.org across multiple academic disciplines.
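A line-sweeping test for multiple-column layout can be sketched as follows, assuming word bounding boxes have already been extracted. This is a simplified guess at such a test, not PDFBoT's implementation:

```python
def find_gutters(boxes, page_width, step=5):
    """Sweep vertical lines across the page; x positions that cross
    no text box are candidate column gutters (white-space channels).
    boxes: (x0, x1, y0, y1) word bounding boxes in page coordinates."""
    gutters = []
    for x in range(0, page_width, step):
        if all(not (x0 <= x <= x1) for x0, x1, _, _ in boxes):
            gutters.append(x)
    return gutters

# Two text blocks with a white-space channel around x = 50
find_gutters([(0, 40, 0, 10), (60, 100, 0, 10)], 100, step=10)  # → [50]
```

A run of adjacent gutter positions spanning most of the page height would indicate a column boundary, after which each column can be processed separately.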

Paper Nr: 30
Title:

Target Evaluation for Neural Language Model using Japanese Case Frame

Authors:

Kazuhito Tamura, Ikumi Suzuki and Kazuo Hara

Abstract: Automatic text generation is widely used in various types of natural language processing systems, and it is crucial for these systems to capture correct grammar. According to recent studies, neural language models successfully acquire English grammar. However, why neural language models work has not been thoroughly investigated, so fine-grained grammatical and syntactic analysis is important for assessing them. In this paper, we construct grammatical evaluation methods to assess the Japanese grammatical ability of neural language models by adopting a target evaluation approach. We focus in particular on case marker and verb agreement in Japanese case grammar. In our experiments, we report the grammatical ability of a neural language model by comparing it with n-gram models. The neural language model performed better even when some information is lacking, while the n-gram models perform poorly. The neural language model also exhibited more robust performance on low-frequency terms.

Paper Nr: 31
Title:

Filtering a Reference Corpus to Generalize Stylometric Representations

Authors:

Julien Hay, Bich-Liên Doan, Fabrice Popineau and Ouassim A. Elhara

Abstract: Authorship analysis aims at studying writing styles to predict the authorship of a portion of written text. Our main task is to represent documents so that they reflect authorship. To this end, we use these representations for authorship attribution, in which the author of a document is identified from a list of known authors. We have recently shown that style can be generalized from a set of reference authors: we trained a DNN to identify the authors of a large reference corpus and thereby learnt to represent style in a general stylometric space. With such a representation learning method, we can embed new documents into this stylometric space so that stylistic features are highlighted. In this paper, we want to validate the following hypothesis: the more authorship terms are filtered, the better models can be generalized, as attention can then be focused on the style-related and constituent linguistic structures in authors' styles. To this end, we suggest a new, efficient, and highly scalable filtering process. This process permits higher accuracy on various test sets in both authorship attribution and clustering tasks.

Paper Nr: 32
Title:

Enhanced Active Learning of Convolutional Neural Networks: A Case Study for Defect Classification in the Semiconductor Industry

Authors:

Georgios Koutroulis, Tiago Santos, Michael Wiedemann, Christian Faistauer, Roman Kern and Stefan Thalmann

Abstract: With the advent of high performance computing and scientific advancement, deep convolutional neural networks (CNN) have been established as the best candidate for image classification tasks. A decisive requirement for the successful deployment of CNN models is a vast amount of annotated images, which is usually costly and quite tedious to produce, especially in an industrial environment. To address this deployment barrier, we propose an enhanced active learning framework for a CNN model with a compressed architecture for chip defect classification in semiconductor wafers. Our framework unfolds in two main steps and is performed iteratively. First, a subset of the most informative samples is queried based on uncertainty estimation. Second, spatial metadata of the queried images are utilized for density-based clustering in order to discard noisy instances and keep only those that constitute systematic defect patterns in the wafer. Finally, a reduced and more representative subset of images is passed on for labelling, thus minimizing the manual labour of the process engineer. In each iteration, the performance of the CNN model is considerably improved, as only those images that help the model generalize better are labeled. We validate the effectiveness of our framework using real data from running processes of a semiconductor manufacturer.
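The first step, uncertainty-based querying, is commonly implemented by ranking unlabeled samples by predictive entropy; a minimal sketch under that assumption (the paper may use a different uncertainty estimate):

```python
import math

def most_uncertain(probs, k):
    """Return indices of the k samples with the highest predictive
    entropy, given each sample's class probability vector."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    order = sorted(range(len(probs)),
                   key=lambda i: entropy(probs[i]), reverse=True)
    return order[:k]

# The 50/50 prediction is the most uncertain one
most_uncertain([[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]], 1)  # → [0]
```

The queried indices would then be filtered by the density-based clustering step before being sent to the process engineer for labelling.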

Paper Nr: 35
Title:

Personalised Recommendation Systems and the Impact of COVID-19: Perspectives, Opportunities and Challenges

Authors:

Rabaa Abdulrahman and Herna L. Viktor

Abstract: Personalised Recommendation Systems that utilize machine learning algorithms have had much success in recent years, leading to accurate predictions in many e-business domains. However, this environment experienced abrupt changes with the onset of the COVID-19 pandemic centred on an exponential increase in the volume of customers and swift alterations in customer behaviours and profiles. This position paper discusses the impact of the COVID-19 pandemic on the Recommendation Systems landscape and focuses on new and atypical users. We detail how online machine learning algorithms that are able to detect and subsequently adapt to changes in consumer behaviours and profiles can be used to provide accurate and timely predictions regarding this evolving consumer sector.

Paper Nr: 36
Title:

R-peak Detector Benchmarking using FieldWiz Device and Physionet Databases

Authors:

Tiago Rodrigues, Hugo Silva and Ana Fred

Abstract: R-peak detection in electrocardiography (ECG) signals is of great importance for Heart Rate Variability (HRV) studies and for feature extraction based on fiducial points. In this paper, a real-time, low-complexity algorithm for R-peak detection is evaluated on single-lead ECG signals. The method is divided into a pre-processing stage and a detection stage. First, the pre-processing applies double differentiation, squaring, and moving-window integration for QRS complex enhancement. Second, the detection stage uses a finite state machine (FSM) with adaptive thresholding for R-peak detection. The approach was benchmarked on a private FieldWiz database against other commonly used QRS detectors, and then evaluated on the Physionet databases (mitdb, nstdb, ltstdb and CinC Challenge 2014). It achieved a Sensitivity (Se) of 99.77% and a Positive Predictive Value (PPV) of 99.18% on the FieldWiz database, comparable with the evaluated state-of-the-art QRS detectors. On the Physionet databases, the results proved to be highly influenced by the QRS waveform: on MIT-BIH (MITDB), the method achieved a median PPV of 99.79% and a median Se of 99.52%, with an overall PPV of 98.35% and Se of 97.62%. The evaluated method can be implemented in wearable cardiovascular tracking devices for dynamic use cases with good-quality ECG signals, achieving results comparable to state-of-the-art detectors.
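The pre-processing stage (double differentiation, squaring, moving-window integration) follows a Pan-Tompkins-style pattern and can be sketched as below, with an illustrative window size; the FSM detection stage and adaptive thresholds are not reproduced:

```python
def qrs_envelope(sig, win=8):
    """Enhance QRS complexes: difference twice, square, then
    integrate over a moving window of `win` samples."""
    d1 = [b - a for a, b in zip(sig, sig[1:])]    # first difference
    d2 = [b - a for a, b in zip(d1, d1[1:])]      # second difference
    sq = [v * v for v in d2]                      # squaring
    return [sum(sq[max(0, i - win + 1): i + 1])   # moving-window sum
            for i in range(len(sq))]

# A lone spike in a flat signal produces a clear envelope peak
sig = [0.0] * 20 + [5.0] + [0.0] * 20
env = qrs_envelope(sig)
```

A detector would then run its state machine over `env`, declaring an R-peak whenever the envelope exceeds an adaptively updated threshold.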

Paper Nr: 37
Title:

Generating Adequate Distractors for Multiple-Choice Questions

Authors:

Cheng Zhang, Yicheng Sun, Hejia Chen and Jie Wang

Abstract: This paper presents a novel approach to the automatic generation of adequate distractors for a given question-answer pair (QAP), generated from a given article, to form an adequate multiple-choice question (MCQ). Our method combines part-of-speech tagging, named-entity tagging, semantic-role labeling, regular expressions, domain knowledge bases, word embeddings, word edit distance, WordNet, and other algorithms. We use the US SAT (Scholastic Assessment Test) practice reading tests as a dataset to produce QAPs and generate three distractors for each QAP to form an MCQ. We show, via experiments and evaluations by human judges, that each MCQ has at least one adequate distractor and 84% of MCQs have three adequate distractors.
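Word edit distance, one of the listed ingredients, is typically the Levenshtein distance; a minimal sketch of how it could be used, for instance, to reject candidate distractors that are near-duplicates of the correct answer:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, via the classic
    dynamic program kept in a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

edit_distance("kitten", "sitting")  # → 3
```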

Paper Nr: 38
Title:

Unsupervised Descriptive Text Mining for Knowledge Graph Learning

Authors:

Giacomo Frisoni, Gianluca Moro and Antonella Carbonaro

Abstract: The use of knowledge graphs (KGs) in advanced applications is constantly growing, as a consequence of their ability to model large collections of semantically interconnected data. The extraction of relational facts from plain text is currently one of the main approaches for the construction and expansion of KGs. In this paper, we introduce a novel unsupervised and automatic technique of KG learning from corpora of short unstructured and unlabeled texts. Our approach is unique in that it starts from raw textual data and proceeds to: i) identify a set of relevant domain-dependent terms; ii) extract aggregate and statistically significant semantic relationships between terms, documents and classes; iii) represent the resulting probabilistic knowledge as a KG; iv) extend and integrate the KG according to the Linked Open Data vision. The proposed solution is easily transferable to many domains and languages as long as the data are available. As a case study, we demonstrate how it is possible to automatically learn a KG representing the knowledge contained within the conversational messages shared on social networks such as Facebook by patients with rare diseases, and the impact this can have on creating resources aimed at capturing the “voice of patients”.
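The abstract does not name the statistic used to score "statistically significant semantic relationships"; as one plausible illustration only, pointwise mutual information (PMI) over per-document term co-occurrence yields weighted edges for a term-level graph. All names here are hypothetical:

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_edges(docs, min_count=2):
    """Score term-term edges from per-document co-occurrence with
    pointwise mutual information (PMI); keep frequent, positive pairs."""
    term_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        terms = set(doc)                    # each doc is a list of terms
        term_counts.update(terms)
        pair_counts.update(combinations(sorted(terms), 2))
    n = len(docs)
    edges = {}
    for (a, b), c in pair_counts.items():
        if c < min_count:
            continue
        # PMI compares the observed joint frequency with what
        # independence of the two terms would predict.
        pmi = math.log((c / n) / ((term_counts[a] / n) * (term_counts[b] / n)))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges
```

Each surviving edge could then become a weighted relation in the learned KG.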

Paper Nr: 40
Title:

Stock Trend Prediction using Financial Market News and BERT

Authors:

Feng Wei and Uyen T. Nguyen

Abstract: Stock market trend prediction is an attractive research topic since successful predictions of the market’s future movement could result in significant profits. Recent advances in language representation such as the Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT) models have shown success in incorporating a pre-trained transformer language model and fine-tuning operations to improve downstream natural language processing (NLP) systems. In this paper, we apply the popular BERT model to leverage financial market news to predict stock price movements. Experimental results show that our proposed methods are simple but very effective, significantly improving stock prediction accuracy on a standard financial database over the baseline system and existing work.

Paper Nr: 45
Title:

Historical Document Processing: A Survey of Techniques, Tools, and Trends

Authors:

James Philips and Nasseh Tabrizi

Abstract: Historical Document Processing (HDP) is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from computer vision, document analysis and recognition, natural language processing, and machine learning to convert images of ancient manuscripts and early printed texts into a digital format usable in data mining and information retrieval systems. As libraries and other cultural heritage institutions have scanned their historical document archives, the need to transcribe the full text from these collections has become acute. Since HDP encompasses multiple sub-domains of computer science, knowledge relevant to its purpose is scattered across numerous journals and conference proceedings. This paper surveys the major phases of HDP, discusses standard algorithms, tools, and datasets, and finally suggests directions for further research.

Paper Nr: 9
Title:

Using Affective Features from Media Content Metadata for Better Movie Recommendations

Authors:

John K. Leung, Igor Griva and William G. Kennedy

Abstract: This paper investigates the causality in the decision making of movie recommendations through the users' affective profiles. We advocate a method of assigning emotional tags to a movie by auto-detecting the affective features in the movie's overview. We apply a text-based Emotion Detection and Recognition model, trained on short tweet messages, and transfer the learned model to detect the implicit affective features of movie overviews. We vectorize the affective movie tags to represent the mood embeddings of the movie. We obtain the user's emotional features by taking the average of the affective vectors of all the movies the user has watched. We apply five distance metrics to rank the top-N movie recommendations against the user's emotion profile. We found that the Cosine Similarity distance metric performed better than the other distance measures. We conclude that by replacing the top-N recommendations generated by the Recommender with the reranked recommendation list produced by the Cosine Similarity metric, the user will effectively get affect-aware top-N recommendations, making the Recommender feel like an Emotion Aware Recommender.
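The ranking step described in this abstract can be sketched directly: average the affective vectors of watched movies into a profile, then rerank candidates by cosine similarity. A minimal sketch, where the function names and the tiny two-dimensional toy vectors are assumptions for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def user_profile(watched_vectors):
    """Average the affective vectors of all movies the user has watched."""
    n = len(watched_vectors)
    return [sum(col) / n for col in zip(*watched_vectors)]

def rerank(profile, movies, top_n=5):
    """Rank candidate movies by cosine similarity to the user's emotion profile."""
    scored = sorted(movies.items(),
                    key=lambda kv: cosine(profile, kv[1]),
                    reverse=True)
    return [title for title, _ in scored[:top_n]]
```

Swapping `cosine` for another of the five distance measures changes only the `key` function.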

Paper Nr: 15
Title:

Large Scale Intent Detection in Turkish Short Sentences with Contextual Word Embeddings

Authors:

Enes B. Dündar, Osman F. Kılıç, Tolga Çekiç, Yusufcan Manav and Onur Deniz

Abstract: We have developed a large-scale intent detection method for our Turkish conversation system in the banking domain to understand the problems of our customers. Recent advancements in natural language processing (NLP) have allowed machines to understand words in context through their low-dimensional vector representations, a.k.a. contextual word embeddings. Thus, we have decided to use two language model architectures that provide contextual embeddings: ELMo and BERT. We trained ELMo on Turkish corpora, while we used a pretrained Turkish BERT model. To evaluate these models on an intent classification task, we collected and annotated 6453 customer messages in 148 intents. Furthermore, another Turkish document classification dataset, named Kemik News, is used to compare our method with the state-of-the-art models. Experimental results have shown that using contextual word embeddings boosts Turkish document classification performance on various tasks. Moreover, converting Turkish characters to their English counterparts results in slightly better performance. Lastly, an experiment is conducted to find out which BERT layer is more effective for the intent classification task.

Paper Nr: 17
Title:

Identifying the k Best Targets for an Advertisement Campaign via Online Social Networks

Authors:

Mariella Bonomo, Armando La Placa and Simona E. Rombo

Abstract: We propose a novel approach for the recommendation of possible customers (users) to advertisers (e.g., brands) based on two main aspects: (i) the comparison between Online Social Network profiles, and (ii) neighborhood analysis on the Online Social Network. Profile matching between users and brands is based on a bag-of-words representation of textual content from the social media, and measures such as Term Frequency-Inverse Document Frequency are used to characterize the importance of words in the comparison. The approach has been implemented using Big Data technologies, thereby allowing the efficient analysis of very large Online Social Networks. Results on real datasets show that the combination of profile matching and neighborhood analysis is successful in identifying the most suitable set of users to target in a given advertisement campaign.
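Profile matching via bag-of-words with TF-IDF weighting, as described in this abstract, can be sketched as follows; the sparse-dict representation and helper names are illustrative, not the authors' Big Data implementation:

```python
import math
from collections import Counter

def tfidf_vectors(bags):
    """Turn bag-of-words Counters (one per profile) into TF-IDF weighted dicts."""
    n = len(bags)
    df = Counter()                      # document frequency of each word
    for bag in bags:
        df.update(set(bag))
    vecs = []
    for bag in bags:
        total = sum(bag.values())
        # term frequency (c / total) times inverse document frequency
        vecs.append({w: (c / total) * math.log(n / df[w])
                     for w, c in bag.items()})
    return vecs

def match_score(u, v):
    """Cosine similarity between two sparse TF-IDF dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A user-brand pair with a high `match_score` would then be a candidate for the campaign's target set.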

Paper Nr: 25
Title:

Moving towards a General Metadata Extraction Solution for Research Data with State-of-the-Art Methods

Authors:

Benedikt Heinrichs and Marius Politze

Abstract: Many research data management processes, especially those defined by the FAIR Guiding Principles, rely on metadata to make research data findable and re-usable. Most metadata workflows, however, require researchers to describe their data manually, a tedious process, which is one of the reasons it is sometimes not done. Therefore, automatic solutions have to be used in order to ensure findability and re-usability. Current solutions focus on, and are effective at, extracting metadata in single disciplines using domain knowledge. This paper therefore aims at identifying the gaps in current metadata extraction processes and defining a model for a general extraction pipeline for research data. The results of implementing such a model are discussed, and a proof of concept is shown in the case of video-based data. This model is the basis for future research as a testbed to build and evaluate discipline-specific automatic metadata extraction workflows.

Paper Nr: 29
Title:

TAGWAR: An Annotated Corpus for Sequence Tagging of War Incidents

Authors:

Nancy Sawaya, Shady Elbassuoni, Fatima A. Salem and Roaa Al Feel

Abstract: Sequence tagging of free text constitutes an important task in natural language processing (NLP). In this work, we focus on the problem of automatic sequence tagging of news articles reporting on wars. In this context, tags correspond to details surrounding war incidents where a large number of casualties is observed, such as the location of the incident, its date, the cause of death, the actor responsible for the incident, and the number of casualties of different types (civilians, non-civilians, women and children). To this end, we begin by building TAGWAR, a manually sequence-tagged dataset consisting of 804 news articles about the Syrian war, and use this dataset to train and test three state-of-the-art, deep-learning-based sequence tagging models: BERT, BiLSTM, and a plain Conditional Random Field (CRF) model, with BERT delivering the best performance. Our approach incorporates an element of input sensitivity analysis where we attempt modeling exclusively at the level of articles’ titles, then titles and first paragraphs, and finally full text. TAGWAR is publicly available at: https://doi.org/10.5281/zenodo.3766682.
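Sequence taggers such as those listed above are commonly trained on BIO-encoded tokens, where each entity span starts with a `B-` tag and continues with `I-` tags. A small sketch of that encoding step, with illustrative label names; the paper's actual tag set and annotation format may differ:

```python
def to_bio(tokens, spans):
    """Convert labeled token spans into BIO tags.
    `spans` maps a label to a (start, end) token range, end exclusive."""
    tags = ["O"] * len(tokens)          # "O" = outside any entity
    for label, (start, end) in spans.items():
        tags[start] = "B-" + label      # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label      # continuation tokens
    return tags
```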

Paper Nr: 33
Title:

Social Media as an Auxiliary News Source

Authors:

Stephen Bradshaw, Colm O’Riordan and Riad Cheikh

Abstract: Obtaining a balanced view of an issue can be a time-consuming and arduous task. A reader using only one source of information is in danger of being exposed to an author’s particular slant on a given issue. For many events, social media provides a range of expressions and views on a topic. In this paper, we explore the feasibility of mining alternative data and information sources to better inform users on the issues associated with a topic. To gauge the feasibility of augmenting available content with related information, a text similarity metric is adopted to measure the relevance of the auxiliary text. The developed system extracts related content from two distinct social media sources, Reddit and Twitter. The results are evaluated through a user survey on the relevance of the returned results. A two-tailed Wilcoxon test is applied to evaluate the relevance of the additional information snippets. Our results show, first, that by partaking in the experiment a user’s level of awareness is augmented and, second, that it is possible to better inform the user with information extracted from online microblogging sites.
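The abstract does not name the adopted text similarity metric; as one simple possibility, token-set Jaccard similarity can score a social-media snippet against an article. The function names and the threshold value are assumptions for illustration:

```python
def jaccard(text_a, text_b):
    """Token-set Jaccard similarity between two texts (case-insensitive)."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

def relevant_snippets(article, snippets, threshold=0.2):
    """Keep social-media snippets whose similarity to the article passes a threshold."""
    return [s for s in snippets if jaccard(article, s) >= threshold]
```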

Paper Nr: 34
Title:

A Survey of Sensor Modalities for Human Activity Recognition

Authors:

Bruce B. Yu, Yan Liu and Keith C. Chan

Abstract: Human Activity Recognition (HAR) has been attempted by various sensor modalities like vision sensors, ambient sensors, and wearable sensors. These heterogeneous sensors are usually used independently to conduct HAR. However, there are few comprehensive studies in the previous literature that investigate the HAR capability of various sensors and examine the gap between the existing HAR methods and their potential application domains. To fill in such a research gap, this survey examines the motivation behind HAR and compares the capability of various sensors for HAR by presenting their corresponding datasets and main algorithmic status. To do so, we first introduce HAR sensors from three categories: vision, ambient and wearable, by elaborating on their available tools and representative benchmark datasets. Then we analyze the HAR capability of various sensors with respect to the levels of activities that we defined for indicating activity complexity or resolution. With a comprehensive understanding of the different sensors, we review HAR algorithms from the perspectives of single-modal to multimodal methods. Based on the investigated algorithms, we point future research toward multimodal HAR solutions. This survey provides a panoramic view of HAR sensors, human activity characteristics and HAR algorithms, which will serve as a source of references for developing sensor-based HAR systems and applications.

Paper Nr: 43
Title:

Analysing the Effect of Platform and Operating System Features on Predicting Consumers’ Purchase Intent using Machine Learning Algorithms

Authors:

Ramazan Esmeli, Alaa Mohasseb and Mohamed Bader-El-Den

Abstract: Predicting future consumer browsing and purchase behaviour has become crucial to many marketing platforms. Consumer purchase intention is one of the main inputs used as a measurement of consumer demand for new products. In addition, identifying consumers’ purchase intent plays an important role in recommender systems. In this paper, the effect of using different platforms on users’ behaviours is explored. In addition, the relationship between users’ platforms and their purchase intention behaviours is investigated. We conduct computational experiments using different machine learning algorithms in order to investigate the use of users’ operating system and platform types as features. The results showed that users’ purchase intentions and behaviours are correlated with these features.