KDIR 2019 Abstracts


Full Papers
Paper Nr: 2
Title:

Application of Mixtures of Gaussians for Tracking Clusters in Spatio-temporal Data

Authors:

Benjamin Ertl, Jörg Meyer, Achim Streit and Matthias Schneider

Abstract: Clustering data based on their spatial and temporal similarity has become a research area of increasing popularity in the fields of data mining and data analysis. However, most clustering models for spatio-temporal data introduce additional complexity into the clustering process, and scalability becomes a significant issue for the analysis. This article proposes a data-driven approach for tracking clusters whose properties change over time and space. The proposed method extracts cluster features based on Gaussian mixture models and tracks their spatial and temporal changes without incorporating them into the clustering process. This approach allows the application of different methods for comparing and tracking similar and changing cluster properties. We provide verification and runtime analysis on a synthetic dataset, and an experimental evaluation on a climatology dataset of satellite observations, demonstrating a performant method for tracking clusters with changing spatio-temporal features.
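As an illustration of the general idea in this abstract (not the authors' exact method), cluster features can be extracted per time slice with a Gaussian mixture model and then matched across slices, here with a hypothetical greedy nearest-mean matching:

```python
# Illustrative sketch: fit a Gaussian mixture per time slice and match
# components across slices by nearest means (a simplifying assumption).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def fit_slice(points, n_components=2):
    """Fit a GMM to one time slice and return its component means."""
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(points)
    return gmm.means_

def match_clusters(means_t, means_t1):
    """Greedily pair each component at time t with the closest one at t+1."""
    pairs = []
    for i, m in enumerate(means_t):
        d = np.linalg.norm(means_t1 - m, axis=1)
        pairs.append((i, int(np.argmin(d))))
    return pairs

# Two synthetic time slices: both clusters drift by +1 along x.
slice_t = np.vstack([rng.normal([0, 0], 0.1, (50, 2)),
                     rng.normal([5, 5], 0.1, (50, 2))])
slice_t1 = slice_t + np.array([1.0, 0.0])

pairs = match_clusters(fit_slice(slice_t), fit_slice(slice_t1))
print(pairs)  # each cluster at t maps to its drifted counterpart at t+1
```

Because the tracking happens after clustering, the matching criterion (here, nearest means) can be swapped out without touching the clustering step, which is the decoupling the abstract emphasizes.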

Paper Nr: 7
Title:

Unsupervised Evaluation of Human Translation Quality

Authors:

Yi Zhou and Danushka Bollegala

Abstract: Even though machine translation (MT) systems have reached impressive performance in cross-lingual translation tasks, the quality of MT still lags far behind professional human translations (HTs) due to the complexity of natural languages, especially for terminology in different domains. Therefore, HTs are still widely demanded in practice. However, the quality of HT is also imperfect and varies significantly depending on the experience and knowledge of the translators. Evaluating the quality of HT automatically faces many challenges. Although bilingual speakers are able to assess translation quality, manually checking the accuracy of translations is expensive and time-consuming. In this paper, we propose an unsupervised method to evaluate the quality of HT without requiring any labelled data. We compare a range of methods for automatically grading HTs and observe that the Bidirectional Minimum Word Mover’s Distance (BiMWMD) produces gradings that correlate well with human judgements.

Paper Nr: 11
Title:

Past-future Mutual Information Estimation in Sparse Information Conditions

Authors:

Yuval Shalev and Irad Ben-Gal

Abstract: We introduce CT-PFMI, a context-tree-based algorithm that estimates the past-future mutual information (PFMI) between different time series. By applying the pruning phase of the context tree algorithm, uninformative past sequences are removed from the PFMI estimation along with their false contributions. In situations where most of the past data is uninformative, CT-PFMI yields better estimates of the true PFMI than other benchmark methods, as demonstrated in a simulation study. By applying CT-PFMI to real stock price data, we also demonstrate how the algorithm provides useful insights when analyzing the interactions between financial time series.
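The quantity being estimated can be illustrated with a simple plug-in estimator of the mutual information between a past symbol and the next symbol of a discrete series; this sketch shows only the basic quantity, not the CT-PFMI algorithm itself:

```python
# Plug-in estimate of I(past symbol; next symbol) from empirical counts.
from collections import Counter
import math

def plugin_mi(pairs):
    """Mutual information (in bits) from empirical (past, future) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(p for p, _ in pairs)
    py = Counter(f for _, f in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[p] / n) * (py[f] / n)))
               for (p, f), c in pxy.items())

# A strictly alternating series: the past symbol fully determines the next,
# so the estimate approaches 1 bit.
seq = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
pairs = list(zip(seq, seq[1:]))
print(round(plugin_mi(pairs), 3))
```

The context-tree machinery in the paper replaces the single past symbol with variable-length past sequences and prunes the uninformative ones before this kind of estimate is formed.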

Paper Nr: 12
Title:

Detecting Aberrant Linking Behavior in Directed Networks

Authors:

Dorit S. Hochbaum, Quico Spaen and Mark Velednitsky

Abstract: Agents with aberrant behavior are commonplace in today’s networks. There are fake profiles in social media, malicious websites on the internet, and fake news sources that are prolific in spreading misinformation. The distinguishing characteristic of networks with aberrant agents is that normal agents rarely link to aberrant ones. Based on this manifested behavior, we propose a directed Markov Random Field (MRF) formulation for detecting aberrant agents. The formulation balances two objectives: to have as few links as possible from normal to aberrant agents, as well as to deviate minimally from prior information (if given). The MRF formulation is solved optimally and efficiently. We compare the optimal solution for the MRF formulation to existing algorithms, including PageRank, TrustRank, and AntiTrustRank. To assess the performance of these algorithms, we present a variant of the modularity clustering metric that overcomes the known shortcomings of modularity in directed graphs. We show that this new metric has desirable properties and prove that optimizing it is NP-hard. In an empirical experiment with twenty-three different datasets, we demonstrate that the MRF method outperforms the other detection algorithms.

Paper Nr: 14
Title:

Augmented Semantic Explanations for Collaborative Filtering Recommendations

Authors:

Mohammed Alshammari and Olfa Nasraoui

Abstract: Collaborative Filtering techniques provide the ability to handle big and sparse data to predict the rating for unseen items with high accuracy. However, they fail to justify their output. The main objective of this paper is to present a novel approach that employs Semantic Web technologies to generate explanations for the output of black box recommender systems. The proposed model significantly outperforms state-of-the-art baseline models in terms of the error rate. Moreover, it produces more explainable items than all baseline approaches.

Paper Nr: 16
Title:

Multi-Objective Optimization for Automated Business Process Discovery

Authors:

Mohamed A. Ghazal, Samy Ghoniemy and Mostafa A. Salama

Abstract: Process Mining is a research field that aims to develop new techniques to discover, monitor and improve real processes by extracting knowledge from event logs. This relatively young research discipline has demonstrated efficacy in various applications, especially in domains where dynamic behavior needs to be related to process models. Process Model Discovery is presumably the most important task in Process Mining, since the discovered models can be used as objective starting points for any further process analysis. There are various quality dimensions a model should satisfy during discovery, such as Replay-Fitness, Precision, Generalization, and Simplicity. It thus becomes evident that Process Model Discovery, in its current setting, is a Multi-Objective Optimization Problem. However, most existing techniques do not approach it as one. Therefore, in this work we propose the use of one of the most robust and widely used Multi-Objective Optimizers, the NSGA-II algorithm, for Process Model Discovery. Experimental results on a real-life event log show that the proposed technique outperforms existing techniques in various respects. This work also aims to establish a benchmarking system for comparing the results of Multi-Objective Optimization based Process Model Discovery techniques.

Paper Nr: 17
Title:

A Supervised Multi-class Multi-label Word Embeddings Approach for Toxic Comment Classification

Authors:

Salvatore Carta, Andrea Corriga, Riccardo Mulas, Diego R. Recupero and Roberto Saia

Abstract: Nowadays, communication through modern Internet-based channels has revolutionized the way people exchange information, allowing real-time discussions among a huge number of users. However, the advantages offered by such powerful instruments of communication are sometimes jeopardized by personal attacks that lead many people to leave a discussion they were participating in. This problem concerns the so-called toxic comments, i.e., personal attacks, verbal bullying and, more generally, aggressive ways of participating in a discussion, which drive some participants to abandon it. By exploiting the Apache Spark big data framework and several word embeddings, this paper presents an approach able to perform a multi-class, multi-label classification of a discussion within a range of six toxicity classes. We evaluate this approach by classifying a dataset of comments taken from Wikipedia’s talk pages, according to a Kaggle challenge. The experimental results show that, through the adoption of different sets of word embeddings, our supervised approach outperforms state-of-the-art methods that exploit the canonical bag-of-words model. In addition, the adoption of word embeddings trained in a similar scenario (i.e., discussions related to e-learning videos) shows that it is possible to improve performance with respect to solutions employing state-of-the-art word embeddings.

Paper Nr: 24
Title:

Active Learning and User Segmentation for the Cold-start Problem in Recommendation Systems

Authors:

Rabaa Alabdulrahman, Herna Viktor and Eric Paquet

Abstract: Recommendation systems, which are employed to mitigate the information overload faced by e-commerce users, have succeeded in aiding customers during their online shopping experience. However, to be able to make accurate recommendations, these systems require information about the items for sale and about users’ individual preferences. Making recommendations to new customers, with no prior data in the system, is therefore challenging. This scenario, called the “cold-start problem,” hinders the accuracy of recommendations made to a new user. In this paper, we introduce the popular users personalized predictions (PUPP) framework to address cold-starts. In this framework, soft clustering and active learning are used to accurately recommend items to new users. Experimental evaluation shows that the PUPP framework results in high performance and accurate predictions. Further, focusing on frequent, or so-called “popular,” users during our active-learning stage clearly benefits the learning process.

Paper Nr: 25
Title:

Detect the Unexpected: Novelty Detection in Large Astrophysical Surveys using Fisher Vectors

Authors:

Michael Rotman, Itamar Reis, Dovi Poznanski and Lior Wolf

Abstract: Finding novelties in an untagged high-dimensional dataset is an open problem. In this work, we present an innovative method for detecting such novelties using Fisher Vectors. Our dataset distribution is modeled using a Gaussian Mixture Model. An anomaly score that stems from the theory of Fisher Vectors is computed for each of the samples. We compute the anomaly score on the SDSS galaxy spectra dataset and present the different types of novelties found. We compare our findings with other outlier detection algorithms from the literature, and demonstrate the ability of our method to distinguish between samples drawn from intersecting probability distributions.

Paper Nr: 27
Title:

Chemical Named Entity Recognition with Deep Contextualized Neural Embeddings

Authors:

Zainab Awan, Tim Kahlke, Peter J. Ralph and Paul J. Kennedy

Abstract: Chemical named entity recognition (ChemNER) is a preliminary step in chemical information extraction pipelines. ChemNER has been approached using rule-based, dictionary-based, and feature-engineered machine learning methods, and more recently also deep learning based methods. Traditional word embeddings, like word2vec and GloVe, are inherently problematic because they ignore the context in which an entity appears. Contextualized embeddings from language models (ELMo) have recently been introduced to represent the contextual information of a word in its embedding space. In this work, we quantify the impact of contextualized embeddings on ChemNER using Bi-LSTM-CRF (bidirectional long short-term memory network with a conditional random field layer) models. We benchmarked our approach using four well-known corpora for chemical named entity recognition. Our results show that incorporating ELMo yields statistically significant improvements in F1 score on all of the tested datasets.

Paper Nr: 36
Title:

Model Driven Extraction of NoSQL Databases Schema: Case of MongoDB

Authors:

Amal A. Brahim, Rabah T. Ferhat and Gilles Zurfluh

Abstract: Big Data has received a great deal of attention in recent years. Not only is the amount of data on a completely different level than before, but we also face many different types of data, varying in format, structure, and source. This has definitely changed the tools we need to handle Big Data, giving rise to NoSQL systems. While NoSQL systems have proven their efficiency in handling Big Data, how to extract a model from a NoSQL database remains an unsolved problem. This paper proposes an automatic approach for extracting a physical model from a document-oriented NoSQL database, including links between different collections. In order to demonstrate the practical applicability of our work, we have realized it in a tool built on the Eclipse Modeling Framework environment.

Paper Nr: 41
Title:

Vocab Learn: A Text Mining System to Assist Vocabulary Learning

Authors:

Jingwen Wang, Changfeng Yu, Wenjing Yang and Jie Wang

Abstract: We present a text mining system called Vocab Learn that assists users in learning new words with respect to a knowledge base, where a knowledge base is a collection of written materials. Vocab Learn extracts words, excluding stop words, from a knowledge base and recommends new words to a user according to their importance and frequency. To reinforce learning and assess how well a word is learned, Vocab Learn generates, for each recommended word, a number of semantically close words using word embeddings (Mikolov et al., 2013a), and a number of words with look-alike spellings/strokes but different meanings using Minimum Edit Distance (Levenshtein, 1966). Moreover, to help users learn how to use a new word, Vocab Learn links each word to its dictionary definitions and provides sample sentences extracted from the knowledge base that include the word. We carry out experiments to compare the word-ranking algorithms TFIDF (Salton and McGill, 1986), TextRank (Mihalcea and Tarau, 2004), and RAKE (Rose et al., 2010) over the dataset of Inspec abstracts from Computer Science and Information Technology journals, with a set of keywords labeled by human editors. We show that TextRank is the best choice for ranking words on this dataset. We also show that Vocab Learn generates reasonable words with similar meanings, and words with similar spellings but different meanings.
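The Minimum Edit Distance step used here to find look-alike spellings can be sketched as follows; the vocabulary and distance threshold are illustrative assumptions, not details from the paper:

```python
# Classic dynamic-programming Levenshtein distance, used to rank
# candidate words with look-alike spellings.
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def look_alikes(word, vocabulary, max_dist=2):
    """Return vocabulary words within max_dist edits of `word`."""
    return sorted(w for w in vocabulary
                  if w != word and edit_distance(word, w) <= max_dist)

vocab = {"affect", "effect", "accept", "except", "expect"}
print(look_alikes("effect", vocab))  # → ['affect', 'expect']
```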

Paper Nr: 55
Title:

Functional Annotation of Proteins using Domain Embedding based Sequence Classification

Authors:

Bishnu Sarker, David W. Ritchie and Sabeur Aridhi

Abstract: Due to recent advances in genomic sequencing technologies, the number of protein sequences in public databases is growing exponentially. The UniProt Knowledgebase (UniProtKB) is currently the largest and most comprehensive resource for protein sequence and annotation data. The May 2019 release of UniProtKB contains around 158 million protein sequences. For the complete exploitation of this huge knowledge base, protein sequences need to be annotated with functional properties such as Enzyme Commission (EC) numbers and Gene Ontology terms. However, only about half a million sequences (UniProtKB/Swiss-Prot) have been reviewed and functionally annotated by expert curators, using information extracted from the published literature and computational analyses. Manual annotation by experts is expensive, slow and insufficient to fill the gap between the annotated and unannotated protein sequences. In this paper, we present an automatic functional annotation technique using neural-network-based word embeddings that exploit the domain and family information of proteins. Domains are the most conserved regions in protein sequences and constitute the building blocks of 3D protein structures. For our experiments, we used fastText, a library for learning word embeddings and text classification developed by Facebook’s AI Research lab. The experimental results show that domain embeddings perform much better than k-mer based word embeddings.

Paper Nr: 57
Title:

The Added Value of OLAP Preferences in What-If Applications

Authors:

Mariana Carvalho and Orlando Belo

Abstract: These days, enterprise managers involved in decision-making processes struggle with several problems related to the market position or business reputation of their companies. Collecting data and retrieving a high-quality set of information is one of the main priorities of enterprise managers involved in decision-making processes. To overcome the difficulties that may arise from market competitiveness and to gain some kind of competitive advantage, it is important that these managers make the most of adequate tools in order to get the right set of highly refined information. What-If analysis can help managers gain the competitive advantage they need: it allows for simulating hypothetical scenarios and analyzing the consequences of specific changes without harming business activities. In this paper, we propose a hybridization methodology, which combines the What-If analysis process with OLAP usage preferences, for optimizing decision processes. We present and discuss the integration of OLAP usage preferences into conventional What-If analysis with a case example.

Paper Nr: 60
Title:

Enhancements for Sliding Window based Stream Classification

Authors:

Engin Maden and Pinar Karagoz

Abstract: In stream mining, there are several limitations on the classification process, since time and resources are limited: the data is read only once, and the whole history of the data cannot be stored. Several methods have been developed so far, such as stream-based adaptations of decision trees, nearest-neighbor methods and neural network classifiers. This paper presents new enhancements to sliding window based classification methods. As the first modification, we use the traditional kNN (k-Nearest Neighbors) method in a sliding window and include the mean of the previous instances as an additional nearest-neighbor instance. By this, we aim to combine the behaviour pattern coming from the past with the current state of the data. We call this method m-kNN (Mean extended kNN). As the second enhancement, we generate an ensemble classifier combining our m-kNN with traditional kNN and a Naive Bayes classifier. We call this method the CSWB (Combined Sliding Window Based) classifier. We report the accuracy of our methods on several datasets in comparison to the state-of-the-art classifiers MC-NN (Micro Cluster Nearest Neighbor) and VHT (Vertical Hoeffding Tree). The results reveal that the proposed method performs better on several datasets and has potential for further improvement.
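A minimal sketch of the m-kNN idea described above: kNN over a sliding window, with the mean of the windowed instances added as a synthetic neighbor. Computing the mean per class, and the toy stream itself, are assumptions for illustration rather than the authors' exact formulation:

```python
# Hypothetical m-kNN sketch: sliding-window kNN plus per-class mean
# instances acting as extra synthetic neighbors.
from collections import Counter, deque
import math

def m_knn_predict(window, x, k=3):
    """window: deque of (features, label) pairs; x: feature list."""
    sums, counts = {}, Counter()
    for feats, label in window:
        counts[label] += 1
        s = sums.setdefault(label, [0.0] * len(feats))
        for i, v in enumerate(feats):
            s[i] += v
    # Per-class means of the windowed instances, added as synthetic neighbors.
    synthetic = [([v / counts[lbl] for v in s], lbl) for lbl, s in sums.items()]
    candidates = list(window) + synthetic
    candidates.sort(key=lambda p: math.dist(p[0], x))
    votes = Counter(lbl for _, lbl in candidates[:k])
    return votes.most_common(1)[0][0]

window = deque(maxlen=6)  # bounded memory, as required in stream mining
stream = [([0.1, 0.2], "a"), ([0.2, 0.1], "a"), ([0.9, 0.8], "b"),
          ([0.8, 0.9], "b"), ([0.15, 0.15], "a")]
for item in stream:
    window.append(item)
print(m_knn_predict(window, [0.0, 0.1]))  # → 'a'
```

The bounded deque captures the single-pass, limited-memory constraint; the synthetic mean neighbors carry a summary of past behaviour into the vote.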

Paper Nr: 63
Title:

Sports Analytics: Maximizing Precision in Predicting MLB Base Hits

Authors:

Pedro Alceo and Roberto Henriques

Abstract: As the world of sports expands to never-before-seen levels, so does the need for tools that provide material advantages for organizations and other stakeholders. The main objective of this paper is to build a predictive model capable of estimating the odds of a baseball player getting a base hit on a given day, with the intention both of winning the game Beat the Streak and of providing valuable information for the coaching staff. Using baseball statistics, weather forecasts and ballpark characteristics, several models were built following the CRISP-DM methodology. The main considerations when building the models were balancing, outliers, dimensionality reduction, variable selection and the type of algorithm: Logistic Regression, Multi-layer Perceptron, Random Forest and Stochastic Gradient Descent. The results obtained were positive; the best model was a Multi-layer Perceptron with an 85% correct pick ratio.

Paper Nr: 73
Title:

A Discretized Enriched Technique to Enhance Machine Learning Performance in Credit Scoring

Authors:

Roberto Saia, Salvatore Carta, Diego R. Recupero, Gianni Fenu and Marco Saia

Abstract: Automated credit scoring tools play a crucial role in many financial environments, since they are able to perform a real-time evaluation of a user (e.g., a loan applicant) on the basis of several solvency criteria, without the aid of human operators. Such automation allows those who work and offer services in the financial area to take quick decisions regarding different services, first and foremost those concerning consumer credit, requests for which have increased exponentially over recent years. In order to face some well-known problems of state-of-the-art credit scoring approaches, this paper formalizes a novel data model that we call Discretized Enriched Data (DED), which operates by transforming the original feature space in order to improve the performance of credit scoring machine learning algorithms. The idea behind the proposed DED model revolves around two processes: the first reduces the number of feature patterns through a data discretization process, and the second enriches the discretized data by adding several meta-features. The data discretization faces the problem of heterogeneity, which characterizes this domain, whereas the data enrichment counters the related loss of information by adding meta-features that improve the data characterization. Our model has been evaluated on real-world datasets with different sizes and levels of data imbalance, which are considered a benchmark in the credit scoring literature. The obtained results indicate that it is able to improve the performance of one of the best-performing machine learning algorithms widely used in this field, opening up new perspectives for the definition of more effective credit scoring solutions.

Paper Nr: 75
Title:

Modeling Concept Drift in the Context of Discrete Bayesian Networks

Authors:

Hatim Alsuwat, Emad Alsuwat, Marco Valtorta, John Rose and Csilla Farkas

Abstract: Concept drift is a significant challenge that greatly influences the accuracy and reliability of machine learning models. There is, therefore, a need to detect concept drift in order to ensure the validity of learned models. In this research, we study the issue of concept drift in the context of discrete Bayesian networks. We propose a probabilistic graphical model framework to explicitly detect the presence of concept drift using latent variables. We employ latent variables to model real concept drift and uncertainty drift over time. For modeling real concept drift, we propose monitoring the mean of the distribution of the latent variable over time. For modeling uncertainty drift, we suggest monitoring the change in beliefs about the latent variable over time, i.e., we monitor the maximum value that the probability density function of the distribution takes over time. We implement our proposed framework and present our empirical results using two of the most commonly used Bayesian networks in Bayesian experiments, namely the Burglary-Earthquake network and the Chest Clinic network.
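The monitoring idea can be illustrated with a toy sketch that flags drift when the mean of a tracked quantity jumps between successive windows. The window size, threshold, and synthetic stream are assumptions; the paper itself monitors the distribution of a latent variable in a Bayesian network rather than raw values:

```python
# Toy mean-monitoring drift detector: compare consecutive window means
# and flag windows whose mean jumped by more than a threshold.
import statistics

def detect_mean_drift(stream, window=50, threshold=0.5):
    """Return indices of windows whose mean jumped vs. the previous one."""
    means = [statistics.fmean(stream[s:s + window])
             for s in range(0, len(stream) - window + 1, window)]
    return [i for i in range(1, len(means))
            if abs(means[i] - means[i - 1]) > threshold]

# Abrupt drift: the underlying mean shifts from 0 to 1 halfway through.
stream = [0.0] * 100 + [1.0] * 100
print(detect_mean_drift(stream))  # the drift is flagged at window 2
```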

Paper Nr: 77
Title:

CupQ: A New Clinical Literature Search Engine

Authors:

Jesse Wang and Henry Kautz

Abstract: A new clinical literature search engine, called CupQ, is presented. It aims to help clinicians stay up to date with medical knowledge. Although PubMed is currently one of the most widely used digital libraries for biomedical information, it frequently does not return clinically relevant results. CupQ utilizes a ranking algorithm that filters out non-medical journals, compares semantic similarity between queries, and incorporates journal impact factor and publication date. It organizes search results into categories useful for medical practitioners: reviews, guidelines, and studies. Qualitative comparisons suggest that CupQ may return more clinically relevant information than PubMed. CupQ is available at https://cupq.io/.

Paper Nr: 78
Title:

Topological Approach for Finding Nearest Neighbor Sequence in Time Series

Authors:

Paolo Avogadro and Matteo A. Dominoni

Abstract: The aim of this work is to obtain a good-quality approximation of the nearest neighbor distance (nnd) profile among sequences of a time series. Knowing the nearest neighbor distance of all the sequences provides useful information regarding, for example, anomalies and clusters of a time series; however, the complexity of this task grows quadratically with the number of sequences, thus limiting its possible applications. We propose here an approximate method which allows one to obtain good-quality nnd profiles 1-2 orders of magnitude faster than the brute-force approach, and which exploits the interdependence of three different topologies of a time series: one induced by the SAX clustering procedure, one induced by the position in time of each sequence, and one by the Euclidean distance. The quality of the approximation has been evaluated with real-life time series, where more than 98% of the nnd values obtained with our approach are exact, and the average relative error for the approximated ones is usually below 10%.
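For reference, the quadratic brute-force baseline that the paper accelerates can be sketched as follows; the subsequence length and toy series are illustrative assumptions:

```python
# Brute-force nnd profile: for each length-m subsequence, the Euclidean
# distance to its closest non-overlapping subsequence. O(n^2) comparisons,
# which is exactly the cost the approximate method avoids.
import math

def nnd_profile(series, m):
    subs = [series[i:i + m] for i in range(len(series) - m + 1)]
    profile = []
    for i, a in enumerate(subs):
        best = math.inf
        for j, b in enumerate(subs):
            if abs(i - j) < m:  # exclude trivial (overlapping) matches
                continue
            best = min(best, math.dist(a, b))
        profile.append(best)
    return profile

series = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 5.0, 0.0, 1.0, 0.0]
profile = nnd_profile(series, 3)
# The subsequence containing the spike (5.0) has the largest nnd,
# which is how the profile exposes anomalies.
print(profile.index(max(profile)))
```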

Short Papers
Paper Nr: 1
Title:

Data Preparation for Fuzzy Modelling using Intervals

Authors:

Arthur Yosef, Moti Schneider, Eli Shnaider, Amos Baranes and Rimona Palas

Abstract: Model-building professionals often face the very difficult choice of selecting the relevant variable(s) from a set of several similar variables. All those variables supposedly represent the same factor but are measured differently. They are based on different methodologies, baselines, conversion/comparability methods, etc., leading to substantial differences in numerical values for essentially the same things. In this study, we introduce a method that utilizes intervals to capture all the relevant variables representing the same factor. First, we discuss the advantages of utilizing intervals of values from the standpoint of reliability and better, more efficient data utilization, as well as a substantial reduction in complexity, and thus an improvement in our ability to interpret the results. In addition, we introduce an interval (range) reduction algorithm, designed to reduce excessively large intervals, thus bringing them closer to their central-tendency cluster. Following the theoretical component, we present a case study. The case study demonstrates the process of converting data into intervals for two broad economic variables (each consisting of several data series) and two broad financial variables. Furthermore, it demonstrates the practical application of the procedures addressed in this study and their effectiveness.

Paper Nr: 4
Title:

Case Study on Model-based Application of Machine Learning using Small CAD Databases for Cost Estimation

Authors:

Stefan Börzel and Jörg Frochte

Abstract: In many industries, development is aimed towards Industry 4.0, which is accompanied by a movement from large quantities towards small quantities of individually adapted products in a multitude of variants. In this scenario, it is essential to be able to quote the price for these small batches quickly and without additional cost to the customer. This is a big challenge in technical applications where the price calculation is generally performed by local experts. From the age of expert systems, one knows how hard it is to build a formalised model based on expert knowledge, so it makes sense to use today’s machine learning techniques. Unfortunately, small batches combined with typically small and midsize production enterprises (SMEs) lead to smaller databases to rely on. This is compounded by data that is often based on 3D data or other sources that initially yield a large number of features. In this paper, we present an approach for such use cases that combines the advantages of model-based approaches with modern machine learning techniques, together with a discussion of feature generation from CAD data and its reduction to a low-dimensional representation of the customer requests.

Paper Nr: 5
Title:

Effective Frequent Motif Discovery for Long Time Series Classification: A Study using Phonocardiogram

Authors:

Hajar Alhijailan and Frans Coenen

Abstract: A mechanism for extracting frequent motifs from long time series is proposed, directed at classifying phonocardiograms. The approach features two preprocessing techniques: silent gap removal and a novel candidate frequent motif discovery mechanism founded on the clustering of time series subsequences. These techniques were combined into one process that extracts discriminative frequent motifs from individual time series and then combines these to identify a global set of discriminative frequent motifs. The proposed approach compares favourably with existing approaches in terms of accuracy and has a significantly improved runtime.

Paper Nr: 6
Title:

Sub-Sequence-Based Dynamic Time Warping

Authors:

Mohammed Alshehri, Frans Coenen and Keith Dures

Abstract: In time series classification, the most commonly used approach is k-Nearest Neighbor classification with k = 1, coupled with Dynamic Time Warping (DTW) similarity checking. A challenge is that the DTW process is computationally expensive. This paper presents a new approach for speeding up the DTW process, Sub-Sequence-Based DTW, which offers the additional benefit of improving accuracy. This paper also presents an analysis of the impact of the Sub-Sequence-Based method in terms of efficiency and effectiveness, in comparison with standard DTW and the Sakoe-Chiba band technique.
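For context, standard DTW with the optional Sakoe-Chiba band constraint mentioned above can be sketched as follows; this shows the baselines being compared against, not the proposed Sub-Sequence-Based method:

```python
# Standard dynamic-programming DTW distance between two 1-D series,
# with an optional Sakoe-Chiba band limiting |i - j|.
import math

def dtw(a, b, band=None):
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # cell outside the band stays infeasible
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, shifted in time
print(dtw(a, b))          # → 0.0: warping absorbs the time shift
print(dtw(a, b, band=1))  # → 0.0: the band still admits the optimal path
```

The O(nm) table fill is the expense the abstract refers to; the band reduces it at the risk of excluding the optimal warping path.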

Paper Nr: 9
Title:

Information Retrieval in a Concept Lattice by using Uncertain Logical Gates

Authors:

Guillaume Petiot

Abstract: Formal Concept Analysis (FCA) is an approach to data mining that consists of extracting formal concepts in order to provide a hierarchy of concepts, also called a concept lattice. It is useful for understanding data. A formal concept is a set of objects that share the same properties. When the number of formal concepts is too high, it is difficult to explore all of them in order to look for information. Using a query to extract relevant information is a solution to this problem. A logical combination of Boolean criteria, which can be represented by a logical circuit, can serve as the condition of the query. In an uncertain formal context, we are not sure whether an object possesses a property. As a consequence, we must take these uncertainties into account in the computation of formal concepts and in queries. In this paper, we propose to use possibility theory to handle these uncertainties. As a result, we compute a necessity degree for each formal concept, and the condition of a query can be computed using possibilistic networks and uncertain logical gates. Finally, we illustrate our approach with the analysis of a satisfaction questionnaire for a bachelor's-level course.

Paper Nr: 15
Title:

An Efficient, Robust, and Customizable Information Extraction and Pre-processing Pipeline for Electronic Health Records

Authors:

Eva K. Lee, Yuanbo Wang, Yuntian He and Brent M. Egan

Abstract: Electronic Health Records (EHR) containing large amounts of patient data present both opportunities and challenges to industry, policy makers, and researchers. These data, when extracted and analyzed effectively, can reveal critical factors that improve clinical practices and decisions. However, the inherently complex, heterogeneous and rapidly evolving nature of these data makes them extremely difficult to analyze effectively. In addition, Protected Health Information (PHI), containing sensitive yet valuable information for clinical research, must first be anonymized. In this paper, we identify current challenges in obtaining and pre-processing information from EHR. We then present a comprehensive, efficient “pipeline” for extracting, de-identifying, and standardizing EHR data. We demonstrate the use of this pipeline, based on software from EPIC Systems, in analysing chronic kidney disease, prostate cancer, and cardiovascular disease. We also address challenges associated with temporal laboratory time series data and natural text data, and develop a novel approach for clustering irregular Multivariate Time Series (MTS). The pipeline organizes data into a structured, machine-readable format which can be effectively applied in clinical research studies to optimize processes, personalize care, and improve quality and outcomes.

Paper Nr: 20
Title:

Quality of Wikipedia Articles: Analyzing Features and Building a Ground Truth for Supervised Classification

Authors:

Elias Bassani and Marco Viviani

Abstract: Wikipedia is nowadays one of the biggest online resources on which users rely as a source of information. The amount of collaboratively generated content that is submitted to the online encyclopedia every day can lead to the creation of low-quality articles (and, consequently, misinformation) if it is not properly monitored and revised. For this reason, in this paper, the problem of automatically assessing the quality of Wikipedia articles is considered. In particular, the focus is (i) on the analysis of groups of hand-crafted features that can be employed by supervised machine learning techniques to classify Wikipedia articles on qualitative bases, and (ii) on the analysis of some issues behind the construction of a suitable ground truth. Evaluations are performed on the analyzed features and on a specifically built labeled dataset by implementing different supervised classifiers based on distinct machine learning algorithms, which produced promising results.

Paper Nr: 32
Title:

Social Tracks: Recommender System for Multiple Individuals using Social Influence

Authors:

Lesly G. Camacho, João K. Faria, Solange N. Alves-Souza and Lucia L. Filgueiras

Abstract: The amount of data generated through interactions within a social network, or with a platform's resources (e.g., clicks, hits, purchases), grows exponentially over time. The popularization of social networks and the increase in interactions allow data to be analyzed to predict the tastes and desires of consumers. The use of recommendation systems to filter content based on the characteristics and tastes of a single user is already widespread and applied across platforms. However, the application of recommendation systems to multiple individuals is a less explored field. For this project, data was gathered from social networks to recommend music playlists to a group of individuals. Listening to music as a group is a common activity, whether with friends, as a couple, or at parties. Social network data are used to identify the social influence of the individuals in the group. In addition, to identify preferences, the characteristics of the songs most frequently heard by the members of the group are assembled. Matrix factorization is used to predict group interests, and a proposed influence factor, based on similarity, leadership, and expertise, is added to compute the final recommendation. A social network was created to support a controlled experiment; the results show that the system's predictions deviate by 1.455 from the ratings given by the group's members.
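The abstract's pipeline (matrix factorization for individual predictions, then influence-weighted aggregation for the group) can be sketched as below. The factorization routine, latent dimension, and influence weights are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Factor the user-item rating matrix R (NaN = unrated) into P @ Q.T
    by gradient descent on the observed cells only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    mask = ~np.isnan(R)
    for _ in range(steps):
        E = np.where(mask, np.nan_to_num(R) - P @ Q.T, 0.0)  # error on observed cells
        P += lr * (E @ Q - reg * P)
        Q += lr * (E.T @ P - reg * Q)
    return P, Q

def group_scores(P, Q, members, influence):
    """Aggregate member predictions, weighted by a social-influence factor."""
    preds = P[members] @ Q.T                      # per-member predicted ratings
    w = np.asarray(influence, dtype=float)
    return (w[:, None] * preds).sum(axis=0) / w.sum()

# Toy ratings: 4 users x 5 songs, NaN where unheard.
R = np.array([[5, 4, np.nan, 1, np.nan],
              [4, np.nan, 4, 1, 2],
              [1, 1, np.nan, 5, 4],
              [np.nan, 1, 2, 4, 5]])
P, Q = factorize(R)
scores = group_scores(P, Q, members=[0, 1], influence=[0.7, 0.3])
print(scores.argmax())  # index of the top-recommended song for this group
```

In this sketch the influence weights are given; in the paper they are derived from similarity, leadership, and expertise signals mined from the social network.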

Paper Nr: 33
Title:

Multimodal Ranked Search over Integrated Repository of Radiology Data Sources

Authors:

Priya Deshpande, Alexander Rasin, Fang Cao, Sriram Yarlagadda, Eli Brown, Jacob Furst and Daniela S. Raicu

Abstract: Radiology teaching files serve as a reference in the diagnosis process and as a learning resource for radiology residents. Many public teaching file data sources are available online and private in-house repositories are maintained in most hospitals. However, the native interfaces for querying public repositories have limited capabilities. The Integrated Radiology Image Search (IRIS) Engine was designed to combine public data sources and in-house teaching files into a single resource. In this paper, we present and evaluate ranking strategies that prioritize the most relevant teaching files for a query. We quantify query context through a weighted text-based search and with ontology integration. We also incorporate an image-based search that allows finding visually similar teaching files. Finally, we augment text-based search results with image-based search – a hybrid approach that further improves search result relevance. We demonstrate that this novel approach to searching radiology data produces promising results by evaluating it with an expert panel of reviewers and by comparing our search performance against other publicly available search engines.

Paper Nr: 37
Title:

Measuring Context Change to Detect Statements Violating the Overton Window

Authors:

Christian Kahmann and Gerhard Heyer

Abstract: The so-called Overton window describes the phenomenon that political discourse takes place in a narrow window of terms that reflect the public consensus of acceptable opinions on some topic. In this paper we present a novel NLP approach to identify statements in a collection of newspaper articles that shift the borders of the Overton window at some period of time, and apply it to German newspaper texts, detecting extreme statements about the refugee crisis in Germany.

Paper Nr: 38
Title:

Comparison of Querying Performance of Neo4j on Graph and Hyper-graph Data Model

Authors:

Mert Erdemir, Furkan Goz, Alev Mutlu and Pinar Karagoz

Abstract: Graph databases are gaining wide use as they provide flexible mechanisms to model real-world entities and the relationships among them. In the literature, several studies evaluate the performance of graph databases and graph database query languages. However, there is limited work on comparing graph database querying performance under different graph representation models. In this study, we focus on two graph representation models, ordinary graphs vs. hyper-graphs, and investigate the querying performance of Neo4j for various query types under each model. The analysis, conducted on a benchmark data set, reveals which types of queries perform better under each representation.

Paper Nr: 42
Title:

Discussion-skill Analytics with Acoustic, Linguistic and Psychophysiological Data

Authors:

Katashi Nagao, Kosuke Okamoto, Shimeng Peng and Shigeki Ohira

Abstract: In this paper, we propose a system for improving the discussion skills of participants in a meeting by automatically evaluating statements in the meeting and effectively feeding back the results of the evaluation to them. To evaluate the skills automatically, the system uses both acoustic features and linguistic features of statements. It evaluates the way a person speaks, such as their “voice size,” on the basis of the acoustic features, and it also evaluates the contents of a statement, such as the “consistency of context,” on the basis of linguistic features. These features can be obtained from meeting minutes. Since it is difficult to evaluate the semantic contents of statements such as the “consistency of context,” we build a machine learning model that uses the features of minutes such as speaker attributes and the relationship of statements. In addition, we argue that participants’ heart rate (HR) data can be used to effectively evaluate their cognitive performance, specifically the performance in a discussion that consists of several Q&A segments (question-and-answer pairs). We collect HR data during a discussion in real time and generate machine-learning models for evaluation. We confirmed that the proposed method is effective for evaluating the discussion skills of meeting participants.

Paper Nr: 46
Title:

User Experience-based Information Retrieval from Semistar Data Ontologies

Authors:

Edgars Rencis

Abstract: The time necessary for medical knowledge to double is rapidly decreasing. In such circumstances, it is of utmost importance for the information retrieval process to be rapid, convenient, and straightforward. However, it often lacks at least one of these properties. Several obstacles prevent domain experts from extracting knowledge from their databases without involving a third party in the form of IT professionals. The main limitation is usually the complexity of querying languages and tools. This paper proposes using a keyword-based natural language for querying the database, together with a system that automatically translates such queries into an existing target language that has an efficient implementation over the database. The querying process is based on data conforming to a Semistar data ontology, which has proven to be a very easily perceived data structure for domain experts. Over time, the system can learn from user actions, making the translation more accurate and the querying more straightforward.

Paper Nr: 47
Title:

Evaluation of Couchbase, CouchDB and MongoDB using OSSpal

Authors:

André Calçada and Jorge Bernardino

Abstract: Document stores are a NoSQL (Not Only Structured Query Language) database type designed to handle large amounts of data. They are considered very developer-friendly and have also attracted considerable interest from researchers and practitioners. In this paper, we analyze and evaluate three of the most popular document stores: Couchbase, CouchDB, and MongoDB. The evaluation is based on their functionalities and uses the OSSpal methodology, which combines quantitative and qualitative measures to evaluate software. We conclude that, of these open-source document stores, the best is MongoDB, followed by Couchbase and CouchDB.

Paper Nr: 48
Title:

Data Mining Techniques for Early Detection of Breast Cancer

Authors:

Maria I. Cruz and Jorge Bernardino

Abstract: Nowadays, millions of people around the world are living with a diagnosis of cancer, so it is very important to investigate forms of detection and prevention of this disease. In this paper, we use an ensemble technique with several data mining algorithms applied to a breast cancer diagnosis dataset based on biological markers found in routine blood tests. From the results obtained, the model achieved an AUC of 95% and a precision of 87%. Thus, this model makes it possible to create new screening tools to assist doctors and prevent healthy patients from having to undergo invasive examinations.
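As a toy illustration of the ensemble idea, the sketch below soft-votes two simple classifiers on synthetic data and reports a rank-based AUC. The feature generator, the two base learners, and the data are assumptions for illustration only, not the paper's dataset or algorithms:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400
X = rng.normal(size=(n, 3))                      # three synthetic "blood markers"
y = (X @ np.array([1.2, 0.8, 1.5]) + rng.normal(scale=0.8, size=n) > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

def fit_logistic(X, y, steps=2000, lr=0.1):
    """Plain gradient-descent logistic regression; returns a probability scorer."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return lambda Z: 1 / (1 + np.exp(-(Z @ w + b)))

def fit_centroid(X, y):
    """Nearest-centroid scorer: closer to the class-1 centroid -> higher score."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    def score(Z):
        d0 = np.linalg.norm(Z - c0, axis=1)
        d1 = np.linalg.norm(Z - c1, axis=1)
        return d0 / (d0 + d1)
    return score

models = [fit_logistic(X_tr, y_tr), fit_centroid(X_tr, y_tr)]
proba = np.mean([m(X_te) for m in models], axis=0)   # soft vote: average the scores

def auc_score(y, s):
    """Rank-based (Mann-Whitney) AUC."""
    order = np.argsort(s)
    ranks = np.empty(len(s)); ranks[order] = np.arange(1, len(s) + 1)
    n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

auc = auc_score(y_te, proba)
print(f"ensemble AUC on held-out data: {auc:.2f}")
```

Averaging the scores of diverse base learners is the same soft-voting principle used by standard ensemble libraries; on real clinical data, calibrated probabilities and cross-validation would be essential.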

Paper Nr: 51
Title:

Sentiment Analysis of Web Trends for the Antisocial Behaviour Detection

Authors:

Kristína Machová and Ján Birka

Abstract: The paper presents an approach to extracting current web trends for research into the automated recognition of antisocial behaviour in online discussions. Antisocial behaviour is a drawback of online discussions, as compared to their advantages such as the wisdom of crowds and collective intelligence. The first step towards recognizing antisocial behaviour is identifying the web trends connected with it. These are studied under dynamic conditions using sentiment analysis as a webometric. A new lexicon-based sentiment analysis method was developed, and two modifications of it were designed and tested, involving NLP (natural language processing) and an original technique for processing negations and intensifications. The most effective sentiment classification method was used for the extraction of web trends. The extracted web trends were analysed in a dynamic way, and the findings of this analysis were compared to known historical events.

Paper Nr: 52
Title:

Deep Semantic Feature Detection from Multispectral Satellite Images

Authors:

Hanen Balti, Nedra Mellouli, Imen Chebbi, Imed R. Farah and Myriam Lamolle

Abstract: Recent progress in satellite technology has resulted in explosive growth in the volume and quality of high-resolution remote sensing images. To address the issues of retrieving high-resolution remote sensing (RS) data both efficiently and precisely, this paper proposes a distributed system architecture for object detection in satellite images using a fully connected neural network. On the one hand, to address the issues of high computational complexity and storage requirements, the Hadoop framework is used to handle satellite image data with a parallel architecture. On the other hand, deep semantic features are extracted using a Convolutional Neural Network (CNN) in order to identify objects and accurately locate them. Experiments are carried out on several datasets to analyze the efficiency of the suggested distributed system. Experimental results indicate that our system architecture is simple and sustainable, and that both its efficiency and precision can satisfy realistic requirements.

Paper Nr: 59
Title:

Discovering the Geometry of Narratives and their Embedded Storylines

Authors:

Eduard Hoenkamp

Abstract: Many of us struggle to keep up with fast-evolving news stories, viral tweets, or e-mails demanding our attention. Previous studies have tried to contain such overload by reducing the amount of information reaching us, making it easier to cope with the information that does reach us, or helping us decide what to do with the information once delivered. Instead, the approach presented here mitigates the overload by uncovering and presenting only the information that is worth looking at. We posit that the latter is encapsulated in an underlying storyline that obeys several intuitive cognitive constraints. The paper assesses the efficacy of the two main paradigms of Information Retrieval, the document space model and language modeling, in how well each captures the intuitive idea of a storyline, seen as a stream of topics. The paper formally defines topics as high-dimensional but sparse elements of a (Grassmann) manifold, and a storyline as a trajectory connecting these elements. We show how geometric optimization can isolate the storyline from a stationary, low-dimensional story background. The approach is effective and efficient in producing a compact representation of the information stream, to be subsequently conveyed to the end-user.

Paper Nr: 61
Title:

Preventing Failures by Predicting Students’ Grades through an Analysis of Logged Data of Online Interactions

Authors:

Bruno Cabral and Álvaro Figueira

Abstract: Nowadays, students commonly use, and are assessed through, an online platform. New pedagogical theories that promote the active participation of students in the learning process, and the systematic use of problem-based learning, are being adopted using eLearning systems for that purpose. However, although these activities can generate intense feedback for students, it is usually restricted to the assessment of the online set of tasks. We propose a model that informs students of abnormal deviations from a “correct” learning path. Our approach is based on the vision that obtaining this information earlier in the semester may give students and educators an opportunity to resolve an eventual problem with the student’s current online actions in the course. In the major learning management systems available, the interaction between the students and the system is stored in logs. Our proposal uses that logged information, together with new information computed by our methodology, such as the time each student spends on an activity and the number and order of resources used, to build a table from which a machine learning algorithm can learn. Results show that our model can predict failing situations with more than 86% accuracy.

Paper Nr: 62
Title:

Association and Temporality between News and Tweets

Authors:

Vânia Moutinho, Pavel Brazdil and João Cordeiro

Abstract: With the advent of social media, the boundaries between mainstream journalism and social networks are becoming blurred. User-generated content is increasing, and hence journalists dedicate considerable time to searching platforms such as Facebook and Twitter to announce, spread, and monitor news and to crowd-check information. Many studies have looked at social networks as news sources, but the relationship and interconnections between this type of platform and the news media have not been thoroughly investigated. In this work, we have studied a series of news articles and examined a set of related comments on a social network over a period of six months. Specifically, a sample of articles from generalist Portuguese news sources published in the first semester of 2016 was clustered, and the resulting clusters were then associated with tweets of Portuguese users using a similarity measure. Focusing on a subset of clusters, we performed a temporal analysis by examining the evolution of the two types of documents (articles and tweets) and the timing of their appearance. It appears that for some stories, namely Brexit and the European Football Cup, the publishing of news articles intensifies on key dates (event-oriented), while the discussion on social media is more balanced throughout the months leading up to those events.

Paper Nr: 66
Title:

Process Profiling based Synthetic Event Log Generation

Authors:

Eren Esgin and Pinar Karagoz

Abstract: The goal of process mining is to discover process behavior from the runtime information of process executions. Having labeled process logs is crucial for process mining research. However, real-life event logs in process-aware information systems are mostly only partially assigned to case identifiers, known as the unlabeled event log problem. As a remedy to the need for labeled data in process mining research, we propose an approach to generate synthetic event logs according to a provided process profile, which outlines the activity vocabulary and structure of the corresponding business process. We evaluate the performance of our prototypical implementation in terms of compatible log generation under parameter settings of varying complexity.
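A minimal sketch of profile-driven log generation, assuming a toy profile format (an activity vocabulary with allowed successors); the paper's actual profile notation and parameter settings are richer than this:

```python
import random

# Hypothetical process profile: activity -> possible next activities
# (an empty list marks the end of a case).
PROFILE = {
    "register": ["check", "triage"],
    "check":    ["decide"],
    "triage":   ["decide"],
    "decide":   [],
}

def generate_log(n_cases, start="register", seed=7):
    """Generate a labeled event log: every event carries its case identifier,
    which is exactly what real unlabeled logs lack."""
    rng = random.Random(seed)
    log = []
    for case_id in range(n_cases):
        activity = start
        while True:
            log.append((case_id, activity))
            successors = PROFILE[activity]
            if not successors:
                break
            activity = rng.choice(successors)
    return log

log = generate_log(3)
for case_id, activity in log:
    print(case_id, activity)
```

Because each synthetic event is born with its case identifier, such logs provide the ground truth needed to evaluate labeling and discovery algorithms.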

Paper Nr: 68
Title:

Representational Capacity of Deep Neural Networks: A Computing Study

Authors:

Bernhard Bermeitinger, Tomas Hrycej and Siegfried Handschuh

Abstract: There is some theoretical evidence that deep neural networks with multiple hidden layers have a potential for more efficient representation of multidimensional mappings than shallow networks with a single hidden layer. The question is whether it is possible to exploit this theoretical advantage for finding such representations with help of numerical training methods. Tests using prototypical problems with a known mean square minimum did not confirm this hypothesis. Minima found with the help of deep networks have always been worse than those found using shallow networks. This does not directly contradict the theoretical findings—it is possible that the superior representational capacity of deep networks is genuine while finding the mean square minimum of such deep networks is a substantially harder problem than with shallow ones.

Paper Nr: 72
Title:

A New Temporal Recommendation System based on Users’ Similarity Prediction

Authors:

Nima Joorabloo, Mahdi Jalili and Yongli Ren

Abstract: Recommender systems have significant applications in both industry and academia. Neighbourhood-based collaborative filtering methods are the most widely used recommenders in industrial applications. These algorithms utilize the preferences of similar users to provide suggestions for a target user. Users’ preferences often vary over time, and many traditional collaborative filtering algorithms fail to consider this important issue. In this paper, a novel recommendation method is proposed, based on predicting the similarity between users in the future and forecasting their similarity trends over time. The proposed method uses the sequence of users’ ratings to predict the similarities between users in the future, and uses the predicted similarities instead of the original ones to detect users’ neighbours. Experimental results on benchmark datasets show that the proposed method significantly outperforms classical and state-of-the-art recommendation methods.
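One way to realize a "predicted similarity" is to compute the pair's similarity in successive time windows, fit a trend, and extrapolate one step. The cosine measure and the linear-trend forecaster below are illustrative assumptions, not the authors' exact model:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predicted_similarity(windows_u, windows_v):
    """Fit a linear trend to per-window similarities and extrapolate one step."""
    sims = [cosine(u, v) for u, v in zip(windows_u, windows_v)]
    t = np.arange(len(sims))
    slope, intercept = np.polyfit(t, sims, 1)
    return slope * len(sims) + intercept  # similarity forecast for the next window

# Ratings of two users on 4 items over three successive time windows:
# their tastes are drifting apart, so the forecast should fall below
# the latest observed similarity.
u = [np.array([5, 4, 1, 1]), np.array([5, 3, 2, 1]), np.array([5, 2, 3, 1])]
v = [np.array([5, 4, 1, 2]), np.array([4, 2, 3, 3]), np.array([2, 1, 5, 4])]
latest = cosine(u[-1], v[-1])
forecast = predicted_similarity(u, v)
print(forecast < latest)  # diverging trend -> forecast below current similarity
```

Using the forecast value instead of the latest observed similarity when selecting neighbours is the substitution the abstract describes.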

Paper Nr: 8
Title:

Towards Machine Comprehension of Arabic Text

Authors:

Ahmad M. Eid, Nagwa El-Makky and Khaled Nagi

Abstract: Machine Comprehension (MC) is a novel task in the question answering (QA) discipline. MC tests the ability of a machine to read a text and comprehend its meaning. Deep learning for MC makes it possible to build an end-to-end paradigm based on new neural networks that directly compute the deep semantic matching among a question, the answers, and the corresponding passage, and it gives state-of-the-art performance for English MC. The MC problem has not yet been addressed for the Arabic language due to the lack of Arabic MC datasets. This paper presents the first Arabic MC dataset, which results from translating the SQuAD v1.1 dataset and applying a proposed approach that combines partial translation post-editing and semi-supervised learning. We intend to make this dataset publicly available to the research community. Furthermore, we use the resulting dataset to build end-to-end deep learning Arabic MC models, which show promising results.

Paper Nr: 10
Title:

Sustainable Development Goal Attainment Prediction: A Hierarchical Framework using Time Series Modelling

Authors:

Yassir Alharbi, Daniel Arribas-Be and Frans Coenen

Abstract: A framework is presented that can be used to forecast whether an individual geographic area will meet its UN Sustainable Development Goals (SDGs), or not, at some time t. The framework comprises a bottom-up hierarchical classification system in which the leaf nodes hold forecast models and the intermediate nodes and root node hold “logical and” operators. Features of the framework include the automated generation of the associated taxonomy, of the threshold values with which leaf node prediction values are compared, and of the individual forecast models. The evaluation demonstrates that the proposed framework can be successfully employed to predict whether individual geographic areas will meet their SDGs.

Paper Nr: 13
Title:

Causal Learning to Discover Supply Chain Vulnerability

Authors:

Ying Zhao, Jacob Jones and Douglas MacKinnon

Abstract: This paper illustrates a methodology of causal learning using pair-wise associations discovered from data. Taking advantage of a U.S. Department of Defense supply chain use case, this causal learning approach was substantiated and demonstrated in the application of discovering supply chain vulnerabilities. By integrating lexical link analysis, a data mining tool used to discover relationships among specific vocabularies or lexical terms, with pair-wise causal learning, supply chain vulnerabilities were recognized. Evaluation of the results from this methodology reveals supply chain opportunities while exposing weaknesses, supporting the development of a more responsive and efficient supply chain system.

Paper Nr: 18
Title:

A Discretized Extended Feature Space (DEFS) Model to Improve the Anomaly Detection Performance in Network Intrusion Detection Systems

Authors:

Roberto Saia, Salvatore Carta, Diego R. Recupero, Gianni Fenu and Maria M. Stanciu

Abstract: The unbreakable bond that exists today between devices and network connections makes the security of the latter a crucial element for our society. For this reason, in recent decades we have witnessed an exponential growth in research efforts aimed at identifying increasingly efficient techniques able to tackle this type of problem, such as the Intrusion Detection System (IDS). If on the one hand an IDS plays a key role, since it is designed to classify network events as normal or intrusion, on the other hand it has to face several well-known problems that reduce its effectiveness. The most important of them is the high number of false positives related to its inability to detect event patterns that have not occurred in the past (i.e., zero-day attacks). This paper introduces a novel Discretized Extended Feature Space (DEFS) model that presents a twofold advantage: first, through a discretization process it reduces the number of event patterns by grouping those that are similar in terms of feature values, reducing the issues related to the classification of unknown events; second, it balances this discretization by extending the event patterns with a series of meta-information able to characterize them well. The approach has been evaluated using a real-world dataset (NSL-KDD) and adopting both the in-sample/out-of-sample and time series cross-validation strategies in order to avoid biasing the evaluation through over-fitting. The experimental results show how the proposed DEFS model is able to improve the classification performance in the most challenging scenarios (unbalanced samples), compared to canonical state-of-the-art solutions.
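The discretization idea (grouping event patterns with similar feature values) can be sketched as equal-width binning. The bin count, features, and values below are illustrative, and the DEFS meta-information extension is omitted:

```python
import numpy as np

def discretize(X, n_bins=4):
    """Map each feature column of X onto integer bin labels in [0, n_bins-1]
    using equal-width bins, so nearby feature values collapse to one pattern."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / n_bins, 1.0)  # guard constant columns
    return np.clip(((X - lo) // width).astype(int), 0, n_bins - 1)

# Four network events described by (duration, bytes sent); events 0 and 1
# are near-duplicates and should fall into the same discretized pattern.
X = np.array([[0.10, 1200.0],
              [0.12, 1250.0],
              [3.90, 90000.0],
              [7.80, 250.0]])
patterns = discretize(X)
print(np.array_equal(patterns[0], patterns[1]))  # similar events share a pattern
```

Collapsing raw feature vectors into such pattern labels is what lets a classifier generalize to unseen events that fall into an already-known bin combination.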

Paper Nr: 19
Title:

A Methodological Framework for Dictionary and Rule-based Text Classification

Authors:

Jennifer Abel and Birger Lantow

Abstract: Recent research on dictionary- and rule-based text classification either concentrates on improving classification quality for standard tasks like sentiment mining or describes applications to a specific domain, with the focus mainly on the underlying algorithmic approach. This work, in contrast, provides a general methodological approach to dictionary- and rule-based text classification based on a systematic literature analysis. The result is a process description that enables the application of these technologies to specific problems by providing guidance through the major decision points, from the definition of the classification goals to the actual classification of texts.

Paper Nr: 28
Title:

Comparison Method of Long-term Daily Life Considering the Manner of Spending a Day

Authors:

Takahiko Shintani, Tadashi Ohmori and Hideyuki Fujita

Abstract: Recently, large amounts of physical activity data have been collected as lifelogs via wearable sensors. The long-term daily lives of users can be understood using such long-term activity data. In this paper, we investigate a method for comparing two distinct periods of a user’s daily life in order to understand its long-term characteristics. Our method uses only activity data that can be collected easily and continuously over the long term using wearable sensors. There are various ways in which humans can spend a day, and a period of daily life consists of a set of days spent in several manners. We compare two periods of daily life by considering the manner in which each day is spent, which can be distinguished by the activities performed on that day. The amount of movement differs depending on the activity, and similar amounts of movement are measured when similar activities are performed; we exploit this observation to classify how each day is spent. Further, we distinguish the manner in which a day is spent based on similarities in the time series data of activity levels, taking into account the main sleeping period, which is an important behavior in daily human life. We propose a method to compare two distinct periods of daily life based on the distribution of the manners in which days are spent in each period. The effectiveness of the proposed methods is evaluated experimentally using real datasets.

Paper Nr: 31
Title:

Deep Learning Analysis for Big Remote Sensing Image Classification

Authors:

Imen Chebbi, Nedra Mellouli, Myriam lamolle and Imed R. Farah

Abstract: Remote sensing big data has various special characteristics, including multi-source, multi-scale, large-scale, dynamic, and non-linear characteristics. Data collections are so large and complex that it becomes difficult to process them using available database management tools or traditional data processing applications. In addition, traditional data processing techniques have various limitations in processing massive volumes of data, as the analysis of big data requires sophisticated algorithms based on machine learning and deep learning to process the data in real time with great accuracy and efficiency. Deep learning methods are therefore used in various domains such as speech recognition, image classification, and language processing, and recent research has merged different deep learning techniques with hybrid learning-training mechanisms to process data at high speed. In this paper we propose a hybrid approach for RS image classification combining a deep learning algorithm and an explanatory classification algorithm, and we show how deep learning techniques can benefit big remote sensing. Through deep learning, we seek to extract relevant features from images via a DL architecture. These features are then the entry points for the MLlib classification algorithm, which learns the correlations that may exist between features and classes. The architecture combines Spark RDD image coding to consider an image’s local regions, pre-trained VGGNet and U-Net for image segmentation, and Spark machine learning methods such as Random Forest and KNN for the labeling task.

Paper Nr: 34
Title:

Multivariate Time Series Forecasting with Deep Learning Proceedings in Energy Consumption

Authors:

Nédra Mellouli, Mahdjouba Akerma, Minh Hoang, Denis Leducq and Anthony Delahaye

Abstract: We propose to study the dynamic behavior of indoor temperature and energy consumption in a cold room during demand response periods. Demand response is a method that consists of smoothing demand over time, seeking to reduce or even stop consumption during periods of high demand in order to shift it to periods of lower demand. Such a system can therefore be tackled as the study of a time series, where each behavioral parameter is a time-varying parameter. Different network topologies are considered, as well as existing approaches for solving multi-step-ahead prediction problems. The predictive performance of short-term predictors is also examined with regard to the prediction horizon. The performance of the predictors is evaluated using measured data from real-scale buildings, showing promising results for the development of accurate prediction tools.

Paper Nr: 40
Title:

Visual Analysis of Architectural Heritage: The Interior Décor of the Domus of Roman Tunisia

Authors:

Aida Hermi-Nasr, Najla Allani and Jean-Yves Blaise

Abstract: This paper proposes a new approach intended to renew the means of studying the Domus of Roman Tunisia. We chose thirty Roman houses, built from 146 B.C. to 439 A.D. and spread over 19 Tunisian cities. The paper is based on an approach called information modelling, situated at the interface of architectural modelling and information science. The study is built on a numerical implementation that allows us to test several methods of analysis on the group of gathered cases. Following an inventory of these houses, it proceeds to a comparative analysis focusing on mechanisms of visual comparison between the Domus. On the one hand, the study tries to structure and save large volumes of data that are generally heterogeneous, doubtful, incomplete, and sometimes contradictory. On the other hand, it attempts to preserve the history of the architectural evolutions. Applying this method to the suggested cases makes it possible to focus on regularities and on individual and collective evolutions, to emphasize the convergences and divergences between the edifices and the periods, and, finally, to improve the exchange of knowledge between experts.

Paper Nr: 49
Title:

Text Mining in Hotel Reviews: Impact of Words Restriction in Text Classification

Authors:

Diogo Campos, Rodrigo R. Silva and Jorge Bernardino

Abstract: Text Mining is the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents. Hotels use reviews to verify client satisfaction with their services and facilities. However, such large amounts of unstructured data cannot be handled manually, so we use OLAP techniques and a Text Cube to model and manage the text data. We must then separate the reviews into two classes, positive and negative, for which we use Sentiment Analysis. Nevertheless, do we really need all the words of a review to make the right classification? In this paper, we study the impact of word restriction on text classification. To do so, we create word domains (words that belong to the hotel domain). First, we use an algorithm that pre-processes the text, using the created domains as stop words. In the experimental evaluation, we use four classifiers: Naïve Bayes, Decision Tree, Random Forest, and Support Vector Machine.

Paper Nr: 54
Title:

Open Source Business Intelligence Tools: Metabase and Redash

Authors:

Bruno Santos, Francisco Sério, Steven Abrantes, Filipe Sá, Jorge Loureiro, Cristina Wanzeler and Pedro Martins

Abstract: This article explores the capabilities of Business Intelligence (BI) tools, primarily their ability to analyze business data gathered from a company. Corporations can improve (or even create) their products according to the insights provided by these platforms, potentially outclassing their direct competitors, something that can prove crucial in an ever-evolving market. In this article, we test and compare two of the most promising open-source BI platforms currently available: Metabase and Redash. Our focus is to analyze what they offer as a package against a set of key points, such as overall performance, search engine compatibility, and key features. It should be noted that the choice of BI platform may vary according to company demands: a tool that suits one corporation may not be the best choice for a different entity.

Paper Nr: 56
Title:

Sentiment Analysis for Arabizi: Application to Algerian Dialect

Authors:

Asma Chader, Dihia Lanasri, Leila Hamdad, Mohamed E. Belkheir and Wassim Hennoune

Abstract: Sentiment Analysis and its applications have spread to many languages and domains. With regard to Arabic and its dialects, we witness an increasing interest concurrent with the growing volume of Arabic text in social media. However, the Algerian dialect has received little attention, and even less in Latin script (Arabizi). In this paper, we propose a supervised approach for sentiment analysis of the Arabizi Algerian dialect using different classifiers, such as Naive Bayes and Support Vector Machines. We investigate the impact of several preprocessing techniques that deal with dialect-specific aspects. Experimental evaluation on three manually annotated datasets shows promising performance, with the approach yielding the highest classification accuracy using the SVM algorithm. Moreover, our results emphasize the positive impact of the proposed preprocessing techniques: adding vowel removal and transliteration, to overcome phonetic and orthographic varieties, lifted the F-score of the SVM from 76% to 87%, which is considerable.
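
The kind of normalization this abstract describes can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the digit-to-letter map below is a hypothetical subset of common Arabizi substitutions, and the exact rule set in the paper may differ.

```python
# Hypothetical subset of common Arabizi digit-for-letter substitutions;
# the paper's actual transliteration table is not given in the abstract.
DIGIT_MAP = {"3": "a", "7": "h", "9": "q", "5": "kh"}

def normalize_arabizi(text):
    """Transliterate digit substitutions, then drop Latin vowels so that
    phonetic/orthographic variants (e.g. 'mli7' vs 'mlih') collapse to
    the same token."""
    text = "".join(DIGIT_MAP.get(ch, ch) for ch in text.lower())
    return "".join(ch for ch in text if ch not in "aeiou")
```

For example, both spellings of the word for "good", `mli7` and `mlih`, reduce to the same form, which is the effect the vowel-removal and transliteration steps aim for.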

Paper Nr: 64
Title:

CoSky: A Practical Method for Ranking Skylines in Databases

Authors:

Hana Alouaoui, Lotfi Lakhal, Rosine Cicchetti and Alain Casali

Abstract: Discovering skylines in databases has been actively studied to effectively identify optimal tuples/objects with respect to a set of designated preference attributes. Few approaches have been proposed for ranking the skylines to address the high cardinality of the result set. The most recent skyline-ranking approach is dp-idp (dominance power-inverse dominance power), which extensively uses the Pareto-dominance relation to determine the score of each skyline; dp-idp is in the very same spirit as the tf-idf weighting scheme from Information Retrieval. In this paper, we first enrich dp-idp with a dominance hierarchy to facilitate the determination of skyline scores. We then propose the CoSky method (Cosine Skylines) for fast skyline ranking in databases without computing the Pareto-dominance relation. CoSky is a TOPSIS-like method (Technique for Order of Preference by Similarity to Ideal Solution) resulting from cross-fertilization between the fields of Information Retrieval, Multiple Criteria Decision Analysis, and Databases. Its innovative features are principally the automatic weighting of the normalized attributes based on the Gini index, the scoring of each skyline using Salton's cosine of the angle between each skyline object and the ideal object, and its direct implementation in any RDBMS without further structures. Finally, we propose DeepSky, a multilevel skyline algorithm based on the CoSky method for finding the top-k ranked skylines.
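
The scoring idea named in the abstract (Gini-based attribute weights plus Salton's cosine to the ideal object) can be sketched roughly as below. This is an illustrative reading of the abstract only, not the paper's algorithm: the column normalization, the Gini formulation, and the larger-is-better assumption for the ideal point are all assumptions.

```python
import numpy as np

def gini_weights(X):
    """Attribute weights from a Gini-style dispersion measure
    (mean-absolute-difference form; the paper's exact scheme may differ)."""
    _, d = X.shape
    g = np.empty(d)
    for j in range(d):
        col = X[:, j]
        mad = np.abs(col[:, None] - col[None, :]).mean()
        g[j] = mad / (2 * col.mean()) if col.mean() > 0 else 0.0
    return g / g.sum()

def cosky_scores(skyline):
    """Score each skyline tuple by the cosine of the angle between its
    weighted, normalized attribute vector and the ideal vector."""
    X = np.asarray(skyline, dtype=float)
    X = X / X.sum(axis=0)            # per-attribute normalization (assumed)
    Xw = X * gini_weights(X)
    ideal = Xw.max(axis=0)           # assuming larger-is-better attributes
    num = Xw @ ideal
    den = np.linalg.norm(Xw, axis=1) * np.linalg.norm(ideal)
    return num / den
```

On a toy skyline such as `[[5, 1], [3, 3], [1, 5]]`, the balanced tuple `[3, 3]` points in exactly the ideal direction and therefore receives the maximal cosine score of 1.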

Paper Nr: 67
Title:

Comparison of Naïve Bayes, Support Vector Machine, Decision Trees and Random Forest on Sentiment Analysis

Authors:

Márcio Guia, Rodrigo R. Silva and Jorge Bernardino

Abstract: Every day, we deal with a lot of information on the Internet. This information can originate from many different places, such as online review sites and social networks. In the midst of this messy data arises the opportunity to understand the subjective opinion expressed in a text, in particular its polarity. Sentiment Analysis and Text Classification help to extract precious information from data by assigning a text to one or more target categories according to its content. This paper compares four of the most popular text classification algorithms - Naive Bayes, Support Vector Machine, Decision Trees and Random Forest - on the Amazon Unlocked Mobile phone reviews dataset. Moreover, we also study the impact of some attributes (brand and price) on the polarity of the review. Our results demonstrate that the Support Vector Machine is the most complete algorithm of this study, achieving the highest values in all the metrics: accuracy, precision, recall, and F1 score.
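
To make the setup concrete, here is a minimal sketch of one of the four compared classifiers, a multinomial Naive Bayes text classifier with Laplace smoothing, written from scratch on a hypothetical mini-corpus. It illustrates the polarity-classification task only; the paper's experiments use the Amazon Unlocked Mobile dataset and full-featured implementations.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class word counts, class priors, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict_nb(model, doc):
    """Pick the class with the highest log-posterior under Laplace smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for tok in doc.lower().split():
            lp += math.log((word_counts[c][tok] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

Trained on a handful of labelled reviews, such a model already separates clearly positive and negative phrasings; the paper's contribution lies in comparing this family of classifiers against SVMs, Decision Trees, and Random Forests on real review data.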

Paper Nr: 70
Title:

Things You Might Not Know about the k-Nearest Neighbors Algorithm

Authors:

Aleksandra Karpus, Marta Raczyńska and Adam Przybylek

Abstract: Recommender Systems aim at suggesting potentially interesting items to a user. The most common kind of Recommender System is Collaborative Filtering, which follows the intuition that users who liked the same things in the past are likely to be interested in the same things in the future. One Collaborative Filtering method is the k Nearest Neighbors algorithm, which finds the k users most similar to an active user and then computes recommendations based on that subset of users. The main aim of this paper is to compare two implementations of the k Nearest Neighbors algorithm, from the Mahout and LensKit libraries, as well as six similarity measures. We investigate how implementation differences between the libraries influence the optimal neighborhood size k and the prediction error. We also show that measures like F1-score and nDCG are not always a good choice for selecting the best neighborhood size k. Finally, we compare the different similarity measures according to the average time of generating recommendations and the prediction error.
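
The user-based kNN prediction step the abstract refers to can be sketched as below, assuming cosine similarity over co-rated items as the similarity measure (one of several the paper compares). This is a generic textbook sketch, not the Mahout or LensKit implementation.

```python
import numpy as np

def cosine_sim(a, b, mask):
    """Cosine similarity restricted to co-rated items."""
    if not mask.any():
        return 0.0
    a, b = a[mask], b[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def knn_predict(R, user, item, k=2):
    """Predict R[user, item] as a similarity-weighted average over the
    k most similar users who rated the item (np.nan marks missing ratings)."""
    rated = ~np.isnan(R)
    sims = []
    for other in range(R.shape[0]):
        if other == user or np.isnan(R[other, item]):
            continue
        mask = rated[user] & rated[other]
        s = cosine_sim(np.nan_to_num(R[user]), np.nan_to_num(R[other]), mask)
        sims.append((s, other))
    sims.sort(reverse=True)
    top = sims[:k]
    num = sum(s * R[o, item] for s, o in top)
    den = sum(s for s, _ in top)
    return num / den if den > 0 else np.nan
```

Differences between library implementations of exactly this step, such as how ties, missing ratings, and the candidate neighbor set are handled, are what the paper shows can shift the optimal k.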

Paper Nr: 71
Title:

Collaboration Spotting Cite: An Exploration System for the Bibliographic Information of Publications and Patents

Authors:

André Rattinger, Jean-Marie L. Goff and Christian Guetl

Abstract: Collaboration Spotting is a knowledge discovery web platform that visualizes linked data as graphs. It enables users to manipulate the graph so as to see and explore different facets of complex networks with multiple node and edge types, combining information retrieval and graph analysis to effectively explore arbitrary datasets. The platform is designed so that non-expert users without data science knowledge can use it; for this, the data has to be specifically crafted in the form of a schema. The paper explores the platform in a bibliometrics context and demonstrates its search and relevance feedback mechanisms, which can be applied while navigating an underlying knowledge graph based on publication and patent metadata. This demonstrates a novel way to interactively explore linked datasets by combining visual analytics for graphs with relevance feedback.

Paper Nr: 81
Title:

Visualization Techniques for Network Analysis and Link Analysis Algorithms

Authors:

Ying Zhao, Ralucca Gera, Quinn Halpin and Jesse Zhou

Abstract: Military applications require big distributed, disparate, multi-sourced and real-time data with extremely high rates, high volumes, and diverse types. Warfighters need deep models, including big data analytics, network analysis, link analysis, deep learning, machine learning, and artificial intelligence, to transform big data into smart data. Explainable deep models will play an increasingly essential role in helping future warfighters understand, interpret, and therefore appropriately trust and effectively manage an emerging generation of artificially intelligent machine partners when facing complex threats. In this paper, we show how visualization is used in two typical deep models with two use cases: network analysis, which addresses how to display and present big data in the exploratory and discovery process, and link analysis, which addresses how to display and present the smart data generated from those processes. Using various visualization tools such as D3, Tableau, and lexical link analysis, we derive useful information, from discovering big networks to discovering big data patterns and anomalies. These visualizations become interpretable and explainable deep models that warfighters and decision makers can readily use to achieve sense-making and decision-making superiority.