KDIR 2015 Abstracts


Full Papers
Paper Nr: 17
Title:

Temporal-based Feature Selection and Transfer Learning for Text Categorization

Authors:

Fumiyo Fukumoto and Yoshimi Suzuki

Abstract: This paper addresses the text categorization problem in which the training data may derive from a different time period than the test data. We present a method for text categorization that minimizes the impact of temporal effects. Like much previous work on text categorization, we used feature selection. We selected two types of informative terms according to corpus statistics. The first is temporally independent terms, which are salient across the full temporal range of the training documents. The second is temporally dependent terms, which are important for a specific time period. For the training documents represented by independent/dependent terms, we applied boosting-based transfer learning to learn an accurate model for timeline adaptation. Results on Japanese data showed that the method was comparable to the current state-of-the-art biased-SVM method: the macro-averaged F-score obtained by our method was 0.688 and that of biased-SVM was 0.671. Moreover, we found that the method is especially effective when the creation time period of the test data differs greatly from that of the training data.

Paper Nr: 20
Title:

Social Network Analysis for Predicting Emerging Researchers

Authors:

Syed Masum Billah and Susan Gauch

Abstract: Finding rising stars in academia early in their careers has many implications when hiring new faculty, applying for promotion, or requesting grants. Typically, the impact and productivity of a researcher are assessed by a popular measurement called the h-index, which grows linearly with the academic age of a researcher. Therefore, the h-indices of researchers in the early stages of their careers are almost uniformly low, making it difficult to identify those who will, in the future, emerge as influential leaders in their field. To overcome this problem, we make use of social network analysis to identify young researchers most likely to become successful as measured by their h-index. We assume that the co-authorship graph reveals a great deal of information about the potential of young researchers. We built a social network of 62,886 researchers using the data available in CiteSeerx. We then designed and trained a linear SVM classifier to identify emerging authors based on their personal attributes and/or their networks of co-authors. We evaluated our classifier’s ability to predict the future research impact of a set of 26,170 young researchers, those with an h-index of less than or equal to two in 2005. By examining their actual impact six years later, we demonstrate that the success of young researchers can be predicted more accurately from their professional networks than from their established track records.
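The h-index the abstract uses as its success measure is simple to compute; a minimal sketch:

```python
def h_index(citations):
    """h-index: the largest h such that the author has h papers
    with at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(cites, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h
```

For example, an author with papers cited 10, 8, 5, 4 and 3 times has an h-index of 4, since four papers have at least four citations each.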

Paper Nr: 22
Title:

POP: A Parallel Optimized Preparation of Data for Data Mining

Authors:

Christian Ernst, Youssef Hmamouche and Alain Casali

Abstract: In light of the fact that data preparation has a substantial impact on data mining results, we provide an original framework for automatically preparing the data of any given database. For each attribute of the database, our research focuses on two points: (i) specifying an optimized outlier detection method, and (ii) identifying the most appropriate discretization method. Concerning the former, we illustrate that the detection of an outlier depends on whether the data distribution is normal or not. When attempting to discern the best discretization method, what matters is the shape of the density function of the attribute's distribution law. For this reason, we propose an automatic choice of the optimized discretization method based on a multi-criteria (entropy, variance, stability) evaluation. Processing is performed in parallel using multicore capabilities. Conducted experiments validate our approach, showing that no single discretization method is always the best.
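The normality-dependent choice of outlier test that the abstract describes can be illustrated with two textbook rules (3-sigma versus Tukey's IQR fences); the paper's actual selection procedure is automatic and may differ from this sketch:

```python
import statistics

def outliers(values, assume_normal):
    """Flag outliers with a rule chosen by the (assumed) distribution:
    the 3-sigma rule for normal data, Tukey's 1.5*IQR fences otherwise.
    A sketch of the idea only; the paper selects the test automatically."""
    if assume_normal:
        mu = statistics.mean(values)
        sigma = statistics.stdev(values)
        return [v for v in values if abs(v - mu) > 3 * sigma]
    qs = statistics.quantiles(values, n=4)  # q1, median, q3
    q1, q3 = qs[0], qs[2]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]
```

On skewed data the IQR fences are the safer default, since mean and standard deviation are themselves distorted by the outliers they are supposed to detect.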

Paper Nr: 27
Title:

Assessing Vertex Relevance based on Community Detection

Authors:

Paul Parau, Camelia Lemnaru and Rodica Potolea

Abstract: The community structure of a network conveys information about the network as a whole, but it can also provide insightful information about the individual vertices. Identifying the most relevant vertices in a network can prove to be useful, especially in large networks. In this paper, we explore different alternatives for assessing the relevance of a vertex based on the community structure of the network. We distinguish between two relevant vertex properties - commitment and importance - and propose a new measure for quantifying commitment, Relative Commitment. We also propose a strategy for estimating the importance of a vertex, based on observing the disruption caused by removing it from the network. Ultimately, we propose a vertex classification strategy based on commitment and importance, and discuss the aspects covered by each of the two properties in capturing the relevance of a vertex.
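The disruption-based importance estimate can be sketched crudely with plain graph connectivity (the paper observes disruption of the community structure; counting connected components serves as a stand-in here):

```python
from collections import deque

def components(adj, removed=frozenset()):
    """Count connected components of an undirected graph (adjacency
    dict), skipping any vertices in 'removed'."""
    seen, comps = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        comps += 1
        queue = deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return comps

def disruption(adj, vertex):
    """How many extra components appear when 'vertex' is removed --
    one crude proxy for the importance the abstract describes."""
    return components(adj, removed={vertex}) - components(adj)
```

Removing the centre of a star graph splits it into isolated leaves, so the centre scores high while any leaf scores zero.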

Paper Nr: 28
Title:

An Approach for Semantic Search over Lithuanian News Website Corpus

Authors:

Tomas Vileiniškis, Algirdas Šukys and Rita Butkienė

Abstract: The continuous growth of unstructured textual information on the web implies the need for novel, semantically aware content processing and information retrieval (IR) methods. Following the evolution and wide adoption of Semantic Web technology, a number of approaches to overcome the limitations of traditional keyword-based search techniques have been proposed. However, most of the research concentrates on English and other well-known, linguistic-resource-rich languages. Hence, this paper presents an approach to semantic search over domain-specific Lithuanian web documents. We introduce an ontology-based semantic search framework capable of answering structured natural Lithuanian language questions and discuss its language-dependent design decisions. The findings from a recent case study showed that our proposed framework can be applied to meaning-based IR with significant results, even when the underlying language is morphologically rich and has limited linguistic resources.

Paper Nr: 30
Title:

Data Driven Structural Similarity - A Distance Measure for Adaptive Linear Approximations of Time Series

Authors:

Victor Ionescu, Rodica Potolea and Mihaela Dinsoreanu

Abstract: Much effort has been invested in recent years in the problem of detecting similarity in time series. Most work focuses on the identification of exact matches through point-by-point comparisons, although in many real-world problems recurring patterns match each other only approximately. We introduce a new approach for identifying patterns in time series, which evaluates the similarity by comparing the overall structure of candidate sequences instead of focusing on the local shapes of the sequence and propose a new distance measure ABC (Area Between Curves) that is used to achieve this goal. The approach is based on a data-driven linear approximation method that is intuitive, offers a high compression ratio and adapts to the overall shape of the sequence. The similarity of candidate sequences is quantified by means of the novel distance measure, applied directly to the linear approximation of the time series. Our evaluations performed on multiple data sets show that our proposed technique outperforms similarity search approaches based on the commonly referenced Euclidean Distance in the majority of cases. The most significant improvements are obtained when applying our method to domains and data sets where matching sequences are indeed primarily determined based on the similarity of their higher-level structures.
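An area-between-curves distance on piecewise-linear approximations can be sketched as follows; this is an illustrative construction, assuming both curves share a common domain, and is not necessarily the authors' exact ABC definition:

```python
def interp(points, x):
    """Linear interpolation on a piecewise-linear curve given as (x, y) pairs."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            t = 0.0 if x1 == x0 else (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x outside curve domain")

def area_between(f, g):
    """Area between two piecewise-linear curves: evaluate the difference
    on the union of breakpoints, insert zero-crossings so that |f - g|
    is linear on every sub-interval, then integrate by trapezoids."""
    xs = sorted({x for x, _ in f} | {x for x, _ in g})
    d = [interp(f, x) - interp(g, x) for x in xs]
    pts = []
    for i in range(len(xs) - 1):
        pts.append((xs[i], d[i]))
        if d[i] * d[i + 1] < 0:  # curves cross inside this interval
            t = d[i] / (d[i] - d[i + 1])
            pts.append((xs[i] + t * (xs[i + 1] - xs[i]), 0.0))
    pts.append((xs[-1], d[-1]))
    return sum((x1 - x0) * (abs(y0) + abs(y1)) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

Because both curves are piecewise linear, this integral is exact once the crossing points are inserted, which is what makes such a distance cheap to evaluate directly on the linear approximation.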

Paper Nr: 53
Title:

News Classifications with Labeled LDA

Authors:

Yiqi Bai and Jie Wang

Abstract: Automatically categorizing news articles with high accuracy is an important task in an automated quick news system. We present two classifiers, LLDA-C and SLLDA-C, that classify news articles based on Labeled Latent Dirichlet Allocation. To verify classification accuracy, we compare the classification results obtained by the classifiers with those produced by trained professionals. Through extensive experiments, we show that both LLDA-C and SLLDA-C outperform SVM (Support Vector Machine, our baseline classifier) on precision, particularly when only a small training dataset is available. SLLDA-C is also much more efficient than SVM. In terms of recall, we show that LLDA-C is better than SVM. In terms of average Macro-F1 and Micro-F1 scores, we show that the LLDA classifiers are superior to SVM. To further explore the classification of news articles, we introduce the notion of content complexity and study how content complexity affects classification.

Paper Nr: 54
Title:

Piecewise Chebyshev Factorization based Nearest Neighbour Classification for Time Series

Authors:

Qinglin Cai, Ling Chen and Jianling Sun

Abstract: In the research field of time series analysis and mining, the nearest neighbour classifier (1NN) based on the dynamic time warping distance (DTW) is well known for its high accuracy. However, the high computational complexity of DTW makes classification expensive. An effective solution is to compute DTW in a piecewise approximation space (PA-DTW), which transforms the raw data into a feature space based on segmentation and extracts discriminatory features for similarity measurement. However, most existing piecewise approximation methods require a fixed segment length and focus on simple statistical features, which limits the precision of PA-DTW. To address this problem, we propose a novel piecewise factorization model for time series, which uses an adaptive segmentation method and factorizes the subsequences with Chebyshev polynomials. The Chebyshev coefficients, which capture the fluctuation information of the time series, are extracted as features for the PA-DTW measure (ChebyDTW). Comprehensive experimental results show that ChebyDTW supports accurate and fast 1NN classification.
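The DTW cost that PA-DTW approximates is the standard dynamic-programming recurrence; a textbook sketch:

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic O(len(a) * len(b)) dynamic time warping distance.
    Illustrates the cost the paper reduces by warping over
    piecewise features instead of the raw series."""
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = cost of best warping path aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

The quadratic table is exactly why shortening the sequences first (by segmentation and feature extraction, as ChebyDTW does) pays off: warping k segments instead of n points shrinks the table from n x n to k x k.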

Paper Nr: 66
Title:

Domain-Specific Relation Extraction - Using Distant Supervision Machine Learning

Authors:

Abduladem Aljamel, Taha Osman and Giovanni Acampora

Abstract: The increasing accessibility and availability of online data provide a valuable knowledge source for information analysis and decision-making processes. In this paper we argue that extracting information from this data is better guided by domain knowledge of the targeted use-case, and we investigate the integration of a knowledge-driven approach with Machine Learning techniques in order to improve the quality of the Relation Extraction process. Targeting the financial domain, we use Semantic Web Technologies to build the domain Knowledgebase, which is in turn exploited to collect distant supervision training data from semantic linked datasets such as DBPedia and Freebase. We conducted a series of experiments that utilise a number of Machine Learning algorithms to report on the favourable implementations/configurations for successful Information Extraction in our targeted domain.

Paper Nr: 87
Title:

Big Graph-based Data Visualization Experiences - The WordNet Case Study

Authors:

Enrico Caldarola, Antonio Picariello and Antonio M. Rinaldi

Abstract: In the Big Data era, the visualization of large data sets is becoming an increasingly relevant task due to the great impact that data have from a human perspective. Since visualization is the phase of the data life cycle closest to the users, there is no doubt that an effective, efficient and impressive representation of the analyzed data may be as important as the analytic process itself. This paper presents an experience in importing, querying and visualizing graph databases; in particular, we describe as a case study the WordNet database using Neo4J and Cytoscape. We describe each step of this study, focusing on the strategies used to overcome the different problems arising mainly from the intricate nature of the case study. Finally, we attempt to define some criteria to simplify the large-scale visualization of WordNet, providing some examples and the considerations that have arisen.

Paper Nr: 88
Title:

Bringing Search Engines to the Cloud using Open Source Components

Authors:

Khaled Nagi

Abstract: The usage of search engines nowadays extends to intelligent analytics of petabytes of data. With Lucene at the heart of the vast majority of information retrieval systems, several attempts have been made to bring it to the cloud in order to scale to big data. Efforts include implementing scalable distribution of the search indices over the file system, storing them in NoSQL databases, and porting them to inherently distributed ecosystems such as Hadoop. We evaluate the existing efforts in terms of distribution, high availability, fault tolerance, manageability, and high performance. We believe that search indexing capabilities for big data can only be supported through the use of common open-source technology deployed on standard cloud platforms such as Amazon EC2, Microsoft Azure, etc. For each approach, we build a benchmarking system by indexing the whole Wikipedia content and submitting hundreds of simultaneous search requests. We measure the performance of both indexing and searching operations. We simulate node failures and monitor the recoverability of the system. We show that a system built on top of Solr and Hadoop has the best stability and manageability, while systems based on NoSQL databases present an attractive alternative in terms of performance.

Paper Nr: 93
Title:

Markov Chain based Method for In-Domain and Cross-Domain Sentiment Classification

Authors:

Giacomo Domeniconi, Gianluca Moro, Andrea Pagliarani and Roberto Pasolini

Abstract: Sentiment classification of textual opinions into positive, negative or neutral polarity is a way to understand people's thoughts about products, services, persons, organisations, and so on. Interpreting and appropriately labelling the polarity of text data is a costly activity if performed by human experts. To cut this labelling cost, cross-domain approaches have been developed in which the goal is to automatically classify the polarity of an unlabelled target text set of a given domain, for example movie reviews, from a labelled source text set of another domain, such as book reviews. Language heterogeneity between the source and target domains is the trickiest issue in the cross-domain setting, so a preliminary transfer learning phase is generally required. The best performing techniques addressing this point are generally complex and require onerous parameter tuning for each new source-target pair. This paper introduces a simpler method based on Markov chain theory to accomplish both the transfer learning and sentiment classification tasks. This straightforward technique requires a lower parameter calibration effort. Experiments on popular text sets show that our approach achieves performance comparable with that of other works.

Paper Nr: 100
Title:

Word Sense Discrimination on Tweets: A Graph-based Approach

Authors:

Flavio Massimiliano Cecchini, Elisabetta Fersini and Enza Messina

Abstract: In this paper we detail an unsupervised, graph-based approach to word sense discrimination on tweets. We deal with this problem by constructing a word graph of co-occurrences. By defining a distance on this graph, we obtain a word metric space, on which we can apply an agglomerative algorithm for word clustering. As a result, we obtain word clusters representing contexts that discriminate the possible senses of a term. We present experimental results both on a data set of tweets that we collected and on the data set of task 14 at SemEval-2010.
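The co-occurrence graph construction the abstract starts from can be sketched with a sliding window; the window size and the simple count weighting below are illustrative choices, not the paper's exact parameters:

```python
from collections import defaultdict

def cooccurrence_graph(tweets, window=3):
    """Undirected co-occurrence graph: the weight of an edge {u, v} is
    the number of times tokens u and v appear within 'window' positions
    of each other in a tweet.  A sketch of the construction step only,
    not the paper's full discrimination pipeline."""
    weight = defaultdict(int)
    for tokens in tweets:
        for i, u in enumerate(tokens):
            # look at the next (window - 1) tokens only, so each
            # unordered pair is counted once per occurrence
            for v in tokens[i + 1:i + window]:
                if u != v:
                    weight[frozenset((u, v))] += 1
    return weight
```

A distance such as the reciprocal of the edge weight (or a shortest-path length through the graph) then turns these counts into the word metric space the abstract mentions.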

Paper Nr: 101
Title:

Leveraging Entity Linking to Enhance Entity Recognition in Microblogs

Authors:

Pikakshi Manchanda, Elisabetta Fersini and Matteo Palmonari

Abstract: The Web of Data provides abundant knowledge wherein objects or entities are described by means of properties and their relationships with other objects or entities. This knowledge is used extensively by the research community for Information Extraction tasks such as Named Entity Recognition (NER) and Linking (NEL) to make sense of data. Named entities can be identified in a variety of textual formats and are further linked to corresponding resources in the Web of Data. These tasks of entity recognition and linking are, however, cast as distinct problems in the state-of-the-art, thereby overlooking the fact that the performance of entity recognition affects the performance of entity linking. The focus of this paper is to improve the performance of entity recognition on a particular textual format, viz. microblog posts, by disambiguating the named entities with resources in a Knowledge Base (KB). We propose an unsupervised learning approach to jointly improve the performance of entity recognition and, thus, of the whole system by leveraging the results of disambiguated entities.

Short Papers
Paper Nr: 6
Title:

Using Collective Intelligence to Generate Trend-based Travel Recommendations

Authors:

Sabine Schlick, Isabella Eigner and Alex Fechner

Abstract: Trips are multifaceted, complex products that cannot be tested in advance due to their geographical distance. Hence, when making a travel decision, people often ask others for advice, which makes communities increasingly important. Within communities, people share their experiences, which results in new, more extensive knowledge beyond the individual knowledge of each member. The objective of this paper is to use this knowledge by developing an algorithm that automatically generates trend-based travel recommendations. Based on the travel experiences of the community members, interesting travel areas are identified. Five key figures are developed to evaluate these areas according to general criteria and the users’ individual preferences. The algorithm can generate recommendations for the whole community and not only for highly active members, resulting in high coverage. A study conducted within an online travel community shows that automatically generated, trend-based trip recommendations are rated better than user-generated recommendations.

Paper Nr: 10
Title:

Visualizing Cultural Digital Resources using Social Network Analysis

Authors:

Antonio Capodieci, Daniele D’Aprile, Gianluca Elia, Francesca Grippa and Luca Mainetti

Abstract: This paper describes the design and implementation of a prototype to extract, collect and visually analyse cultural digital resources using social network analysis empowered with semantic features. An initial experiment involved the collection and visualization of connections between cultural digital resources - and their providers - stored in the platform DiCet (an Italian Living Lab centred on Cultural Heritage and Technology). This step helped to identify the most appropriate relational data model to use for the social network visualization phase. We then ran a second experiment using a web application designed to extract relevant data from the platform Europeana.eu. The actors in our two-mode networks are Cultural Heritage Objects (CHOs) shared by institutional and individual providers, such as galleries, museums, individual experts and content aggregators. The links connecting nodes represent the digital resources associated with the CHOs. The application of the prototype offers insights into the most prominent providers, digital resources and cultural objects over time. Through the application of semantic analysis, we were also able to identify the most frequently used words and the sentiment associated with them.

Paper Nr: 12
Title:

Comparison of Sampling Size Estimation Techniques for Association Rule Mining

Authors:

Tuğba Halıcı and Utku Görkem Ketenci

Abstract: Fast and complete retrieval of individual customer needs and "to the point" product offers are crucial aspects of customer satisfaction in today's highly competitive banking sector. The growing number of transactions and customers has greatly increased the time and memory required for market basket analysis. In this paper, a sampling process is included in the analysis, aiming to increase the performance of a product offer system. The core purpose of sampling is to find a smaller representative of the universe that still generates accurate association rules. A smaller sample of the universe reduces the elapsed time and the memory consumption devoted to market basket analysis. In this context, sampling methods, sampling size estimation techniques and representativeness tests are examined. The technique that yields the complete set of association rules in a reduced amount of time is suggested for sampling retail banking data.
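One classical sample-size estimation technique of the kind the paper compares is Cochran's formula with a finite-population correction; it is shown here purely for illustration and is not necessarily the estimator the paper recommends:

```python
import math

def cochran_sample_size(population, confidence_z=1.96, margin=0.05, p=0.5):
    """Cochran's sample-size formula with finite-population correction.
    confidence_z: z-score for the confidence level (1.96 for 95%),
    margin: acceptable margin of error,
    p: estimated proportion (0.5 is the most conservative choice)."""
    n0 = (confidence_z ** 2) * p * (1 - p) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)
```

Note how weakly the required size depends on the population: about 385 records suffice for a million transactions at 95% confidence and a 5% margin, which is exactly why sampling pays off for large transaction databases.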

Paper Nr: 23
Title:

Construction of a Bayesian Network as an Extension of Propositional Logic

Authors:

Takuto Enomoto and Masaomi Kimura

Abstract: A Bayesian network is a probabilistic graphical model. Many conventional methods have been proposed for its construction, but they often result in an incorrect Bayesian network structure. In this study, to correctly construct a Bayesian network, we extend the concept of propositional logic. We propose a methodology for constructing a Bayesian network with causal relationships that are extracted only if the antecedent states are true. To determine the logic to be used in constructing the Bayesian network, we propose the use of association rule mining, such as the Apriori algorithm. We evaluate the proposed method by comparing its results with those of a traditional method, Bayesian Dirichlet equivalent uniform (BDeu) score evaluation with a hill-climbing algorithm; the comparison shows that our method generates a network with more of the necessary arcs than the traditional method.

Paper Nr: 24
Title:

Real-Time Prediction to Support Decision-making in Soccer

Authors:

Yasuo Saito, Masaomi Kimura and Satoshi Ishizaki

Abstract: Data analysis in sports has been developing for many years. However, to date, a system that provides tactical predictions in real time and suggests ideas for increasing the chance of winning has not been reported in the literature. In soccer especially, the components of plays and games are more complicated than in other sports. This study proposes a method to predict the course of a game and create a strategy for the second half. First, we summarize other studies and propose our method. Then, data are collected using the proposed system. From past games, games similar to a target game are extracted based on data from their first halves. Next, the similar games are classified by features derived from data on their second halves. Finally, the target game is predicted and tactical ideas are derived. The practicability of the method is demonstrated through experiments. However, further improvements, such as increasing the number of past games and the types of data, are still required.

Paper Nr: 25
Title:

SCUT: Multi-Class Imbalanced Data Classification using SMOTE and Cluster-based Undersampling

Authors:

Astha Agrawal, Herna L. Viktor and Eric Paquet

Abstract: Class imbalance is a crucial problem in machine learning and occurs in many domains. Specifically, the two-class problem has received interest from researchers in recent years, leading to solutions for oil spill detection, tumour discovery and fraudulent credit card detection, amongst others. However, handling class imbalance in datasets that contain multiple classes, with varying degrees of imbalance, has received limited attention. In such a multi-class imbalanced dataset, the classification model tends to favour the majority classes and incorrectly classify instances from the minority classes as belonging to the majority classes, leading to poor predictive accuracy. Further, there is a need to handle both the imbalance between classes and the selection of examples within a class (i.e. the so-called within-class imbalance). In this paper, we propose the SCUT hybrid sampling method, which is used to balance the number of training examples in such a multi-class setting. Our SCUT approach oversamples minority class examples through the generation of synthetic examples and employs cluster analysis in order to undersample majority classes. In addition, it handles both within-class and between-class imbalance. Our experimental results on a number of multi-class problems show that, when the SCUT method is used for pre-processing the data before classification, we obtain highly accurate models that compare favourably to the state-of-the-art.
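The oversampling half of SCUT follows the SMOTE idea of interpolating between minority-class neighbours; a self-contained sketch (the neighbourhood size k and the Euclidean distance are illustrative choices, and SCUT additionally undersamples the majority classes via clustering, which is not shown here):

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling sketch: each synthetic point is a random
    interpolation between a minority example and one of its k nearest
    minority neighbours (Euclidean distance).  Points are tuples."""
    rng = random.Random(seed)

    def d2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: d2(x, p))[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + t * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic
```

Because every synthetic example lies on a segment between two real minority examples, the new points stay inside the minority region rather than duplicating existing instances.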

Paper Nr: 35
Title:

Use of Frequent Itemset Mining Techniques to Analyze Business Processes

Authors:

Vladimír Bartík and Milan Pospíšil

Abstract: Analysis of business process data can be used to discover the reasons for delays and other problems in a business process. This paper presents an approach that uses a simulator of production history. The simulator allows detecting problems at various production machines, e.g. extremely long queues of products waiting before a machine. After detection, data about the products processed before the queue increased are collected. Frequent itemsets obtained from this dataset can be used to describe the problem and its causes. The whole process of frequent itemset mining is described in this paper, with a focus on several necessary modifications of the basic methods usually used to discover frequent itemsets.
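The basic frequent-itemset mining step that the paper modifies is classically done with Apriori; a plain levelwise sketch:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Plain Apriori: levelwise generation of frequent itemsets.
    Returns {itemset: support_count} for all itemsets occurring in at
    least min_support transactions.  A textbook sketch of the mining
    step, without the paper's modifications."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    frequent = {}
    level = [frozenset([i]) for i in sorted(items)]
    while level:
        level = [s for s in level if support(s) >= min_support]
        frequent.update({s: support(s) for s in level})
        # candidate (k+1)-itemsets: unions of frequent k-itemsets
        level = list({a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1})
    return frequent
```

The levelwise structure exploits the fact that every subset of a frequent itemset is itself frequent, so infrequent candidates are pruned early instead of being enumerated exhaustively.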

Paper Nr: 41
Title:

Interesting Regression- and Model Trees Through Variable Restrictions

Authors:

Rikard König, Ulf Johansson, Ann Lindqvist and Peter Brattberg

Abstract: The overall purpose of this paper is to suggest a new technique for creating interesting regression- and model trees. Interesting models are here defined as models that fulfill some domain dependent restriction of how variables can be used in the models. The suggested technique, named ReReM, is an extension of M5 which can enforce variable constraints while creating regression and model trees. To evaluate ReReM, two case studies were conducted where the first concerned modeling of golf player skill, and the second modeling of fuel consumption in trucks. Both case studies had variable constraints, defined by domain experts, that should be fulfilled for models to be deemed interesting. When used for modeling golf player skill, ReReM created regression trees that were slightly less accurate than M5’s regression trees. However, the models created with ReReM were deemed to be interesting by a golf teaching professional while the M5 models were not. In the second case study, ReReM was evaluated against M5’s model trees and a semi-automated approach often used in the automotive industry. Here, experiments showed that ReReM could achieve a predictive performance comparable to M5 and clearly better than a semi-automated approach, while fulfilling the constraints regarding interesting models.

Paper Nr: 47
Title:

When Textual Information Becomes Spatial Information Compatible with Satellite Images

Authors:

Eric Kergosien, Hugo Alatrista-Salas, Mauro Gaio, Fabio Güttler, Mathieu Roche and Maguelonne Teisseire

Abstract: With the amount of textual data available on the web, new methodologies for knowledge extraction have emerged. Some original methods allow users to combine different types of data in order to extract relevant information. In this context, we present the core manipulations of textual documents and their preparation for extracting spatial information compatible with that contained in satellite images. The term footprint is defined and its extraction is performed. In this paper, we describe the general process and some experiments conducted in the XXX project, which aims to match information coming from texts with that of satellite images.

Paper Nr: 51
Title:

Automatic Extraction of Task Statements from Structured Meeting Content

Authors:

Katashi Nagao, Kei Inoue, Naoya Morita and Shigeki Matsubara

Abstract: We previously developed a discussion mining system that records face-to-face meetings in detail, analyzes their content, and conducts knowledge discovery. Looking back on past discussion content by browsing documents, such as minutes, is an effective means of supporting future activities. In meetings at which research topics are regularly discussed, such as laboratory seminars, the presenters are required to discuss future issues by checking urgent matters in the discussion records. We call statements that include advice or requests proposed at previous meetings “task statements” and propose a method for automatically extracting them. With this method, a probabilistic model is created using the maximum entropy method, based on certain semantic attributes and linguistic characteristics of statements. A statement is judged to be a task statement or not according to its probability. A seminar-based experiment validated the effectiveness of the proposed extraction method.

Paper Nr: 52
Title:

The NOESIS Open Source Framework for Network Data Mining

Authors:

Víctor Martínez, Fernando Berzal and Juan-Carlos Cubero

Abstract: NOESIS is a software framework for the development of data mining techniques for networked data. As an open source project, released under a BSD license, NOESIS intends to provide the necessary infrastructure for solving complex network data mining problems. Currently, it includes a large collection of popular network-related data mining techniques, including the analysis of network structural properties, community detection algorithms, link scoring and prediction methods, and network visualization techniques. The design of NOESIS tries to facilitate the development of parallel algorithms using solid object-oriented design principles and structured parallel programming. NOESIS can be used as a stand-alone application, as many other network analysis packages, and can be included, as a lightweight library, in domain-specific data mining applications and systems.

Paper Nr: 57
Title:

POS Tagging-probability Weighted Method for Matching the Internet Recipe Ingredients with Food Composition Data

Authors:

Tome Eftimov and Barbara Koroušić Seljak

Abstract: In this paper, we present a new method for matching recipe ingredients extracted from the Internet to nutritional data from food composition databases (FCDBs). The method uses part-of-speech (POS) tagging to capture information from the names of the ingredients and the names of the food analyses in the FCDBs. A probability-weighted model is then presented, which takes the POS-tagging information into account to assign a weight to each match; the match with the highest weight is taken as the most relevant one and can be used for further analyses. We evaluated our method using a collection of 721 lunch recipes, from which we extracted 1,615 different ingredients; the results showed that our method can match 91.82% of the ingredients with the FCDB.
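The weighted-matching idea can be sketched with tagged tokens and per-POS weights; the weights below are illustrative placeholders, not the probabilities learned in the paper:

```python
def weighted_match(ingredient, food_name, weights=None):
    """Score how well an FCDB food name matches a recipe ingredient by
    summing the weights of shared tagged tokens.  Tokens are
    (word, POS) pairs; the default POS weights are hypothetical."""
    weights = weights or {"NOUN": 1.0, "ADJ": 0.5, "VERB": 0.2}
    shared = set(ingredient) & set(food_name)
    return sum(weights.get(pos, 0.1) for _, pos in shared)

def best_match(ingredient, candidates):
    """Pick the candidate food name with the highest weighted score --
    the 'most relevant match' role described in the abstract."""
    return max(candidates, key=lambda name: weighted_match(ingredient, name))
```

Weighting nouns above modifiers reflects the intuition that "oil" matters more than "extra" when matching "extra virgin olive oil" against database entries.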

Paper Nr: 60
Title:

A Structural Model of Internet Organization Discovery

Authors:

Zi-yu Yang, Xiao-yun Wang, Hong-mei Ma and Li Qin

Abstract: This paper presents a highly structured model to automatically discover Internet organizations from the data of RIRs (Regional Internet Registries) and IRRs (Internet Routing Registries), where network operators register their networking resources, such as IP addresses and routing policies. Our basic idea is to discover network operators that have close ties to each other from those registry activities and to consider them as belonging to the same organization. With the data from two RIRs, this model produces the first organization-level network of the current Internet to date. The model's validity is supported by our validation with real Internet routing data, and it is likely to be applied extensively in the networking area.

Paper Nr: 61
Title:

The Reverse Doubling Construction

Authors:

Jean-François Viaud, Karell Bertet, Christophe Demko and Rokia Missaoui

Abstract: It is well known within the Formal Concept Analysis (FCA) community that a concept lattice can have a size exponential in the size of the data. Hence, the size of concept lattices is a critical issue in the presence of large real-life data sets. In this paper, we propose to investigate factor lattices as a tool to obtain meaningful parts of the whole lattice. These factor lattices have been widely studied, from the early theory of lattices to more recent work in the FCA field. This paper contains two parts. The first gives background on lattice theory and formal concept analysis, mainly compatible sub-contexts, arrow-closed sub-contexts and congruence relations. The second part presents a new decomposition called the “reverse doubling construction” that exploits the above three notions used in the doubling convex construction investigated by Day. Theoretical results and their proofs are given, as well as an illustrative example.

Paper Nr: 62
Title:

Tool and Evaluation Method for Idea Creation Support

Authors:

Ryo Takeshima and Katashi Nagao

Abstract: We have developed a new idea creation support tool in which (1) each idea is represented by a tree structure, (2) the idea is automatically evaluated on the basis of the tree structure so that the relative advantages among several alternative ideas are found, (3) the ideas are presented in a poster format, and (4) the ideas are shared by multiple users so that they can be quoted and expanded upon by individual users. In this work, we explain the mechanisms of this tool, including the evaluation and poster conversion of ideas and collaborative idea creation, and briefly discuss our plans for the future.

Paper Nr: 63
Title:

qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations

Authors:

Jingwen Wang and Jie Wang

Abstract: We present a new method called qRead to achieve real-time content extraction from web pages with high accuracy. Early approaches to content extraction include empirical filtering rules, Document Object Model (DOM) trees, and machine learning models. These methods, while having met with certain success, may not meet the requirements of real-time extraction with high accuracy. For example, constructing a DOM tree for a complex web page is time-consuming, and using machine learning models could make things unnecessarily complicated. Unlike previous approaches, qRead uses segment densities and similarities to identify the main content. In particular, qRead first filters out obvious junk content, eliminates HTML tags, and partitions the remaining text into natural segments. It then uses the highest ratio of words to the number of lines in a segment, combined with the similarity between the segment and the title, to identify the main content. Through extensive experiments, we show that qRead achieves a 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and a 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds, providing substantial improvements in accuracy over previous approaches and meeting the real-time extraction requirement.
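The density-plus-similarity scoring that qRead applies to text segments can be sketched in a few lines. The weighting below (word-per-line density scaled by title overlap) is an assumption for illustration; the paper does not publish its exact combination, and `score_segments` is a hypothetical helper, not qRead's code.

```python
import re

def score_segments(segments, title):
    """Pick the segment with the highest combined score.

    Score = (words per line) * (1 + lexical overlap with the title).
    The equal-weight combination is an assumption; qRead's actual
    weighting is not reproduced here.
    """
    title_words = set(re.findall(r"\w+", title.lower()))
    scored = []
    for seg in segments:
        lines = [ln for ln in seg.splitlines() if ln.strip()]
        words = re.findall(r"\w+", seg.lower())
        if not lines or not words:
            continue
        density = len(words) / len(lines)  # words per line in this segment
        overlap = len(title_words & set(words)) / max(len(title_words), 1)
        scored.append((density * (1 + overlap), seg))
    return max(scored)[1] if scored else ""
```

A navigation bar ("Home | About | Contact") has low density and no title overlap, so a dense article paragraph mentioning the title terms wins.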

Paper Nr: 64
Title:

CBIR Search Engine for User Designed Query (UDQ)

Authors:

Tatiana Jaworska

Abstract: At present, most Content-Based Image Retrieval (CBIR) systems use query by example (QBE), but its drawback is that the user first has to find an image to use as a query. In some situations the most difficult task is to find the one proper image the user has in mind to feed to the system as a query by example. For our CBIR system, we prepared a dedicated GUI to construct a user designed query (UDQ). We describe a new search engine that matches images using both local and global image features for a query composed by the user. In our case, the spatial object location is the global feature. Our matching results take into account the kind and number of objects, their spatial layout and object feature vectors. Finally, we compare our matching results with those obtained by other search engines.

Paper Nr: 73
Title:

Concept Profiles for Filtering Parliamentary Documents

Authors:

Francisco J. Ribadas, Luis M. de Campos, Juan M. Fernández-Luna and Juan F. Huete

Abstract: Content-based recommender/filtering systems help to appropriately distribute information among the individuals or organizations that could consider it of interest. In this paper we describe a filtering system that deals with the problem of assigning documents to members of parliament potentially interested in them. The proposed approach exploits subjects taken from a conceptual thesaurus to create the user profiles and to describe the documents to be filtered. The assignment of subjects to documents is modeled as a multilabel classification problem. Experiments with a real parliamentary corpus are reported, evaluating several methods to assign conceptual subjects to documents and to match those sets of subjects with user profiles.

Paper Nr: 76
Title:

Prediction of Earnings per Share for Industry

Authors:

Swati Jadhav, Hongmei He and Karl Jenkins

Abstract: Prediction of Earnings Per Share (EPS) is a fundamental problem in the finance industry. Various data mining technologies have been widely used in computational finance. This research aims to predict future EPS from previous values through the use of data mining technologies, thus providing decision makers with a reference or evidence for their economic strategies and business activity. We created three models for the regression problem: Linear Regression (LR), Radial Basis Function networks (RBF) and Multilayer Perceptrons (MLP). Our experiments with these models were carried out on real datasets provided by a software company. The performance assessment was based on the correlation coefficient and the root mean squared error. These algorithms were validated with the data of six different companies. Some differences between the models have been observed. In most cases, Linear Regression and Multilayer Perceptron are effectively capable of predicting future EPS, but for highly nonlinear data, MLP gives better performance.
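As a hedged illustration of predicting a future value from previous values, the sketch below fits the simplest possible linear model, an order-one autoregression, by ordinary least squares. The single-lag setup and the function names are assumptions for illustration; the paper's LR model may use more lags and additional inputs.

```python
def fit_ar1(series):
    """Fit y_t = a + b * y_{t-1} by ordinary least squares.

    A minimal sketch of regressing a value on its previous value;
    b is cov(x, y) / var(x) and a is the residual intercept.
    """
    x, y = series[:-1], series[1:]
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    b = cov / var
    a = my - b * mx
    return a, b

def predict_next(series):
    """One-step-ahead forecast from the fitted AR(1) model."""
    a, b = fit_ar1(series)
    return a + b * series[-1]
```

On a series generated exactly by y_{t+1} = 2 y_t + 1 the fit recovers a = 1 and b = 2, and the forecast continues the recurrence.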

Paper Nr: 77
Title:

KISS MIR: Keep It Semantic and Social Music Information Retrieval

Authors:

Amna Dridi and Mouna Kacimi

Abstract: While content-based approaches for music information retrieval (MIR) have been heavily investigated, user-centric approaches are still at an early stage. Existing user-centric approaches use either music context or user context to personalize the search. However, none of them allows the user to choose the context suitable for his or her needs. In this paper we propose KISS MIR, a versatile approach to music information retrieval. It combines both music context and user context to rank search results. The core contribution of this work is the investigation of different types of contexts derived from social networks. We distinguish semantic and social information and use them to build semantic and social profiles for music and users. The different contexts and profiles can be combined and personalized by the user. We have assessed the quality of our model using a real dataset from Last.fm. The results show that using the user context to rank search results is twice as good as using the music context. More importantly, the combination of semantic and social information is crucial for satisfying user needs.

Paper Nr: 84
Title:

Pseudo Relevance Feedback Technique and Semantic Similarity for Corpus-based Expansion

Authors:

Masnizah Mohd, Jaffar Atwan and Kiyoaki Shirai

Abstract: The adaptation of a Query Expansion (QE) approach for Arabic documents may produce poor rankings or irrelevant results. Therefore, we have introduced a technique that utilises the Arabic WordNet at the corpus and query-expansion levels. A Point-wise Mutual Information (PMI) corpus-based measure is used to semantically select synonyms from the WordNet. In addition, Automatic Query Expansion (AQE) and Pseudo Relevance Feedback (PRF) methods were also explored to improve the performance of the Arabic information retrieval (AIR) system. The experimental results of our proposed techniques for AIR show that the use of the Arabic WordNet at the corpus and query levels together with AQE, and the adaptation of PMI in the expansion process, successfully reduce the level of ambiguity, as these techniques select the most appropriate synonym. This enhances knowledge discovery by taking care of the relevancy aspect. The techniques also demonstrated an improvement in Mean Average Precision of 49%, with an increase of 7.3% in recall in comparison to the baseline.
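The PMI measure named in the abstract has a standard definition, PMI(a, b) = log2(p(a, b) / (p(a) p(b))). The sketch below estimates it from within-sentence co-occurrence counts on a toy corpus; the paper applies the same quantity to an Arabic corpus to rank WordNet synonyms, and the helper here is illustrative rather than the authors' code.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """PMI over within-sentence co-occurrence, estimated from counts.

    Probabilities are maximum-likelihood estimates: document frequency
    divided by the number of sentences.
    """
    n = len(sentences)
    word_df = Counter()   # sentences containing each word
    pair_df = Counter()   # sentences containing each unordered pair
    for s in sentences:
        words = set(s.lower().split())
        word_df.update(words)
        pair_df.update(frozenset(p) for p in combinations(sorted(words), 2))
    scores = {}
    for pair, c in pair_df.items():
        a, b = tuple(pair)
        joint = c / n
        scores[tuple(sorted(pair))] = math.log2(
            joint / ((word_df[a] / n) * (word_df[b] / n)))
    return scores
```

Words that always appear together score above zero; independent words score near zero.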

Paper Nr: 86
Title:

Towards Reusability of Computational Experiments - Capturing and Sharing Research Objects from Knowledge Discovery Processes

Authors:

Armel Lefebvre, Marco Spruit and Wienand Omta

Abstract: Calls for more reproducible research through sharing code and data have been issued in a large number of fields, from biomedical science to signal processing. At the same time, the urge to solve data analysis bottlenecks in the biomedical field generates the need for more interactive data analytics solutions. These interactive solutions are oriented towards wet-lab users, whereas bioinformaticians favor custom analysis tools. In this position paper we elaborate on why Reproducible Research, by presenting code and data sharing as the gold standard for reproducibility, misses important challenges in data analytics. We suggest new ways to design interactive tools that embed constraints of reusability in data exploration. Finally, we seek to integrate our solution with Research Objects, as they are expected to bring promising advances in reusability and partial reproducibility of computational work.

Paper Nr: 89
Title:

Data Parsing using Tier Grammars

Authors:

Alexander Sakharov and Timothy Sakharov

Abstract: Parsing turns unstructured data into structured data suitable for knowledge discovery and querying. The complexity of grammar notations and the difficulty of grammar debugging limit the availability of data parsers. Tier grammars are defined by simply dividing terminals into predefined classes and then splitting elements of some classes into multiple layered sub-groups. The set of predefined terminal classes can be easily extended. Tier grammars and their extensions are LL(1) grammars. Tier grammars are a tool for big data preprocessing.

Paper Nr: 90
Title:

Motive-based Search - Computing Regions from Large Knowledge Bases using Geospatial Coordinates

Authors:

Liliya Avdiyenko, Martin Nettling, Christiane Lemke, Matthias Wauer, Axel-Cyrille Ngonga Ngomo and Andreas Both

Abstract: To create a better search experience for end users and to satisfy their actual intents even for vaguely formulated queries, a contemporary search engine has to go beyond simple keyword-based retrieval concepts. For geospatial search, where user queries can be quite complex, such as "places for winter sport holidays and culture in Central Europe", we introduce the notion of geospatial motifs denoting traits of geographical regions. Defining a motif by a set of geospatial entities with certain characteristics, we present an approach to inferring important regions for the motif based on the density of these entities. The evaluation of the approach on several motifs showed that the inferred regions are among the most popular places for a motif of interest according to the opinion of several experts and official rankings. Thus, we claim that the presented semi-automatic process of detecting regions for geospatial motifs can contribute to more powerful and flexible search applications that are able to answer user queries containing complex geospatial concepts.

Paper Nr: 92
Title:

A Continuum among Logarithmic, Linear, and Exponential Functions, and Its Potential to Improve Generalization in Neural Networks

Authors:

Luke B. Godfrey and Michael S. Gashler

Abstract: We present the soft exponential activation function for artificial neural networks that continuously interpolates between logarithmic, linear, and exponential functions. This activation function is simple, differentiable, and parameterized so that it can be trained as the rest of the network is trained. We hypothesize that soft exponential has the potential to improve neural network learning, as it can exactly calculate many natural operations that typical neural networks can only approximate, including addition, multiplication, inner product, distance, and sinusoids.
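The continuum itself is compact enough to state directly. The sketch below transcribes the three-case definition of soft exponential as published by the authors (reproduced here from memory, so the paper remains authoritative): at α = -1 it reduces to ln x, at α = 0 to the identity, and at α = 1 to eˣ, interpolating continuously in between.

```python
import math

def soft_exponential(alpha, x):
    """Soft exponential activation (Godfrey and Gashler).

    f(alpha, x) = -ln(1 - alpha*(x + alpha)) / alpha   if alpha < 0
                =  x                                    if alpha = 0
                =  (exp(alpha*x) - 1) / alpha + alpha   if alpha > 0

    For alpha < 0 the expression is only defined where
    1 - alpha*(x + alpha) > 0.
    """
    if alpha < 0:
        return -math.log(1 - alpha * (x + alpha)) / alpha
    if alpha == 0:
        return x
    return (math.exp(alpha * x) - 1) / alpha + alpha
```

Setting α = -1 gives ln(x), α = 0 gives x, and α = 1 gives eˣ, which is why sums under the function can realize products and other "natural operations" the abstract mentions.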

Paper Nr: 97
Title:

Optimizing Dependency Parsing Throughput

Authors:

Albert Weichselbraun and Norman Süsstrunk

Abstract: Dependency parsing is considered a key technology for improving information extraction tasks. Research indicates that dependency parsers spend more than 95% of their total runtime on feature computations. Based on this insight, this paper investigates the potential of improving parsing throughput by designing feature representations that are optimized for combining single features into more complex feature templates, and by optimizing parser constraints. Applying these techniques to MDParser increased its throughput fourfold, yielding Syntactic Parser, a dependency parser that outperforms comparable approaches by a factor of 25 to 400.

Posters
Paper Nr: 3
Title:

Predicting Stock Market Movement: An Evolutionary Approach

Authors:

Salah Bouktif and Mamoun Adel Awad

Abstract: Social networks are becoming very popular sources of all kinds of data. They allow a wide range of users to interact, socialize and express spontaneous opinions. The overwhelming amount of exchanged data on businesses, companies and governments makes it possible to perform predictions and discover trends in many domains. In this paper we propose a new prediction model for the stock market movement problem based on collective classification. The model uses a number of public mood states as inputs to predict up and down movements of the stock market. The proposed approach to building such a model simultaneously promotes performance and interpretability. By interpretability, we mean the ability of a model to explain its predictions. A particular implementation of our approach is based on the Ant Colony Optimization algorithm and customized for individual Bayesian classifiers. Our approach is validated with data collected from social media on the stock of a prestigious company. The promising results of our approach are compared with four alternative prediction methods, namely bagging, AdaBoost, best expert, and an expert trained on all the available data.

Paper Nr: 5
Title:

Chinese-keyword Fuzzy Search and Extraction over Encrypted Patent Documents

Authors:

Wei Ding, Yongji Liu and Jianfeng Zhang

Abstract: Cloud storage for information sharing is likely indispensable to the future national defence library in China, e.g., for searching national defence patent documents, while security risks need to be maximally avoided using data encryption. Patent keywords are a high-level summary of a patent document, and it is significant in practice to efficiently extract and search the keywords in patent documents. Due to the particularities of Chinese keywords, most existing algorithms designed for English become ineffective in Chinese scenarios. For extracting keywords from patent documents, manual keyword extraction is impractical when the number of files is large. An improved method based on term frequency–inverse document frequency (TF-IDF) is proposed to automatically extract the keywords in the patent literature. The extracted keyword sets also help to accelerate keyword search by linking a finite set of keywords with a large number of documents. Fuzzy keyword search is introduced to further increase search efficiency in cloud computing scenarios compared to exact keyword search methods. Based on Chinese Pinyin similarity, a Pinyin-gram-based algorithm is proposed for fuzzy search in an encrypted Chinese environment, and a keyword trapdoor search index structure based on the n-ary tree is designed. Both the search efficiency and the accuracy of the proposed scheme are verified through computer experiments.
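The TF-IDF weighting that the improved extraction method builds on can be sketched as follows. This is the textbook tf(t, d) · log(N / df(t)) score over tokenized documents; the paper's refinements for Chinese patent text are not reproduced, and the helper name is illustrative.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=5):
    """Rank candidate keywords of one document by TF-IDF.

    docs is a list of token lists. Terms that occur in every document
    get idf = log(N / N) = 0 and therefore never rank highly.
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    tf = Counter(docs[doc_index])       # term frequency in the target doc
    scores = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [t for t, _ in ranked[:top_k]]
```

A term frequent in one document but rare across the collection (e.g. "encryption" below) outranks a term like "patent" that appears everywhere.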

Paper Nr: 19
Title:

Real-time Local Topic Extraction using Density-based Adaptive Spatiotemporal Clustering for Enhancing Local Situation Awareness

Authors:

Tatsuhiro Sakai, Keiichi Tamura, Shota Kotozaki, Tsubasa Hayashida and Hajime Kitakami

Abstract: In the era of big data, we are witnessing the rapid growth of a new type of information source. In particular, Twitter is one of the most widely used microblogging services for situation awareness during emergencies. In our previous work, we focused on geotagged tweets posted on Twitter that include location information as well as a time and a text message. We previously developed a real-time analysis system using the (ε,τ)-density-based adaptive spatiotemporal clustering algorithm to analyze local topics and events. This spatiotemporal analysis system successfully detects emerging bursty areas in which geotagged tweets related to observed topics are actively posted; however, the system is tailor-made for a particular observed topic and therefore cannot identify other topics. To address this issue, we propose a new real-time spatiotemporal analysis system for enhancing local situation awareness using a density-based adaptive spatiotemporal clustering algorithm. In the proposed system, local bursty keywords are extracted and their bursty areas are identified. We evaluated the proposed system using real-world topics related to weather in Japan. Experimental results show that the proposed system can extract local topics and events.

Paper Nr: 29
Title:

Aftermath of 2008 Financial Crisis on Oil Prices

Authors:

Neha Sehgal and Krishan K. Pandey

Abstract: Geopolitical and economic events have had a strong impact on crude oil markets for over 40 years. Oil prices rose steadily for several years and in July 2008 stood at a record high of $145 per barrel, then plunged to $43 per barrel by the end of 2008. There is a need to identify appropriate features (factors) explaining the characteristics of oil markets during boom and downturn periods. Feature selection can help in identifying the most informative and influential input variables before and after the financial crisis. The study uses an extended version of the MI3 algorithm, the I2MI2 algorithm, together with a general regression neural network as the forecasting engine to examine the explanatory power of the selected features and their contribution to driving oil prices. The study uses features selected by the proposed methodology for one-month-ahead and twelve-month-ahead forecast horizons. The forecasts from the proposed methodology outperformed EIA’s STEO estimates. Results show that reserves and speculation were the main players before the crisis and that the overall mechanism was broken by the 2008 global financial crisis. The contribution of the emerging economy (China) emerged as an important variable in explaining the direction of oil prices. EPPI and CPI remained the building blocks before and after the crisis, while the influence of non-OECD consumption rose after the crisis.

Paper Nr: 31
Title:

Topic Oriented Auto-completion Models - Approaches Towards Fastening Auto-completion Systems

Authors:

Stefan Prisca, Mihaela Dinsoreanu and Camelia Lemnaru

Abstract: In this paper we propose an autocompletion approach suitable for mobile devices that aims to reduce the overall data model size and to speed up query processing while not employing any language-specific processing. The approach relies on topic information from the input documents to split the data models by topic and index them in a way that allows fast identification through their corresponding topic. In doing so, the size of the data model used for prediction is decreased to almost one fifth of the size of a model that contains all topics, and query processing becomes twice as fast, while maintaining the same precision obtained with a model that contains all topics.
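The per-topic indexing idea can be sketched with one sorted term list per topic and a binary search for the prefix. The class below is an illustrative assumption (the paper's actual model and its topic-assignment step are not reproduced): only the sub-model for the current topic is scanned, which is what shrinks the working model.

```python
from bisect import bisect_left

class TopicCompleter:
    """Prefix completion over per-topic sub-models.

    Each topic keeps its own sorted, de-duplicated term list, so a query
    touches only the model of its topic rather than one global model.
    """
    def __init__(self, topic_terms):
        # topic_terms: {topic: iterable of terms}
        self.index = {t: sorted(set(terms)) for t, terms in topic_terms.items()}

    def complete(self, topic, prefix, limit=3):
        terms = self.index.get(topic, [])
        i = bisect_left(terms, prefix)       # first term >= prefix
        out = []
        while i < len(terms) and terms[i].startswith(prefix) and len(out) < limit:
            out.append(terms[i])
            i += 1
        return out
```

Because the terms are sorted, all completions of a prefix form a contiguous run starting at the `bisect_left` position.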

Paper Nr: 32
Title:

CIRA: A Competitive Intelligence Reference Architecture for Dynamic Solutions

Authors:

Marco Spruit and Alex Cepoi

Abstract: Competitive Intelligence (CI) solutions are the key to enabling companies to stay on top of the changes in today’s competitive environment. It may come as a surprise, then, that although Competitive Intelligence solutions have existed for a few decades, there is still little knowledge available regarding the implementation of an automated Competitive Intelligence solution. This research focuses on designing a Competitive Intelligence Reference Architecture (CIRA) for dynamic systems. We start by collecting Key Intelligence Topics (KITs) and functional requirements based on an extensive literature review and expert interviews with companies and Competitive Intelligence professionals. Next, we design the architecture itself based on an attribute-driven design method. Then, a prototype is implemented for a company active in the maritime & offshore industry. Finally, the architecture is evaluated by industry experts and their suggestions are incorporated back into the artefact.

Paper Nr: 33
Title:

Clustering Stability and Ground Truth: Numerical Experiments

Authors:

Maria José Amorim and Margarida Cardoso

Abstract: Stability has been considered an important property for evaluating clustering solutions. Nevertheless, there are no conclusive studies on the relationship between this property and the capacity to recover clusters inherent to the data (“ground truth”). This study focuses on this relationship, resorting to synthetic data generated under diverse scenarios (controlling relevant factors). Stability is evaluated using a weighted cross-validation procedure. Indices of agreement (corrected for agreement by chance) are used both for assessing stability and for external validation. The results obtained reveal a new perspective so far not mentioned in the literature. Despite the clear relationship between stability and external validity when a broad range of scenarios is considered, within-scenario conclusions deserve special attention: faced with a specific clustering problem (as we are in practice), there is no significant relationship between stability and the ability to recover data clusters.

Paper Nr: 34
Title:

Biomedical Question Types Classification using Syntactic and Rule based Approach

Authors:

Mourad Sarrouti, Abdelmonaime Lachkar and Said Alaoui Ouatik

Abstract: Biomedical question type (QT) classification is an important component of biomedical question answering systems and has attracted a notable amount of research over the past decade. It is the task of determining the QT of a given biomedical question, classifying biomedical questions into several question types. The question type, in turn, determines the appropriate answer extraction algorithm. In this paper, we propose an effective and efficient method for biomedical QT classification. We classify biomedical questions into three broad categories and define syntactic patterns for each particular category of biomedical questions. Using these biomedical question patterns, we propose an algorithm for classifying a question into a particular category. The proposed method was evaluated on benchmark datasets of biomedical questions. The experimental results show that the proposed method can classify biomedical questions effectively with high accuracy.

Paper Nr: 42
Title:

Annotating Real Time Twitter’s Images/Videos Basing on Tweets

Authors:

Mohamed Kharrat, Anis Jedidi and Faiez Gargouri

Abstract: Nowadays, the online social network Twitter represents a huge source of unrefined information in various formats (text, video, photo), especially during events and abnormal cases/incidents. New features of the Twitter mobile application now allow users to publish photos directly online. This paper focuses on photos/videos taken by users and published in real time using only mobile devices. The aim is to find candidates for annotation in the tweet stream and then to annotate them by taking into account several features based only on tweets. A preprocessing step is necessary to exclude all useless tweets; we then process the textual content of the remaining ones. As a final step, we consider an additional characterization (spatiotemporal and saliency) to produce the annotation outcome as RDF triples.

Paper Nr: 56
Title:

Comparing Summarisation Techniques for Informal Online Reviews

Authors:

Mhairi McNeill, Robert Raeside, Martin Graham and Isaac Roseboom

Abstract: In this paper we evaluate three methods for summarising game reviews written in a casual style. This was done in order to create a review summarisation system to be used by clients of deltaDNA. We look at one well-known method based on natural language processing and describe two statistical methods that could be used for summarisation: one based on TF-IDF scores and another using supervised latent Dirichlet allocation. We find, due to the informality of these online reviews, that natural-language-based techniques work less well than they do on other types of reviews, and we recommend using techniques based on the statistical properties of word frequencies. In particular, we chose a TF-IDF-score-based approach for the final system.

Paper Nr: 59
Title:

Learning Text Patterns to Detect Opinion Targets

Authors:

Filipa Peleja and João Magalhães

Abstract: Exploiting sentiment relations to capture opinion targets has recently caught the interest of many researchers. In many cases the target entities are themselves part of the sentiment lexicon, creating a loop from which it is difficult to infer the overall sentiment towards the target entities. In the present work we propose to detect opinion targets by extracting syntactic patterns from short texts. Experiments show that our method was able to successfully extract 1,879 opinion targets out of a total of 2,052. Furthermore, the proposed method obtains results comparable to the SemEval 2015 opinion target models, in which we observed the syntactic relation that exists between sentiment words and their targets.

Paper Nr: 67
Title:

Early Diagnosis of Alzheimer's Disease using Machine Learning Techniques - A Review Paper

Authors:

Aunsia Khan and Muhammad Usman

Abstract: Alzheimer’s disease (AD), an irreparable brain disease, impairs thinking and memory while the overall brain shrinks, which ultimately leads to death. Early diagnosis of AD is essential for the development of more effective treatments. Machine learning (ML), a branch of artificial intelligence, employs a variety of probabilistic and optimization techniques that permit computers to learn from vast and complex datasets. As a result, researchers frequently focus on using machine learning for the diagnosis of early stages of AD. This paper presents a review, analysis and critical evaluation of the recent work done on the early detection of AD using ML techniques. Several methods achieved promising prediction accuracies; however, they were evaluated on different, pathologically unproven datasets from different imaging modalities, making it difficult to make a fair comparison among them. Moreover, many other factors, such as pre-processing, the number of important attributes for feature selection, and class imbalance, distinctively affect the assessment of the prediction accuracy. To overcome these limitations, a model is proposed which comprises an initial pre-processing step followed by the selection of imperative attributes, with classification achieved using association rule mining. Furthermore, this proposed model-based approach gives the right direction for research in the early diagnosis of AD and has the potential to distinguish AD from healthy controls.

Paper Nr: 69
Title:

A Cloud Platform for Classification and Resource Management of Complex Electromagnetic Problems

Authors:

Andreas Kapsalis, Panagiotis Kasnesis, Panagiotis C. Theofanopoulos, Panagiotis K. Gkonis, Christos Lavranos, Dimitra Kaklamani, Iakovos S. Venieris and George Kyriacou

Abstract: Most scientific applications tend to have a very resource-demanding nature, and the simulation of such scientific problems often requires a prohibitive amount of time to complete. Distributed computing offers a solution by segmenting the application into smaller processes and allocating them to a cluster of workers. This model was widely followed by grid computing. However, cloud computing emerges as a strong alternative by offering reliable solutions for resource-demanding applications and workflows of a scientific nature. In this paper we propose a cloud platform that supports the simulation of complex electromagnetic problems and incorporates classification (SVM) and resource allocation (Ant Colony Optimization) methods for the effective management of these simulations.

Paper Nr: 71
Title:

EasySDM - An Integrated and Easy to Use Spatial Data Mining Platform

Authors:

Leila Hamdad, Amine Abdaoui, Nabila Belattar and Mohamed Al Chikha

Abstract: Spatial data mining allows users to extract implicit but valuable knowledge from spatially related data. Two main approaches have been used in the literature: the first applies standard data mining algorithms after a spatial pre-processing step, while the second develops specific algorithms that consider the spatial relations inside the mining process. In this work, we first present a study of existing spatial data mining tools according to the implemented tasks and their specific characteristics. Then, we present a new open-source spatial data mining platform (EasySDM) that integrates both approaches (pre-processing and dynamic mining). It offers a set of algorithms belonging to clustering, classification and association rule mining tasks. Moreover, and more importantly, it allows geographic visualization of both the data and the results, either via an internal map display or using any external Geographic Information System.

Paper Nr: 72
Title:

Arabic Sentiment Analysis using WEKA a Hybrid Learning Approach

Authors:

Sarah Alhumoud, Tarfa Albuhairi and Mawaheb Altuwaijri

Abstract: Data has become the currency of this era and continues to increase massively in size and generation rate. Large data generated from organisations’ e-transactions or by individuals through social networks can be of great value when analysed properly. This research presents an implementation of a sentiment analyser for Twitter, one of the biggest public and freely available big data sources. It analyses Arabic Saudi-dialect tweets to extract sentiments toward a specific topic, using a dataset consisting of 3,000 tweets collected from Twitter. The collected tweets were analysed using two machine learning approaches: a supervised approach, trained on the collected dataset, and the proposed hybrid learning approach, trained on a single-word dictionary. Two algorithms are used, Support Vector Machine (SVM) and K-Nearest Neighbors (KNN). The results obtained by cross-validation on the same dataset clearly confirm the superiority of the hybrid learning approach over the supervised approach.
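The dictionary side of such a hybrid approach can be sketched as a single-word polarity vote. The helper and the lexicon entries below are invented for illustration; in the paper the labels produced this way would feed the SVM/KNN training step in WEKA, which is not shown here.

```python
def lexicon_label(tokens, lexicon):
    """Label a tokenized tweet by summing single-word polarities.

    lexicon maps a word to a signed polarity score; unknown words
    contribute nothing. The sign of the total decides the label.
    """
    score = sum(lexicon.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The labels obtained from the dictionary can then serve as training targets for a supervised classifier, which is the essence of the hybrid setup.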

Paper Nr: 74
Title:

Hybrid Sentiment Analyser for Arabic Tweets using R

Authors:

Sarah Alhumoud, Tarfa Albuhairi and Wejdan Alohaideb

Abstract: Harvesting meaning from massively increasing data can be of great value to organizations. Twitter is one of the biggest public and freely available data sources. This paper presents a hybrid learning implementation of sentiment analysis combining lexicon and supervised approaches, analysing Arabic Saudi-dialect tweets to extract sentiments toward a specific topic. This was done using a dataset consisting of 3,000 tweets collected in three domains. The results obtained confirm the superiority of the hybrid learning approach over the supervised and unsupervised approaches.

Paper Nr: 80
Title:

Knowledge Discovery and Modeling based on Conditional Fuzzy Clustering with Interval Type-2 Fuzzy

Authors:

Yeong-Hyeon Byeon and Keun-Chang Kwak

Abstract: This paper is concerned with a method for designing an improved Linguistic Model (LM) using Conditional Fuzzy Clustering (CFC) with two different Interval Type-2 (IT2) fuzzy approaches. The fuzzification factor and contexts with the IT2 fuzzy approach are used to deal with the uncertainty of clustering. The proposed clustering technique estimates the prototypes by preserving the homogeneity between the clustered patterns obtained from the IT2-based contexts, and controls the amount of fuzziness of the fuzzy c-partition. Thus, the proposed method can represent nonlinear and complex characteristics more effectively than the conventional LM. Partial experimental results on the coagulant dosing process in a water purification plant revealed that the proposed method performs better than previous works.

Paper Nr: 85
Title:

Customer Tracking Systems based on Identifiers of Mobile Phones

Authors:

D. M. Mikhaylov, A. V. Zuykov, S. M. Kharkov, S. V. Ponomarev, S. V. Dvoryankin and A. M. Tolstaya

Abstract: Gathering statistics about visitors finds more and more applications in various fields of business and commerce. This paper describes a system for the impersonal counting of unique visitors by their mobile identifiers. Counting is carried out using a non-functioning communication cell (the system does not provide communication services to users of mobile networks). The system masks itself as the base station of a mobile operator, and mobile devices automatically connect to it even in the presence of a strong signal from the towers of mobile operators. Once a device is connected, its user identification data is received. The proposed solution makes it possible to compare the number of visits to a particular site across various periods of time and to identify returning visitors. The system is inexpensive and shows 99% accuracy in the identification of users (compared to the real data about the visitors).

Paper Nr: 91
Title:

A New Approach for Collaborative Filtering based on Bayesian Network Inference

Authors:

Loc Nguyen

Abstract: Collaborative filtering (CF) is one of the most popular recommendation algorithms, in which the items recommended to users are determined from the outcomes of surveying their communities. There are two main CF approaches, memory-based and model-based. The model-based approach is more dominant in real-time response because it takes advantage of an inference mechanism in the recommendation task. However, the problem of incomplete data is still an open research issue, and inference engines continue to be improved so as to gain higher accuracy and speed. I propose a new model-based CF method that applies a Bayesian network (BN) in the inference engine, with the assertion that the BN is an optimal inference model: the BN captures users' purchase patterns, and Bayesian inference is an evidence-based inference mechanism appropriate to the rating database. Because the quality of the BN relies on the completeness of the training data, it degrades if the training data have many missing values. I therefore also suggest an averaging technique to fill in missing values.
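The abstract does not specify how the averaging technique fills in missing ratings; one common interpretation, sketched below purely as an assumption, is to replace each missing entry with the mean of that user's known ratings (falling back to the global mean for users with no ratings). The rating matrix here is invented for illustration:

```python
def fill_missing(ratings):
    """Replace missing ratings (None) with the mean of that user's known
    ratings; fall back to the global mean when a user has rated nothing."""
    known = [r for row in ratings for r in row if r is not None]
    global_mean = sum(known) / len(known)
    filled = []
    for row in ratings:
        user_known = [r for r in row if r is not None]
        mean = sum(user_known) / len(user_known) if user_known else global_mean
        filled.append([mean if r is None else r for r in row])
    return filled

# Toy user-item rating matrix: rows are users, columns are items.
ratings = [
    [5, None, 4],    # user 0: mean of known ratings is 4.5
    [None, 2, None], # user 1: mean of known ratings is 2.0
]
print(fill_missing(ratings))  # -> [[5, 4.5, 4], [2.0, 2, 2.0]]
```

The completed matrix could then be used to train the Bayesian network without missing-value gaps.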

Paper Nr: 94
Title:

New Classification Models for Detecting Hate and Violence Web Content

Authors:

Shuhua Liu and Thomas Forss

Abstract: Today, the presence of harmful and inappropriate content on the web remains one of the primary concerns for web users. Early web classification models were limited by the methods and data available. In our research we revisit the web classification problem by applying new methods and techniques for text content analysis. Our recent studies have indicated the promising potential of combining topic analysis and sentiment analysis in web content classification. In this paper we further explore new ways to improve and maximize classification performance, especially to enhance precision and reduce false positives, through examination and handling of class imbalance issues, and through the incorporation of LDA topic models.

Paper Nr: 95
Title:

Multiple Behavioral Models: A Divide and Conquer Strategy to Fraud Detection in Financial Data Streams

Authors:

Roberto Saia, Ludovico Boratto and Salvatore Carta

Abstract: The exponential and rapid growth of E-commerce, based both on the new opportunities offered by the Internet and on the spread of debit and credit card use in online purchases, has strongly increased the number of frauds, causing large economic losses to the businesses involved. The design of effective strategies to face this problem is particularly challenging, due to several factors such as the heterogeneity and non-stationary distribution of the data stream, as well as the presence of an imbalanced class distribution. The problem is complicated further by the scarcity of public datasets, for confidentiality reasons, which prevents researchers from verifying new strategies in many data contexts. Differently from the canonical state-of-the-art strategies, instead of defining a unique model based on the past transactions of the users, we follow a Divide and Conquer strategy, defining multiple models (user behavioral patterns) that we exploit to evaluate a new transaction and detect potential attempts of fraud. We can act on some parameters of this process in order to adapt the models' sensitivity to the operating environment. Since our models do not need to be trained on both the past legitimate and fraudulent transactions of a user, using only the legitimate ones, we can operate in a proactive manner, detecting fraudulent transactions that have never occurred in the past. This approach also overcomes the data imbalance problem that afflicts machine learning approaches. The proposed approach is evaluated by comparing it with one of the best-performing state-of-the-art approaches, Random Forests, using a real-world credit card dataset.
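The abstract describes per-user behavioral models trained only on legitimate transactions, with a tunable sensitivity parameter. The paper's actual models are not specified here; the sketch below is a deliberately simplified stand-in that flags a transaction when its amount deviates too far from a user's historical pattern (all transaction figures and the threshold rule are invented for illustration):

```python
import statistics

class UserModel:
    """Behavioral model built only from a user's past *legitimate*
    transaction amounts; no fraud examples are needed for training."""

    def __init__(self, amounts, sensitivity=2.0):
        self.mean = statistics.mean(amounts)
        self.std = statistics.pstdev(amounts)
        self.sensitivity = sensitivity  # tunable: lower -> more alerts

    def is_suspicious(self, amount):
        """Flag a transaction whose amount deviates from the user's mean
        by more than `sensitivity` standard deviations."""
        return abs(amount - self.mean) > self.sensitivity * max(self.std, 1e-9)

# Toy history of one user's legitimate purchase amounts.
model = UserModel([20.0, 25.0, 22.0, 30.0, 23.0])
print(model.is_suspicious(500.0))  # far outside the user's pattern -> True
print(model.is_suspicious(24.0))   # consistent with the pattern -> False
```

Because the model never sees fraudulent examples, it can in principle flag fraud patterns that have never occurred before, which is the proactive property the abstract emphasizes.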

Paper Nr: 96
Title:

Automatic Tag Extraction from Social Media for Visual Labeling

Authors:

Shuhua Liu and Thomas Forss

Abstract: Visual labeling, or automated visual annotation, is of great importance to the efficient access and management of multimedia content. Many methods and techniques have been proposed for image annotation in the last decade, and they have shown reasonable performance on standard datasets. Great progress has been made especially in the last couple of years with the development of deep learning models for image content analysis and the extraction of content-based concept labels. However, concept object labels are much friendlier to machines than to users. We consider that more relevant and user-friendly visual labels need to include "context" descriptors. In this study we explore the possibilities of leveraging social media content as a resource for visual labeling. We developed a tag extraction system that applies heuristic rules and a term weighting method to extract image tags from associated tweets. The system retrieves tweet-image pairs from public Twitter accounts, analyzes the tweets, and generates labels for the images. We elaborate on different visual labeling, tag analysis and tag refinement methods.
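The abstract mentions heuristic rules plus a term weighting method for extracting image tags from tweets, without detailing either. As an illustrative sketch under assumed choices (hashtags as the heuristic rule, a simple tf-idf weighting, and an invented toy corpus), tag extraction might look like:

```python
import math
import re
from collections import Counter

def extract_tags(tweet, corpus, n=3):
    """Extract candidate image tags from a tweet: hashtags first
    (heuristic rule), then the highest tf-idf weighted remaining terms."""
    hashtags = re.findall(r"#(\w+)", tweet)
    words = re.findall(r"\b[a-z]+\b", tweet.lower())
    tf = Counter(words)
    n_docs = len(corpus)

    def idf(term):
        # Smoothed inverse document frequency over the reference corpus.
        df = sum(1 for doc in corpus if term in doc.lower())
        return math.log((n_docs + 1) / (df + 1)) + 1

    weighted = sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)
    tags = hashtags + [t for t in weighted if t not in hashtags]
    return tags[:n]

# Invented reference corpus of other tweet texts.
corpus = ["sunset at the beach", "city lights at night", "the beach again"]
print(extract_tags("Beautiful #sunset at the beach", corpus, n=3))
```

Here the hashtag "sunset" ranks first by rule, and "beautiful" scores highest among the remaining words because it is rare in the corpus; the paper's actual rules and weighting scheme may differ.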

Paper Nr: 99
Title:

Overlapping Kernel-based Community Detection with Node Attributes

Authors:

Daniele Maccagnola, Elisabetta Fersini, Rabah Djennadi and Enza Messina

Abstract: Community detection is a fundamental task in the field of Social Network Analysis, extensively studied in the literature. Recently, some approaches have been proposed to detect communities by distinguishing their members between kernel members, who represent opinion leaders, and auxiliary members, who are not leaders but are linked to them. However, these approaches suffer from two important limitations: first, they cannot identify overlapping communities, which are often found in social networks (users are likely to belong to multiple groups simultaneously); second, they cannot deal with node attributes, which can provide important information related to community affiliation. In this paper we propose a method that improves a well-known kernel-based approach named Greedy-WeBA (Wang et al., 2011) and overcomes these limitations. We perform a comparative analysis on three social network datasets, Wikipedia, Twitter and Facebook, showing that modeling overlapping communities and considering node attributes strongly improves the ability to detect real social network communities.

Paper Nr: 106
Title:

Learning Query Expansion from Association Rules Between Terms

Authors:

Ahlem Bouziri, Chiraz Latiri, Eric Gaussier and Yassin Belhareth

Abstract: Query expansion offers an interesting solution for obtaining a complete answer to a user query while preserving the quality of retained documents. This mainly relies on an accurate choice of the terms added to the initial query. In this paper, we use data mining methods to extract dependencies between terms, namely a generic basis of association rules between terms. Faced with the huge number of derived association rules, and in order to select the optimal combination of query terms from the generic basis, we propose to model the problem as a classification problem and solve it using a supervised learning algorithm. For this purpose, we first generate a training set using a genetic algorithm based approach that explores the association rule space to find an optimal set of expansion terms improving the MAP of the search results. We then build a model able to predict which association rules are to be used when expanding a query. The experiments were performed on the SDA 95 collection, a data collection for information retrieval. The main observation is that hybridizing text-mining techniques and query expansion in an intelligent way allows us to combine the good features of both. As this is a preliminary attempt in this direction, there is large scope for enhancing the proposed method.
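The core idea of association rules between terms can be sketched very simply: a rule "a implies b" holds when documents containing term a also contain term b with high confidence, and such rules suggest expansion terms for a query. The following minimal Python sketch mines single-term rules from a toy document set (the documents, the confidence threshold, and the helper names are all invented for illustration; the paper's generic basis and genetic-algorithm selection are far more elaborate):

```python
from collections import Counter
from itertools import combinations

def term_rules(docs, min_conf=0.6):
    """Mine simple term -> term association rules from document term sets."""
    docsets = [set(d.lower().split()) for d in docs]
    support = Counter()       # documents containing each term
    pair_support = Counter()  # documents containing each term pair
    for s in docsets:
        for t in s:
            support[t] += 1
        for a, b in combinations(sorted(s), 2):
            pair_support[(a, b)] += 1
    rules = {}
    for (a, b), n in pair_support.items():
        if n / support[a] >= min_conf:   # confidence of a -> b
            rules.setdefault(a, []).append(b)
        if n / support[b] >= min_conf:   # confidence of b -> a
            rules.setdefault(b, []).append(a)
    return rules

def expand(query, rules):
    """Append terms implied by association rules to the original query."""
    qterms = query.split()
    extra = [t for q in qterms for t in rules.get(q, []) if t not in qterms]
    return qterms + extra

docs = ["java programming language",
        "java programming tutorial",
        "python programming language"]
rules = term_rules(docs)
print(expand("java", rules))  # "java" co-occurs reliably with "programming"
```

In the paper's setting, a supervised classifier then decides which of the many candidate rules should actually be applied for a given query.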

Paper Nr: 107
Title:

Usearch: A Meta Search Engine based on a New Result Merging Strategy

Authors:

Tarek Alloui, Imane Boussebough, Allaoua Chaoui, Ahmed Zakaria Nouar and Mohamed Chaouki Chettah

Abstract: Meta search engines are search tools developed to improve retrieval performance by submitting user queries to multiple search engines and combining the different search results into a unified ranked list. The effectiveness of a meta search engine is closely related to the result merging strategy it employs, and this merging strategy remains the main issue in the design of such systems. With only the user query available as evidence of the user's information needs, it is hard to find the best ranking of the merged results. We present in this paper a new strategy for merging multiple search engine results using only the user query as a relevance criterion. We propose a new score function combining the similarity between the user query and the retrieved results with the users' satisfaction toward the search engines used. The proposed meta search engine can be used to merge the search results of any set of search engines.
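The abstract's score function is described only as combining query-result similarity with per-engine user satisfaction. A minimal sketch of that idea, assuming Jaccard similarity over result titles, a multiplicative combination, and invented engine names and weights (none of which come from the paper), could look like:

```python
def merge_results(query, engine_results, satisfaction):
    """Merge ranked lists from several engines into one list, scoring each
    result by (engine satisfaction weight) x (query-title similarity)."""
    q = set(query.lower().split())
    scored = {}
    for engine, results in engine_results.items():
        weight = satisfaction[engine]  # per-engine user satisfaction in [0, 1]
        for title in results:
            t = set(title.lower().split())
            sim = len(q & t) / len(q | t)  # Jaccard similarity (illustrative)
            # Keep the best score for results returned by several engines.
            scored[title] = max(scored.get(title, 0.0), weight * sim)
    return sorted(scored, key=scored.get, reverse=True)

# Invented engines, results and satisfaction weights.
engine_results = {
    "engineA": ["cheap flight deals", "hotel booking"],
    "engineB": ["flight deals today", "cheap flight deals"],
}
satisfaction = {"engineA": 0.9, "engineB": 0.6}
print(merge_results("cheap flight deals", engine_results, satisfaction))
```

Results appearing in several lists keep their best score, so an overlap between engines does not penalize a document.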

Paper Nr: 109
Title:

Distributed Data Replication and Access Optimization for LHCb Storage System - A Position Paper

Authors:

Mikhail Hushchyn, Philippe Charpentier and Andrey Ustyuzhanin

Abstract: This paper presents how machine learning algorithms and statistical methods can be applied to data management in hybrid data storage systems. Basically, two different storage types are used to store data in hybrid data storage systems: keeping rarely used data on cheap but slow storage of the first type, and frequently used data on fast but expensive storage of the second type, helps to achieve an optimal performance/cost ratio for the system. We use classification algorithms to estimate the probability that data will be used often in the future. Then, using risk analysis, we decide where the data should be stored. We show how to estimate the optimal number of replicas of the data using regression algorithms and a Hidden Markov Model. Based on the probability, the risks and the optimal number of data replicas, our system finds an optimal data distribution in the hybrid data storage system. We present the results of a simulation of our method for the LHCb hybrid data storage.
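The probability-plus-risk placement decision described above can be reduced to a very small sketch: estimate an access probability, then place the data on the fast tier only when the expected penalty of slow reads outweighs the extra storage cost. All numbers, cost figures, and the naive probability estimate below are invented assumptions for illustration; the paper uses trained classifiers rather than a simple frequency estimate:

```python
def access_probability(recent_accesses, window_days=30):
    """Naive estimate of the chance the data is accessed on a given day,
    from its access count over a recent window (stand-in for a classifier)."""
    return min(recent_accesses / window_days, 1.0)

def choose_tier(recent_accesses, cost_fast=5.0, cost_slow=1.0, miss_penalty=20.0):
    """Risk analysis: use the fast tier when the expected penalty of
    serving the data from slow storage exceeds the fast-tier cost."""
    p = access_probability(recent_accesses)
    expected_slow_cost = cost_slow + p * miss_penalty  # risk of slow reads
    return "fast" if expected_slow_cost > cost_fast else "slow"

print(choose_tier(25))  # frequently accessed data -> "fast"
print(choose_tier(1))   # rarely accessed data -> "slow"
```

The same expected-cost comparison extends naturally to choosing a replica count, which the paper estimates with regression and a Hidden Markov Model.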