KDIR 2012 Abstracts


Full Papers
Paper Nr: 12
Title:

Categorization of Very Short Documents

Authors:

Mika Timonen

Abstract: Categorization of very short documents has become an important research topic in the field of text mining. Twitter status updates and market research data form an interesting corpus of documents that are in most cases less than 20 words long. Short documents have one major characteristic that differentiates them from traditional longer documents: each word usually occurs only once per document. This is called the TF=1 challenge. In this paper we conduct a comprehensive performance comparison of the current feature weighting and categorization approaches using corpora of very short documents. In addition, we propose a novel feature weighting approach called Fragment Length Weighted Category Distribution that takes the challenges of short documents into consideration. The proposed approach is based on previous work on Bi-Normal Separation and on short document categorization using a Naive Bayes classifier. We compare the performance of the proposed approach against several traditional approaches including Chi-Squared, Mutual Information, Term Frequency-Inverse Document Frequency and Residual Inverse Document Frequency. We also compare the performance of a Support Vector Machine classifier against other classification approaches such as k-Nearest Neighbors and Naive Bayes classifiers.
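The TF=1 challenge described above can be seen directly in a standard TF-IDF computation: when every term frequency equals 1, the weight collapses to plain IDF. A minimal illustrative sketch (not the paper's method; the corpus and terms below are invented):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for each term in each document.

    With very short documents, term frequency is almost always 1
    (the TF=1 challenge), so the weight effectively reduces to IDF.
    """
    n = len(docs)
    # document frequency: in how many documents does each term appear
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Two 5-word "tweets": every term occurs exactly once per document,
# so TF contributes nothing and only IDF discriminates.
docs = [["cheap", "flights", "to", "london", "today"],
        ["cheap", "hotel", "deals", "in", "paris"]]
w = tfidf(docs)
```

Here `cheap` appears in both documents, so its IDF (and hence its entire TF-IDF weight) is zero, while the discriminative terms keep a non-zero weight driven entirely by IDF.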

Paper Nr: 29
Title:

Finite Belief Fusion Model for Hidden Source Behavior Change Detection

Authors:

Eugene Santos Jr., Qi Gu, Eunice E. Santos and John Korah

Abstract: A person’s beliefs and attitudes may change multiple times as they gain additional information/perceptions from various external sources, which, in turn, may affect their subsequent behavior. Such influential sources, however, are often invisible to the public for a variety of reasons: private communications, what one happens to read or hear, and implicit social hierarchies, to name a few. Many efforts have focused on detecting distribution variations; however, the underlying reason for the variation has yet to be fully studied. In this paper, we present a novel approach and algorithm to detect such hidden sources, as well as to capture and characterize the patterns of their impact with regard to the belief-changing trend. We formalize this problem as a finite belief fusion model and solve it via an optimization method. Finally, we compare our work with general mixture models, e.g., the Gaussian Mixture Model. We present promising preliminary results obtained from proof-of-concept experiments conducted on both synthetic data and a real-world scenario.

Paper Nr: 35
Title:

Model Selection and Stability in Spectral Clustering

Authors:

Zeev Volkovich and Renata Avros

Abstract: We study an open problem in spectral clustering: automatically determining the number of clusters. We generalize the scale-parameter selection method offered in the Ng-Jordan-Weiss (NJW) algorithm and reveal a connection with the distance-learning methodology. Values of the scaling parameter estimated by clustering drawn samples are treated as a cluster stability indicator, such that the number of clusters corresponding to the most concentrated distribution is accepted as the “true” number of clusters. Numerical experiments demonstrate the high potential of the proposed method.
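As background to the scale parameter discussed above, a minimal sketch of the Gaussian affinity matrix used by the NJW algorithm, where the scaling parameter sigma controls how quickly similarity decays with Euclidean distance (the toy 2-D points are invented; this is not the paper's selection method itself):

```python
import math

def njw_affinity(points, sigma):
    """Gaussian affinity matrix A[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    with a zero diagonal, as used in NJW spectral clustering."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                A[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    return A

# Two nearby points and one distant outlier: with sigma = 1.0 the
# nearby pair keeps affinity near 1, the outlier near 0.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
A = njw_affinity(points, sigma=1.0)
```

Choosing sigma well is exactly the problem the abstract addresses: too small and every point looks isolated, too large and all affinities approach 1.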

Paper Nr: 37
Title:

Infinite Topic Modelling for Trend Tracking - Hierarchical Dirichlet Process Approaches with Wikipedia Semantic based Method

Authors:

Yishu Miao, Chunping Li, Hui Wang and Lu Zhang

Abstract: The current affairs that people follow closely vary across periods, and the evolution of trends is reflected in media reports. This paper considers tracking trends by incorporating temporal information into non-parametric Bayesian approaches and presents two topic modelling methods. One utilizes an infinite temporal topic model, which obtains the topic distribution over time by placing a time prior when discovering topics dynamically. To better organize event trends, we present a second, progressive superposed topic model, which simulates the whole evolutionary process of topics, including the generation of new topics, the evolution of stable topics and the disappearance of old topics, via a series of superposed topic distributions generated by a hierarchical Dirichlet process. Both approaches aim at solving the real-world task while avoiding the Markov assumption and removing the limit on the number of topics. Meanwhile, we employ Wikipedia-based semantic background knowledge to improve the discovered topics and their readability. The experiments are carried out on a corpus of BBC news about the American Forum. The results demonstrate better organized topics, the evolutionary processes of topics over time, and the effectiveness of the models.

Paper Nr: 41
Title:

A Semi-supervised Learning Framework to Cluster Mixed Data Types

Authors:

Artur Abdullin and Olfa Nasraoui

Abstract: We propose a semi-supervised framework to handle diverse data formats or data with mixed-type attributes. Our preliminary results in clustering data with mixed numerical and categorical attributes show that the proposed semi-supervised framework gives better clustering results in the categorical domain: the seeds obtained from clustering the numerical domain provide additional knowledge to the categorical clustering algorithm. Additional results show that our approach has the potential to outperform clustering either domain on its own, or clustering both domains after converting them to the same target domain.

Paper Nr: 42
Title:

User based Collaborative Filtering with Temporal Information for Purchase Data

Authors:

Maunendra Sankar Desarkar and Sudeshna Sarkar

Abstract: User-based collaborative filtering algorithms are widely used for generating recommendations for users. Standard user-based collaborative filtering algorithms do not consider time as a factor when measuring user similarities and building the recommendation list. However, users’ interests often shift with time; recommender systems should therefore rely on recent purchases of the users to address these user dynamics. Items also have their own dynamics: most items in a recommender system are widely popular just after their release but do not sell as well afterwards. Giving more importance to the recent purchases of the experts may capture the item dynamics and hence result in better recommendation accuracy. We study the performance of different time-aware user-based collaborative filtering algorithms on several benchmark datasets. The proposed algorithms use time-of-purchase information for calculating user similarities. The time information is also used when combining the purchase behaviors of the experts and generating the final recommendation.
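One common way to make user similarities time-aware, sketched here purely as an illustration (the paper's actual weighting scheme may differ), is to discount each co-purchased item by the age of its purchase with an exponential decay:

```python
import math

def decayed_similarity(purch_u, purch_v, now, half_life):
    """Jaccard-style user similarity in which each co-purchased item is
    weighted by an exponential decay of its purchase recency.

    purch_u / purch_v map item -> purchase timestamp; half_life is the
    time after which an old purchase counts half as much.
    """
    lam = math.log(2) / half_life
    common = set(purch_u) & set(purch_v)
    if not common:
        return 0.0
    # weight each shared item by the more recent of the two purchases
    num = sum(math.exp(-lam * (now - max(purch_u[i], purch_v[i])))
              for i in common)
    return num / len(set(purch_u) | set(purch_v))

# Item "a" was bought recently by both users, so it dominates the score.
u = {"a": 90, "b": 10}
v = {"a": 95, "c": 50}
s = decayed_similarity(u, v, now=100, half_life=30)
```

With plain (non-temporal) Jaccard similarity the score would be 1/3 regardless of when the purchases happened; the decay shrinks it as the shared purchases age.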

Paper Nr: 46
Title:

Guided Exploratory Search on the Mobile Web

Authors:

Günter Neumann and Sven Schmeier

Abstract: We present a mobile touchable application for guided exploration of web content and online topic-graph extraction that has been successfully implemented on a tablet (an Apple iPad) and on a mobile phone (an Apple iPhone or iPod). Starting from a user’s search query, a set of web snippets is collected by a standard search engine in a first step. The snippets are then merged into one document from which the topic graph is computed. This topic graph is presented to the user in different touchable and interactive graphical representations depending on the screen size of the mobile device. However, due to possible semantic ambiguities in the search queries, the snippets may cover different thematic areas, so the topic graph may contain associated topics for different semantic entities of the original query. This may lead the user in wrong directions while exploring the solution space. Hence, we present an approach for interactive disambiguation of the search query, providing assistance for users towards a guided exploratory search.

Paper Nr: 55
Title:

Probabilistic Sequence Modeling for Recommender Systems

Authors:

Nicola Barbieri, Antonio Bevacqua, Marco Carnuccio, Giuseppe Manco and Ettore Ritacco

Abstract: Probabilistic topic models are widely used in different contexts to uncover the hidden structure in large text corpora. One of the main features of these models is that the generative process follows a bag-of-words assumption, i.e., each token is independent of the previous one. We extend the popular Latent Dirichlet Allocation model by exploiting a conditional Markovian assumption, where token generation depends on the current topic and on the previous token. The resulting model can accommodate temporal correlations among tokens, which better model user behavior. This is particularly significant in a collaborative filtering context, where the choices of a user can be exploited for recommendation purposes, and hence more realistic and accurate modeling enables better recommendations. For this model we present a fast Gibbs sampling procedure for parameter estimation. A thorough experimental evaluation over real-world data shows the performance advantages, in terms of recall and precision, of the proposed sequence-modeling approach.

Paper Nr: 58
Title:

A Unified Approach for Context-sensitive Recommendations

Authors:

Mihaela Dinsoreanu, Florin Cristian Macicasan, Octavian Lucian Hasna and Rodica Potolea

Abstract: We propose a model capable of providing context-sensitive content based on the similarity between an analysed context and the recommended content. It relies on the underlying thematic structure of the context, obtained by means of lexical and semantic analysis. For the context, we analyse both static characteristics and dynamic evolution. The model has a high degree of generality, considering the whole range of possible recommendations (content) and selecting the one that best fits the current context. Based on the model, we have implemented a system dedicated to contextual advertisements, for which the content is the ad while the context is represented by a web page visited by a given user. The dynamic component refers to changes in the user’s interest over time. From all the composite criteria the system could use for assessing the quality of the results, we have considered relevance and diversity. The design of the model and its ensemble underline our original view of the problem. From a conceptual point of view, the unified thematic model and its category-based organization, together with the implementation, are original contributions.

Paper Nr: 66
Title:

String Searching in Referentially Compressed Genomes

Authors:

Sebastian Wandelt and Ulf Leser

Abstract: Background: Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained a lot of interest in this field. However, so far sequences always have to be decompressed prior to analysis; there is a need for algorithms that work directly on compressed data, avoiding costly decompression. Summary: In our work, we address this problem by proposing an algorithm for exact string search over compressed data. The algorithm works directly on referentially compressed genome sequences, without needing an index for each genome and using only partial decompression. Results: Our string-search algorithm for referentially compressed genomes performs exact string matching on large sets of genomes faster than using an index structure (e.g., suffix trees) for each genome, especially for short queries. We consider this an important step towards space- and runtime-efficient management of large biological datasets.
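To make the idea concrete, here is a toy sketch (not the authors' algorithm) of referential compression as (reference-position, match-length, next-character) triples, together with a search that materializes only a rolling buffer of query length instead of decompressing the full sequence:

```python
def ref_compress(reference, target):
    """Toy referential compression: greedily encode target as
    (ref_pos, length, next_char) triples against the reference.
    Real schemes find longest matches with suffix structures; this
    naive scan is only for illustration."""
    entries, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for p in range(len(reference)):
            l = 0
            while (p + l < len(reference) and i + l < len(target)
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        nxt = target[i + best_len] if i + best_len < len(target) else ""
        entries.append((best_pos, best_len, nxt))
        i += best_len + 1
    return entries

def stream_search(reference, entries, query):
    """Find all occurrences of query while keeping only a rolling buffer
    of len(query)-1 trailing characters (partial decompression)."""
    buf, offset, hits = "", 0, []

    def feed(chunk):
        nonlocal buf, offset
        buf += chunk
        start = 0
        while True:
            j = buf.find(query, start)
            if j < 0:
                break
            hits.append(offset + j)
            start = j + 1
        if len(buf) > len(query) - 1:          # drop text that can no
            drop = len(buf) - (len(query) - 1)  # longer start a match
            offset += drop
            buf = buf[drop:]

    for pos, length, nxt in entries:
        feed(reference[pos:pos + length] + nxt)
    return hits

ref = "ACGTACGTGG"
tgt = "ACGTTCGTGGACGT"
comp = ref_compress(ref, tgt)
hits = stream_search(ref, comp, "CGTGG")
```

The buffer never grows beyond one query length per fed chunk, which is the essence of searching without full decompression.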

Paper Nr: 72
Title:

Robust Template Identification of Scanned Documents

Authors:

Xiaofan Feng, Abdou Youssef and Sithu Sudarsan

Abstract: Identification of low-quality scanned documents is not trivial in real-world settings. Existing research, mainly focusing on similarity-based approaches, relies on perfect string data from a document. Likewise, studies using image-processing techniques for document identification rely on clean data and large differences among templates. Both approaches fail to maintain accuracy with noisy data or when document templates are too similar to each other. In this paper, a probabilistic approach is proposed to identify the document template of scanned documents. The proposed algorithm works on imperfect OCR output and on document collections containing very similar templates. Through experiments and analysis, this novel probabilistic approach is shown to achieve high accuracy on different data sets.

Paper Nr: 81
Title:

Formal Concept Analysis for the Interpretation of Relational Learning Applied on 3D Protein-binding Sites

Authors:

Emmanuel Bresso, Renaud Grisoni, Marie-Dominique Devignes, Amedeo Napoli and Malika Smaïl-Tabbone

Abstract: Inductive Logic Programming (ILP) is a powerful learning method which allows an expressive representation of the data and produces explicit knowledge in the form of a theory, i.e., a set of first-order logic rules. However, ILP systems suffer from a drawback: they return a single theory based on heuristic user choices of various parameters, thus ignoring potentially relevant rules. Accordingly, we propose an original approach based on Formal Concept Analysis for effective interpretation of the obtained theories, with the possibility of adding domain knowledge. Our approach is applied to the characterization of three-dimensional (3D) protein-binding sites, the protein portions on which interactions with other proteins take place. In this context, we define a relational and logical representation of 3D patches and formalize the problem as a concept-learning problem using ILP. We report the results we obtained on a particular category of protein-binding sites, namely phosphorylation sites, using ILP followed by FCA-based interpretation.

Short Papers
Paper Nr: 5
Title:

Extracting Structure, Text and Entities from PDF Documents of the Portuguese Legislation

Authors:

Nuno Moniz and Fátima Rodrigues

Abstract: This paper presents an approach for text processing of PDF documents with a well-defined layout structure. The approach explores the font structure of PDF documents using perceptual grouping. It consists of extracting text objects from the content stream of the documents and grouping them according to a set of criteria, also making use of geometry-based regions in order to achieve the correct reading order. The developed approach processes the PDF documents using logical and structural rules to extract the entities present in them, and returns an optimized XML representation of the PDF document, useful for re-use, for example in text categorization. The system was trained and tested with Portuguese legislation PDF documents extracted from the electronic Republic’s Diary. Evaluation shows that our approach yields good results.

Paper Nr: 6
Title:

Comparative Study of Collaborative Filtering Algorithms

Authors:

Bharat Bhasker

Abstract: This paper compares the performance of the entropy-based collaborative filtering (EBCF) algorithm under two different circumstances, viz. entropy-based user-user CF and entropy-based item-item CF. The entropy-based user-user CF algorithm is first modified to entropy-based item-item CF to make use of the relatively static relationships between items. In addition to exploiting the predictable portions of even some complex relationships between items while selecting the mentors for the target item, the entropy-based item-item CF incurs less online computation. Both algorithms have been tested using the MovieLens dataset. The experimental results show that the entropy-based item-item CF algorithm provides better recommendation quality and accuracy than entropy-based user-user CF, and achieves comparable coverage. Therefore, the entropy-based item-item CF algorithm can be used in e-commerce applications where accuracy is more important than coverage.
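The entropy-based similarity idea can be sketched as follows: the entropy of the rating-difference distribution between two items measures how predictably one item's rating follows the other's, even when the relationship is not linear. This is an illustrative reconstruction, not the paper's exact formulation; the items and ratings are invented:

```python
import math
from collections import Counter

def diff_entropy(ratings_i, ratings_j):
    """Entropy of the rating-difference distribution between two items,
    over users who rated both. Low entropy = predictable relationship,
    so item j is a good 'mentor' for predicting item i."""
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return float("inf")  # no co-raters: useless as a mentor
    diffs = Counter(ratings_i[u] - ratings_j[u] for u in common)
    n = len(common)
    return -sum(c / n * math.log2(c / n) for c in diffs.values())

# item_a and item_b always differ by exactly 1 star -> entropy 0;
# item_a and item_c differ unpredictably -> higher entropy.
item_a = {"u1": 5, "u2": 4, "u3": 3}
item_b = {"u1": 4, "u2": 3, "u3": 2}
item_c = {"u1": 5, "u2": 1, "u3": 3}
```

A correlation-based measure would treat a constant offset and a noisy relationship very differently from this entropy view, which is the point of the entropy-based family of algorithms.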

Paper Nr: 7
Title:

The Difficulty of Path Traversal in Information Networks

Authors:

Frank W. Takes and Walter A. Kosters

Abstract: This paper introduces a set of classification techniques for determining the difficulty — for a human — of path traversal in an information network. In order to ensure the generalizability of our approach, we do not use ontologies or concepts of expected semantic relatedness, but rather focus on local and global structural graph properties and measures to determine the difficulty of finding a certain path. Using a large corpus of over two million traversed paths on Wikipedia, we demonstrate how our techniques are able to accurately assess the human difficulty of finding a path between two articles within an information network.

Paper Nr: 10
Title:

Unsupervised Discovery of Significant Candlestick Patterns for Forecasting Security Price Movements

Authors:

Karsten Martiny

Abstract: Candlestick charts are a visually appealing method of presenting the price movements of securities, developed in Japan centuries ago. The depiction of movements as candlesticks tends to exhibit recognizable patterns that allow for predicting future price movements. Common approaches to employing candlestick analysis in automatic systems rely on a manual a-priori specification of well-known patterns and infer prognoses upon detection of such a pattern in the input data. A major drawback of this approach is that the performance of such a system is limited by the quality and quantity of the predefined patterns. This paper describes a novel method of automatically discovering significant candlestick patterns from a time series of price data, thereby enabling an unsupervised machine-learning approach to predicting future price movements.
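A minimal sketch of the kind of preprocessing such a system needs, under invented thresholds: discretize each OHLC candle into a coarse symbol, then count consecutive symbol patterns whose frequencies could subsequently be tested for significance (the discovery and significance testing themselves are the paper's contribution and are not reproduced here):

```python
from collections import Counter

def candle_symbol(o, h, l, c, body_thr=0.1):
    """Map one OHLC candle to a coarse symbol: direction of the body,
    or 'doji' when the body is small relative to the full range.
    body_thr is an invented threshold for illustration."""
    rng = h - l or 1e-9            # avoid division by zero on flat candles
    body = abs(c - o)
    if body / rng < body_thr:
        return "doji"
    return "up" if c > o else "down"

def pattern_counts(candles, length=2):
    """Count every consecutive symbol pattern of the given length."""
    syms = [candle_symbol(*c) for c in candles]
    return Counter(tuple(syms[i:i + length])
                   for i in range(len(syms) - length + 1))

# Four invented candles: two up moves, an indecisive doji, a sell-off.
candles = [(10, 11, 9, 10.9), (10.9, 11.5, 10.8, 11.4),
           (11.4, 11.5, 11.3, 11.41), (11.4, 11.5, 10.0, 10.1)]
counts = pattern_counts(candles)
```

Once price data is reduced to such symbol sequences, frequent-pattern statistics can be computed without any hand-specified pattern catalogue, which is the unsupervised angle the abstract describes.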

Paper Nr: 11
Title:

A New Query Suggestion Algorithm for Taxonomy-based Search Engines

Authors:

Roberto Zanon, Simone Albertini, Moreno Carullo and Ignazio Gallo

Abstract: The objective of this work is to realize an algorithm providing a query-suggestion feature to support the search engine of a commercial web site. Starting from web-server logs, our solution builds a model by analyzing the queries submitted by users. Given a submitted query, the system searches for the most adequate queries to suggest. Our method implements an already-known session-based proposal and enriches it in two ways: it exploits specific information available in the current context, namely the category the user is browsing on the web site, and it overcomes the limits of a purely session-based approach by also considering similarity between queries. Quantitative and qualitative experiments show that the proposed model is suitable in terms of resources employed and users’ degree of satisfaction.

Paper Nr: 13
Title:

Interactive Visualization of a News Clips Network - A Journalistic Research and Knowledge Discovery Tool

Authors:

José Devezas and Álvaro Figueira

Abstract: Interactive visualization systems are powerful tools in the task of exploring and understanding data. We describe two implementations of this approach, where a multidimensional network of news clips is depicted by taking advantage of its community structure. The first implementation is a multiresolution map of news clips that uses topic detection both at the clip level and at the community level, in order to assign labels to the nodes in each resolution. The second implementation is a traditional force-directed network visualization with several additional interactive aspects that provide a rich user experience for knowledge discovery. We describe a common use case for the visualization systems as a journalistic research and knowledge discovery tool. Both systems illustrate the links between news clips, induced by the co-occurrence of named entities, as well as several metadata fields based on the information contained within each node.

Paper Nr: 16
Title:

Contextual Approaches for Identification of Toponyms in Ancient Documents

Authors:

Hendrik Schöneberg and Frank Müller

Abstract: Performing Named Entity Recognition on ancient documents is a time-consuming, complex and error-prone manual task. It is, however, a prerequisite to identifying related documents and correlating named entities across distinct sources, helping to precisely recreate historic events. To reduce the manual effort, automated classification approaches can be leveraged. Classifying terms in ancient documents automatically is difficult due to the sources’ challenging syntax and poor states of conservation. This paper introduces and evaluates two approaches that cope with complex syntactic environments by using statistical information derived from a term’s context and combining it with domain-specific heuristic knowledge to perform classification. Furthermore, these approaches can easily be adapted to new domains.

Paper Nr: 19
Title:

Lanna Dharma Printed Character Recognition using k-Nearest Neighbor and Conditional Random Fields

Authors:

Chutima Chueaphun, Atcharin Klomsae, Sanparith Marukatat and Jeerayut Chaijaruwanich

Abstract: For centuries, many books in Lanna Dharma characters have been printed in the North of Thailand. These books are important sources of the knowledge of ancient Lanna wisdom. At present, the books are old and damaged, and most characters are rough and unclear owing to the early printing technology of the time. Moreover, some sets of characters are very similar to one another, which makes them difficult to recognize. This paper proposes a Lanna Dharma printed-character recognition technique using k-Nearest Neighbor and Conditional Random Fields. The recognition accuracy is about 82.61 percent.
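The k-Nearest Neighbor half of such a pipeline can be sketched generically (the actual feature extraction and the CRF stage are beyond a toy example; the 2-D "glyph features" and class names below are invented):

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """Classify feature vector x by majority vote among its k nearest
    training vectors under Euclidean distance.

    train is a list of (vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda vl: math.dist(vl[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D features for two visually similar character classes.
train = [((0.0, 0.0), "ka"), ((0.1, 0.2), "ka"), ((0.9, 1.0), "kha"),
         ((1.0, 0.9), "kha"), ((1.1, 1.1), "kha")]
label = knn_classify(train, (0.05, 0.1), k=3)
```

In a recognition system like the one described, the per-character k-NN decision would then be refined by a CRF over the character sequence, which can correct isolated misclassifications using neighboring context.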

Paper Nr: 21
Title:

Factor Analysis and the Retrieval of Medical Images Depicting Structures with Similar Shapes

Authors:

Alexei Manso Correa Machado

Abstract: This work presents a new perspective on medical image retrieval based on factor analysis. The shapes of anatomical structures are represented as high-dimensional sets of vector variables obtained by non-rigidly deforming a template image so as to align its anatomy with the anatomy of each subject in a group. By eliminating the redundancy embedded in the data, a reduced set of factors is determined, corresponding to new variables with possible anatomic significance. The method’s ability to retrieve relevant images is exemplified in a study of the corpus callosum, a structure with very subtle shape differences. The factor analysis approach is compared to principal component analysis in a set of 960 experiments, yielding significantly higher precision rates.

Paper Nr: 23
Title:

Search Result Summaries Improved by Structure and Multimedia

Authors:

Brent Wenerstrom and Mehmed Kantardzic

Abstract: We previously introduced ReClose, which provides summaries with both better content and better visual display for search-engine results. We now seek to further improve summaries with the addition of structured text and multimedia, more specifically tables, lists, buttons and images. Currently, summaries provided by search engines rarely use structured text and images. We show in this paper that structured text and images lead to faster comprehension by search engine users and to visually more appealing summaries. 70% of non-expert users made decisions more quickly using summaries preserving document structure, and 65% of all users preferred summaries preserving structure to plain-text summaries.

Paper Nr: 31
Title:

Adaptation of the User Navigation Scheme using Clustering and Frequent Pattern Mining Techniques for Profiling

Authors:

Olatz Arbelaitz, Ibai Gurrutxaga, Aizea Lojo, Javier Muguerza, Jesús M. Pérez and Iñigo Perona

Abstract: There is a need to facilitate access to the required information on the web and to adapt it to users’ preferences and requirements. This paper presents a system that, based on a collaborative filtering approach, adapts the web site to improve the browsing experience of the user: it automatically generates interesting links for new users. The system only uses the web log files stored in any web server (common log format) and builds user profiles from them, combining machine learning techniques with a generalization process for data representation. These profiles are later used in an exploitation stage to automatically propose links to new users. The paper examines the effect of the parameters of the system on its final performance. Experiments show that the designed system performs efficiently on a database accessible from the web, that the use of a generalization process, specificity in profiles and frequent pattern mining techniques benefit the profile generation phase, and, moreover, that diversity seems to help in the exploitation phase.

Paper Nr: 32
Title:

Collaborative Filtering based on Sentiment Analysis of Guest Reviews for Hotel Recommendation

Authors:

Fumiyo Fukumoto, Chihiro Motegi and Suguru Matsuyoshi

Abstract: Collaborative filtering (CF) identifies the preference of a consumer/guest for a new product/hotel by using only the information collected from other consumers/guests with similar products/hotels in the database. It has been widely used as a filtering technique because no complicated content analysis is necessary. However, it is difficult to take users’ criteria into account. Some item-based collaborative filtering approaches take users’ preferences or votes for items into account; one problem of these approaches is data sparseness, as users do not tag all the items with their preferences. In this paper, we propose a new recommendation method incorporating the results of sentiment analysis of guest reviews. The results obtained by our method on real-world data sets demonstrate a performance improvement over the four baselines.

Paper Nr: 36
Title:

Friendship Prediction using Semi-supervised Learning of Latent Features in Smartphone Usage Data

Authors:

Yuka Ikebe, Masaji Katagiri and Haruo Takemura

Abstract: This paper describes a semi-supervised learning method that uses smartphone usage data to identify friendship in the one-class setting. The method is based on the assumption that friends share some interests and that their smartphone usage reflects this. The authors combine a supervised link prediction method with matrix factorization, which incorporates latent features acquired from application usage and Internet access. The latent features are optimized jointly with the link prediction process. Moreover, the method employs the sigmoidal function to estimate user affinities from the polarized latent user features. To validate the method, fifty university students volunteered to have their smartphone usage monitored for 6 months. The results of this empirical study show that the proposed method offers higher friendship prediction accuracy than state-of-the-art link prediction methods.

Paper Nr: 38
Title:

Some Empirical Evaluations of a Temperature Forecasting Module based on Artificial Neural Networks for a Domotic Home Environment

Authors:

F. Zamora-Martinez, P. Romeu, J. Pardo and D. Tormo

Abstract: This work presents the empirical evaluation of an indoor temperature prediction module which is integrated in an ambient intelligence control software. This software runs in the SMLhouse, a domotic house built by our university. A study of the impact of the future window size on prediction error has been performed. We use Artificial Neural Network models for multi-step-ahead direct forecasting, using output sizes of 60, 120, and 180. Interesting results have been obtained: in the worst case, a Mean Absolute Error of 0.223ºC over a validation set and 0.566ºC over a hard unseen test set. These results motivate the development of an automatic controller built on these predictions, which could manage the climate system in order to enhance the comfort and energy efficiency of the house.
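The direct multi-step forecasting strategy mentioned above (one predictor per step ahead, rather than feeding one-step predictions back recursively) can be illustrated with a deliberately trivial per-horizon model; the paper uses Artificial Neural Networks instead of the average-change "model" below, and the temperature series is invented:

```python
def fit_direct(series, horizon):
    """Direct multi-step forecasting: fit one model per step ahead.
    Each 'model' here is just the average change after h steps; real
    systems (such as the paper's ANNs) learn far richer mappings."""
    offsets = []
    for h in range(1, horizon + 1):
        deltas = [series[t + h] - series[t] for t in range(len(series) - h)]
        offsets.append(sum(deltas) / len(deltas))
    return offsets

def forecast(last_value, offsets):
    """Apply each per-horizon model independently of the others."""
    return [last_value + o for o in offsets]

def mae(pred, actual):
    """Mean Absolute Error, the metric reported in the abstract."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

history = [20.0, 20.5, 21.0, 21.5, 22.0, 22.5]   # steadily warming room
offsets = fit_direct(history, horizon=3)
pred = forecast(history[-1], offsets)
```

Because each horizon has its own model, errors do not compound step by step as they would with recursive one-step-ahead forecasting, which is the usual motivation for the direct strategy at output sizes as large as 60 to 180 steps.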

Paper Nr: 43
Title:

Constraint-programming Approach for Multiset and Sequence Mining

Authors:

Pablo Gay, Beatriz López and Joaquim Meléndez

Abstract: Constraint-based data mining is a field that has recently started to receive more attention. Describing a problem through a declarative model enables very descriptive and easily extended implementations. Our work takes a previous itemset-mining model and extends it with the capability to discover different and interesting patterns that have not yet been explored: multisets and sequences. The classic example domain is retail, where one mines the most common combinations of items bought together. Multisets allow mining not only these itemsets but also the quantities of each item, while sequences capture the order in which the items are purchased. In this paper, we provide the background of the original work and describe the modifications made to the model to support these new patterns. We also test the new models on real-world data to prove their feasibility.

Paper Nr: 45
Title:

Are Related Links Effective for Contextual Advertising? - A Preliminary Study

Authors:

Giuliano Armano, Alessandro Giuliani and Eloisa Vargiu

Abstract: Classical contextual advertising systems suggest suitable ads for a given webpage by analyzing only its content, without relying on further information. We claim that adding information extracted from semantically related pages can improve overall performance. To this end, this paper proposes an experimental study aimed at verifying to what extent the analysis of related links, i.e., inlinks and outlinks, can help contextual advertising. Experiments have been performed on about 15000 webpages extracted from DMoz. Results show that the adoption of related links significantly improves the performance of the baseline system.

Paper Nr: 47
Title:

An Efficient Strategy for Spatio-temporal Data Indexing and Retrieval

Authors:

Antonio d'Acierno, Marco Leone, Alessia Saggese and Mario Vento

Abstract: Trajectories of moving people and objects extracted from video sequences are increasingly assuming a key role in detecting anomalous events and characterizing human behaviors. Among the related issues is the need to efficiently store a huge number of 3D trajectories, together with retrieval techniques fast enough to allow real-time extraction of trajectories satisfying spatio-temporal requirements. Unfortunately, while well-established solutions exist for 2D trajectories, the theoretical solutions proposed for 3D ones are not widely available in commercial and free spatially enabled DBMSs; this paper thus presents a novel method for extending available 2D indexes to 3D data. In particular, starting from a redundant bi-dimensional indexing scheme recently introduced in (d’Acierno et al., 2011), we propose a new retrieval system that, while still using off-the-shelf solutions, avoids almost any redundancy in the data to be handled; both the spatial complexity and the retrieval efficiency for time-interval queries have been significantly improved.

Paper Nr: 50
Title:

Parsing and Maintaining Bibliographic References - Semi-supervised Learning of Conditional Random Fields with Constraints

Authors:

Sebastian Lindner and Winfried Höhn

Abstract: This paper presents key components of our workflow for handling bibliographic information. We compare several approaches to parsing bibliographic references using conditional random fields (CRFs), concentrating on cases where only a few labeled training instances are available. To obtain better labeling results, prior knowledge about the bibliography domain is used when training CRFs with different constraint models. We show that our labeling approach achieves results comparable to, and even better than, other state-of-the-art approaches. We then point out how, for about half of our reference strings, a correlation between journal title, volume and publishing year can be used to identify the correct journal even when journal-title abbreviations are ambiguous.

Paper Nr: 52
Title:

Exploiting Social Networks for Publication Venue Recommendations

Authors:

Hiep Luong, Tin Huynh, Susan Gauch and Kiem Hoang

Abstract: The impact of a publication venue is a major consideration for researchers and scholars when they are deciding where to publish their research results. Selecting the right conference or journal to which to submit a new paper minimizes the risk of wasting a long review time on a paper that is ultimately rejected. This task also helps to recommend appropriate conference venues of which authors may not be aware or to which colleagues often submit their papers. Traditional approaches to scientific publication recommendation using content-based analysis have shown drawbacks due to mismatches caused by ambiguity in text comparisons, and there is also much more to selecting an appropriate venue than topical matching. In our work, we take advantage of actual and interactive relationships within the academic community, as indicated by co-authorship, paper reviewing or event co-organizing activities, to support the venue recommendation process. Specifically, we present a new social network-based approach that automatically finds appropriate publication venues for an author’s research paper by exploring the network of related co-authors and other researchers in the same field. We also recommend appropriate publication venues to a specific user based on her relation with program committee research activities and with others in her network who have similar paper submission preferences. This paper also presents the more accurate and promising results of our social network-based approach in comparison with the baseline content-based approach. Our experiment, empirically tested over a large set of scientific papers published in 16 different ACM conferences, showed that analysing an academic social network is useful for a variety of recommendation tasks, including publication trends, expert finding, and research collaborations.

Paper Nr: 56
Title:

Automatically Extracting Complex Data Structures from the Web

Authors:

Laura Fontán, Rafael López-García, Manuel Álvarez and Alberto Pan

Abstract: This paper presents a new technique for detecting and extracting lists of structured records from Web pages. In contrast to most state-of-the-art systems, our approach is capable of detecting nested data structures (sublists), and it also incorporates heuristics to remove unwanted content, such as banners and navigation menus, from the data region. This article also describes the experiments we have performed to validate the system. The precision and recall obtained in our tests surpass 90%.

Paper Nr: 61
Title:

A Game-Theoretic Framework to Identify Top-K Teams in Social Networks

Authors:

Maryam Sorkhi, Hamidreza Alvari, Sattar Hashemi and Ali Hamzeh

Abstract: Discovering teams of experts in social networks has been receiving increasing attention recently. Such teams are often formed when a given task should be accomplished through the collaboration and communication of a small number of connected experts, with minimum communication cost. In this study we propose a game-theoretic framework to find the top-k teams satisfying these conditions. The importance of finding the top-k teams becomes apparent when the experts of the best discovered team have no incentive to work together for some reason, and we must therefore fall back on the next teams found. Finally, the local Nash equilibrium corresponding to the game is reached when all of the teams are formed. Experimental results on the DBLP co-authorship graph show the effectiveness and efficiency of the proposed method.

Paper Nr: 83
Title:

BINGR: Binary Search based Gaussian Regression

Authors:

Harshit Dubey, Saket Bharambe and Vikram Pudi

Abstract: Regression is the study of the functional dependency of one variable on other variables. In this paper we propose a novel regression algorithm, BINGR, for predicting a dependent variable, with the advantage of low computational complexity. The algorithm is interesting because, instead of directly predicting the value of the response variable, it recursively narrows down the range in which the response variable lies. BINGR reduces the computational order to logarithmic, which is much better than that of existing standard algorithms. As BINGR is parameterless, it can be employed by any naive user. Our experimental study shows that our technique is as accurate as the state of the art, and faster by an order of magnitude.
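
The abstract only sketches the idea; as a rough, hypothetical illustration (not the authors' actual algorithm), a binary-search-style regressor can repeatedly halve the response range and keep the half whose training points lie closer to the query:

```python
import numpy as np

def bingr_predict(X, y, query, tol=1e-3):
    """Hypothetical sketch of binary-search regression: halve the
    response range until it is narrower than tol, keeping the half
    whose training points are on average closer to the query."""
    lo, hi = float(np.min(y)), float(np.max(y))
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lower, upper = y <= mid, y > mid
        if not lower.any():          # nothing below mid: narrow upwards
            lo = mid
            continue
        if not upper.any():          # nothing above mid: narrow downwards
            hi = mid
            continue
        d_low = np.linalg.norm(X[lower] - query, axis=1).mean()
        d_up = np.linalg.norm(X[upper] - query, axis=1).mean()
        if d_low <= d_up:            # query resembles the lower half
            hi, X, y = mid, X[lower], y[lower]
        else:                        # query resembles the upper half
            lo, X, y = mid, X[upper], y[upper]
    return (lo + hi) / 2.0
```

Each iteration halves the candidate range, so the number of steps is logarithmic in (max(y) - min(y)) / tol, consistent with the complexity claim in the abstract.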

Paper Nr: 85
Title:

An Order-invariant Time Series Distance Measure - Position on Recent Developments in Time Series Analysis

Authors:

Stephan Spiegel and Sahin Albayrak

Abstract: Although there has been substantial progress in time series analysis in recent years, time series distance measures remain a topic of interest with a lot of potential for improvement. In this paper we introduce a novel Order Invariant Distance measure which is able to determine the (dis)similarity of time series that exhibit similar sub-sequences at arbitrary positions. Additionally, we demonstrate the practicality of the proposed measure on a sample data set of synthetic time series with artificially implanted patterns, and discuss the implications for real-life data mining applications.

Paper Nr: 88
Title:

New Directions in the Analysis of Social Network Dynamics

Authors:

Shahadat Uddin, Simon Reay Atkinson and Liaquat Hossain

Abstract: A significant amount of research effort in the present literature is devoted to analysing, modelling, and capturing network dynamics. Following a topological approach (i.e., static topology and dynamic topology), we propose a research framework to analyse, model, and capture the evolutionary dynamics of networks. In static topology, SNA methods are applied to the network aggregated over the entire data collection period. In dynamic topology, smaller segments of network data, accumulated over less time than the entire data collection period, are used for analysis. This paper also argues for the relevance and applicability of our proposed actor-level approach by reviewing the methods in the present literature for analysing network dynamics. Finally, this paper briefly describes future research directions in line with our proposed actor-level approach to model, analyse, and capture social network dynamics.

Paper Nr: 93
Title:

Undermining - Social Engineering using Open Source Intelligence Gathering

Authors:

Leslie Ball, Gavin Ewan and Natalie Coull

Abstract: Digital deposits are undergoing exponential growth. These may in turn be exploited to support cyber security initiatives through open source intelligence gathering. Open source intelligence itself is a double-edged sword, as the data may be harnessed not only by intelligence services to counter cyber-crime and terrorist activity but also by the perpetrators of criminal activity, who use them to socially engineer online activity and undermine their victims. Our preliminary case study shows how the security of any company can be surreptitiously compromised by covertly gathering the open source personal data of the company’s employees and exploiting these in a cyber attack. Our method uses tools that can search, drill down and visualise open source intelligence structurally. It then exploits these data to organise creative spear phishing attacks on the unsuspecting victims, who unknowingly activate the malware necessary to compromise the company’s computer systems. The entire process is the covert and virtual equivalent of overtly stealing someone’s password ‘over the shoulder’. A more sophisticated development of this case study will provide a seamless sequence of interoperable computing processes from the initial gathering of employee names to the successful penetration of security measures.

Paper Nr: 96
Title:

Polytope Model for Extractive Summarization

Authors:

Marina Litvak and Natalia Vanetik

Abstract: The problem of text summarization for a collection of documents is defined as the problem of selecting a small subset of sentences so that the contents and meaning of the original document set are preserved in the best possible way. In this paper we present a linear model for the problem of text summarization, where we strive to obtain a summary that preserves the information coverage as much as possible in comparison to the original document set. We construct a system of linear inequalities that describes the given document set and its possible summaries and translate the problem of finding the best summary to the problem of finding the point on a convex polytope closest to the given hyperplane. This re-formulated problem can be solved efficiently with the help of quadratic programming.
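
As a rough, hedged illustration of the kind of formulation the abstract describes (the authors' actual system of inequalities is not reproduced here), the reformulated problem has the shape of a quadratic program: among all points of the polytope of admissible summaries, find the one closest to the given hyperplane.

```latex
% P = \{x : Ax \le b,\ 0 \le x \le 1\}: the document set and its summaries
% H = \{x : c^\top x = d\}: the given hyperplane
\min_{x}\; \left(c^\top x - d\right)^2
\quad \text{s.t.} \quad Ax \le b, \quad 0 \le x \le 1
```

Because the objective is a convex quadratic and the constraints are linear, the minimiser (the polytope point nearest to H, up to the constant factor 1/||c||) can be found efficiently with any off-the-shelf quadratic-programming solver.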

Paper Nr: 97
Title:

Enhancing a Web Usage Mining based Tourism Website Adaptation with Content Information

Authors:

Olatz Arbelaitz, Ibai Gurrutxaga, Aizea Lojo, Javier Muguerza, Jesús M. Pérez and Iñigo Perona

Abstract: Websites are important tools for tourism destinations. Adapting websites to users’ preferences and requirements turns them into more effective tools. Using machine learning techniques to build user profiles allows us to take users' real preferences into account. This paper presents a first approach to a system that, based on collaborative filtering, adapts a tourism website to improve users' browsing experience: it automatically generates interesting links for new users. In this work we first build a system based only on the usage information stored in web log files (common log format) and then combine it with web content information to improve the performance of the system. The use of content information not only improves the results but also offers travel agents very useful information about users’ interests.

Paper Nr: 99
Title:

Evidence Accumulation Clustering using Pairwise Constraints

Authors:

João M. M. Duarte, Ana L. N. Fred and F. Jorge F. Duarte

Abstract: Recent work on constrained data clustering has shown that incorporating pairwise constraints, such as must-link and cannot-link constraints, increases the accuracy of single-run data clustering methods. It has also been shown that the quality of a consensus partition, resulting from the combination of multiple data partitions, is usually superior to the quality of the partitions produced by single-run clustering algorithms. In this paper we test the effectiveness of adding pairwise constraints to the Evidence Accumulation Clustering framework. For this purpose, a new soft-constrained hierarchical clustering algorithm is proposed and used for the extraction of the consensus partition from the co-association matrix. We also study whether there are advantages in selecting the must-link and cannot-link constraints on certain subsets of the data instead of selecting them at random over the entire data set. Experimental results on 7 synthetic and 7 real data sets show that the use of soft constraints improves the performance of Evidence Accumulation Clustering.
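
As a hedged illustration of the Evidence Accumulation framework the paper builds on (the proposed soft-constrained hierarchical algorithm itself is not reproduced here), the co-association matrix records how often each pair of objects is clustered together across the input partitions, and pairwise constraints can be folded into it; a hard version of the constraints is used below for simplicity:

```python
import numpy as np

def co_association(partitions, n):
    """Fraction of the input partitions in which each pair of the n
    objects ends up in the same cluster (the evidence matrix)."""
    C = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(partitions)

def constrain(C, must_link=(), cannot_link=()):
    """Fold pairwise constraints into the matrix; this hard version
    (entries forced to 1 or 0) stands in for the paper's soft variant."""
    C = C.copy()
    for i, j in must_link:
        C[i, j] = C[j, i] = 1.0
    for i, j in cannot_link:
        C[i, j] = C[j, i] = 0.0
    return C
```

A consensus partition can then be extracted by running a hierarchical clustering on 1 - C interpreted as a dissimilarity matrix.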

Paper Nr: 100
Title:

An Interests Discovery Approach in Social Networks based on a Semantically Enriched Bayesian Network Model

Authors:

Akram Al-Kouz and Sahin Albayrak

Abstract: Knowing the interests of users in Social Networking Systems is becoming essential for User Modeling. Interest discovery from users’ posts based on standard text classification techniques, such as Bag Of Words, fails to catch the implicit relations between terms. We propose an approach that automatically generates an ordered list of candidate topics of interest from the text of users’ posts. The approach generates terms and segments, enriches them semantically with world knowledge, and creates a Bayesian Network to model the syntactic and semantic relations. It then uses probabilistic inference to elect the candidate topics of interest with the highest posterior probability given the explicit and implicit features of the user’s posts as observed evidence. A preliminary evaluation has been conducted on a manually annotated data set of 40 Twitter users. The results show that our approach outperforms the Bag Of Words technique and that it is promising for effectively detecting the interests of users in Social Networking Systems.

Paper Nr: 102
Title:

Addressing the Problem of Unbalanced Data Sets in Sentiment Analysis

Authors:

Asmaa Mountassir, Houda Benbrahim and Ilham Berrada

Abstract: Sentiment Analysis is a research area whose studies focus on processing and analysing the opinions available on the web. This paper deals with the problem of unbalanced data sets in supervised sentiment classification. We propose three different methods to under-sample the majority-class documents, namely Remove Similar, Remove Farthest and Remove by Clustering. Our goal is to compare the effectiveness of the proposed methods with common random under-sampling. We use three standard classifiers: Naïve Bayes, Support Vector Machines and k-Nearest Neighbours. The experiments are carried out on two different Arabic data sets that we built and labelled manually. We show that the results obtained on the first data set, which is slightly skewed, are better than those obtained on the second, which is highly skewed. The results also show that we can rely on the proposed techniques and that they are typically competitive with random under-sampling.
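
The exact definitions of the three under-sampling methods are given in the paper; as a hypothetical sketch of the general idea behind a similarity-based strategy such as Remove Similar, one can repeatedly drop a member of the most similar pair of majority-class vectors (treating it as redundant) until the desired size is reached:

```python
import numpy as np

def undersample_remove_similar(X_maj, n_keep):
    """Illustrative reading of similarity-based under-sampling: while
    too many majority examples remain, find the most similar pair
    (cosine similarity) and drop one of its members as redundant."""
    X = X_maj / (np.linalg.norm(X_maj, axis=1, keepdims=True) + 1e-12)
    keep = list(range(len(X)))
    while len(keep) > n_keep:
        S = X[keep] @ X[keep].T          # pairwise cosine similarities
        np.fill_diagonal(S, -np.inf)     # ignore self-similarity
        i, _ = np.unravel_index(np.argmax(S), S.shape)
        keep.pop(i)                      # drop one member of the closest pair
    return np.array(keep)                # indices into X_maj to keep
```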

Posters
Paper Nr: 8
Title:

Predicting the Efficiency with Knowledge Discovery of a Budgeted Company: A Cuban University - Validation through Three Semesters

Authors:

Libia I. García, Isel Grau and Ricardo Grau

Abstract: The efficiency analysis of a company cannot be reduced to a great number of statistical tables, despite their reliability. It has been shown to be a better idea to seek the “essence” using Knowledge Discovery (KD) techniques. In this paper, a simple methodology for applying KD to the efficiency analysis of a budgeted company is presented. These analyses complement those from classical OLAP and interactive graphics. Specifically, it is shown how to use, in three steps, univariate analysis and unsupervised and supervised multivariate machine learning techniques in order to support decision making. All these procedures are illustrated using the SIGENU database (in Spanish: Sistema de Gestión de la Nueva Universidad) of UCLV students, and the efficiency measure was whether or not the student graduated on time. The presented methodology was elaborated in 2009 and has been preliminarily validated over three semesters of the years 2010 to 2012.

Paper Nr: 9
Title:

Product Assortment Decisions for a Network of Retail Stores using Data Mining with Optimization

Authors:

Sudip Bhattacharjee, Fidan Boylu and Ram Gopal

Abstract: This paper presents a model for product assortment optimization for a network of retail stores operating in various locations of a company. Driven by the local market information of each retail store, the model determines the right products to include in a store’s assortment and which stores to ship from in the store network. The model first learns the global patterns of the frequent itemsets based on association rule mining to extract patterns of products with corresponding sales benefits. It then encodes the pattern information into the development of a global optimization formulation, which maximizes the revenue of the company in aggregate and identifies the optimal solution for each local store by taking into account the possibility of shipments in the network. We use the transactional level data from an industry leading plastics manufacturer and retailer in the United States to demonstrate the utility of the model.

Paper Nr: 24
Title:

Frequent and Significant Episodes in Sequences of Events - Computation of a New Frequency Measure based on Individual Occurrences of the Events

Authors:

Oscar Quiroga, Joaquim Meléndez and Sergio Herraiz

Abstract: Pattern discovery in event sequences is based on the mining of frequent episodes. Patterns are the result of assessing frequent episodes using episode rules. However, a simple search usually finds a huge number of frequent episodes and rules, so methods to recognise the most significant patterns and to properly measure the frequency of the episodes are required. In this paper, two new indexes, called cohesion and backward-confidence of the episodes, are proposed to help in the extraction of significant patterns. Also, two methods to find the maximal number of non-redundant occurrences of serial and parallel episodes are presented. Experimental results demonstrate the compactness of the mining result and the efficiency of our mining algorithms.

Paper Nr: 44
Title:

Detecting Temporally Related Arithmetical Patterns - An Extension of Complex Event Processing

Authors:

Ronald de Haan and Mikhail Roshchin

Abstract: When modelling diagnostic knowledge about technical systems, it is often important to be able to capture complex events with a temporal structure, based on arithmetical patterns in (preprocessed) sensor data. With current methods, such a combination is not easily possible. To solve this problem, we devise an extension of complex event processing methods by designing a declarative language for specifying the generation of events with a complex temporal structure based on arithmetical patterns in numerical data. This extension furthermore makes complex event processing methods more accessible to users who have no experience with complex event processing.

Paper Nr: 51
Title:

Using Neighborhood Pre-computation to Increase Recommendation Efficiency

Authors:

Vreixo Formoso, Diego Fernández, Fidel Cacheda and Victor Carneiro

Abstract: Collaborative filtering is a very popular recommendation technique. Among the different approaches, the k-Nearest Neighbors algorithm stands out for its simplicity and its good, explainable results. This algorithm bases its recommendations for a given user on the opinions of similar users. Selecting those similar users is thus an important step in the recommendation, known as neighborhood selection. In real applications with millions of users and items, this step can be a serious performance bottleneck because of the huge number of operations needed. In this paper we study the possibility of pre-computing the neighbors in an offline step in order to increase recommendation efficiency. We show how neighborhood pre-computation reduces the recommendation time by two orders of magnitude without a significant impact on recommendation precision.
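
The pre-computation idea can be sketched as follows (a minimal illustration assuming a dense user-item rating matrix with 0 meaning "unrated"; the paper's actual system and similarity measure may differ):

```python
import numpy as np

def precompute_neighbors(R, k):
    """Offline step: for every user, cache the k most similar users
    (cosine similarity over the users-by-items rating matrix R)."""
    U = R / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-12)
    S = U @ U.T
    np.fill_diagonal(S, -np.inf)          # a user is not its own neighbour
    return np.argsort(-S, axis=1)[:, :k]  # k nearest users per user

def predict(R, neighbors, user, item):
    """Online step: average the cached neighbours' ratings for the item,
    ignoring neighbours who have not rated it (rating 0)."""
    ratings = R[neighbors[user], item]
    rated = ratings > 0
    return ratings[rated].mean() if rated.any() else 0.0
```

The expensive pairwise-similarity computation is paid once offline; each online recommendation then touches only the k cached rows per user, which is consistent with the kind of speed-up the abstract reports.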

Paper Nr: 54
Title:

A Syntax-oriented Event Extraction Approach

Authors:

Sebastian Fleissner and Alex Chengyu Fang

Abstract: This paper proposes a domain-independent, syntax-oriented event extraction approach, which draws on a rich syntactic description of both the syntactic categories and the clausal functions of the input text. The approach is applied to the BioNLP Shared Task 2011 dataset and evaluated using the datasets supplied with the GENIA, Epi-PTM, Bacteria-Interactions, and Infectious Diseases tasks. The evaluation shows an overall F-score of 47.41% for four event extraction tasks, comparable to other domain-specific approaches. It is significant that the promising results are achieved on the basis of syntactic information alone, without domain-specific knowledge, suggesting the benefit of combining syntactically oriented descriptions about the surface structure and the use of domain-specific configuration for the challenging task of natural language understanding.

Paper Nr: 59
Title:

Issues of Optimization of a Genetic Algorithm for Traffic Network Division using a Genetic Algorithm

Authors:

Tomas Potuzak

Abstract: In this paper, we describe an approach to optimizing the genetic algorithm for road traffic network division that we have developed. The division of a road traffic network is necessary to enable road traffic simulation to be performed in a distributed computing environment. The optimization approach is based on a genetic algorithm, which is employed to find the best settings of the optimized genetic algorithm for road traffic network division. Because such an optimizing genetic algorithm is expected to be extremely computation- and time-consuming, its distributed implementation is discussed as well.

Paper Nr: 62
Title:

On the Effectiveness and Optimization of Information Retrieval for Cross Media Content

Authors:

Pierfrancesco Bellini, Daniele Cenni and Paolo Nesi

Abstract: In recent years, the growth of Social Network communities has posed new challenges for content providers and distributors. Digital contents and their rich multilingual metadata sets need improved solutions for efficient content management. This paper presents an indexing and searching solution for cross media content, developed for a Social Network in the domain of Performing Arts. The research aims to cope with the complexity of a heterogeneous indexing semantic model, with tuning techniques for the discrimination of relevant metadata terms. An effectiveness and optimization analysis of the retrieval solution is presented with relevant metrics. The research is conducted in the context of the ECLAP project (http://www.eclap.eu).

Paper Nr: 63
Title:

A Hybrid Solution for Imbalanced Classification Problems - Case Study on Network Intrusion Detection

Authors:

Camelia Lemnaru, Andreea Tudose-Vintila, Andrei Coclici and Rodica Potolea

Abstract: Imbalanced classification problems represent a current challenge for the application of data mining techniques to real-world problems, since learning algorithms are biased towards favoring the majority class(es). The present paper proposes a compound classification architecture for dealing with imbalanced multi-class problems. It comprises a two-level classification system: a multiple classification model on the first level, which combines the predictions of several binary classifiers, and a supplementary classification model, specialized in identifying “difficult” cases, which is currently under development. Particular attention is given to the pre-processing step, with specific data manipulation operations included. A new prediction combination strategy is also proposed, which applies a hierarchical decision process to generate the output prediction. We have performed evaluations using an instantiation of the proposed model applied to the field of network intrusion detection. The evaluations, performed on a dataset derived from the KDD99 data, indicate that our method yields superior performance on the minority classes compared to other similar systems from the literature, without degrading overall performance.

Paper Nr: 64
Title:

A System for Historical Documents Transcription based on Hierarchical Classification and Dictionary Matching

Authors:

Camelia Lemnaru, Andreea Sin-Neamțiu, Mihai-Andrei Vereș and Rodica Potolea

Abstract: Information contained in historical sources is highly important for the research of historians; yet extracting it manually from documents written in difficult scripts is often an expensive and time-consuming process. This paper proposes a modular system for transcribing documents written in a challenging script (German Kurrent Schrift). The solution comprises three main stages: Document Processing, Word Processing and Word Selector, chained together in a linear pipeline. The system is currently under development, with several modules in each stage already implemented and evaluated. The main focus so far has been on the character recognition module, for which a hierarchical classifier is proposed. Preliminary evaluations of the character recognition module have yielded an overall character recognition rate of about 82%, along with a series of groups of confusable characters, for which an additional identification model is currently being investigated. Word composition based on dictionary matching using the Levenshtein distance is also presented.
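
The dictionary-matching step mentioned at the end rests on the standard Levenshtein edit distance; a minimal sketch (with an illustrative, hypothetical dictionary) looks like this:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_match(word, dictionary):
    """Return the dictionary entry closest to the recognised word."""
    return min(dictionary, key=lambda w: levenshtein(word, w))
```

For example, a recogniser output such as "Strase" would be corrected to "Strasse" whenever the latter appears in the dictionary.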

Paper Nr: 67
Title:

A Framework for Situation Inference based on Belief Function Theory

Authors:

Ladislav Beranek

Abstract: The ability to identify the occurrence of a situation is the main function of context-aware systems. The process of identifying a situation is not easy due to the uncertain nature of the processed information. We use belief function theory to detect specific situations on the basis of uncertain sensor data. In this paper, we propose a framework for situation awareness based on belief function theory, applied to determining the occurrence of situations from uncertain sensor data. The framework consists of the processing of situation sensor data (filtering, integration) and of situation detection based on the generation of alternative frames of discernment. A case study demonstrates that the proposed framework is effective and can be used for situation detection.

Paper Nr: 70
Title:

A Keyphrase Extraction Approach for Social Tagging Systems

Authors:

Felice Ferrara and Carlo Tasso

Abstract: Social tagging systems allow people to classify resources using a set of freely chosen terms named tags. However, by shifting the classification task from a set of experts to a larger, untrained set of people, the results of the classification are not accurate. The lack of control and guidelines generates noisy tags (i.e. tags without a clear semantics) which deteriorate the precision of the user-generated classifications. To address this limitation, several tools have been proposed in the literature for suggesting tags that properly describe a given resource. In this paper we propose to suggest n-grams (named keyphrases), following the idea that sequences of two or three terms can better resolve potential ambiguities. More specifically, in this work we identify a set of features which characterize n-grams able to describe meaningful aspects reported in Web pages. By means of these features we developed a mechanism which can support people in manually classifying Web pages by automatically suggesting meaningful keyphrases expressed in English.

Paper Nr: 71
Title:

Knowledge Discovery in the Smart Grid - A Machine Learning Approach

Authors:

Aldo Dagnino

Abstract: The increased availability of cheaper sensing technologies, the implementation of fibre-optic networks, the availability of cheaper data storage repositories, and the development of powerful machine learning models are fundamental components that provide a new facet to the concept of the Smart Power Grid. An important element of the Smart Grid concept is predicting potential fault events in the Smart Power Grid, better known as fault prognostics. This paper discusses an approach that uses machine learning methods to discover fault-event-related knowledge from historical data and helps in the prognostics of fault events in power grids and in critical and expensive components such as power transformers, circuit breakers, and others.

Paper Nr: 73
Title:

LitRec vs. Movielens - A Comparative Study

Authors:

Paula Cristina Vaz, Ricardo Ribeiro and David Martins de Matos

Abstract: Recommendation is an important research area that relies on the availability and quality of the data sets in order to make progress. This paper presents a comparative study between Movielens, a movie recommendation data set that has been extensively used by the recommendation system research community, and LitRec, a newly created data set for content literary book recommendation, in a collaborative filtering set-up. Experiments have shown that when the number of ratings of Movielens is reduced to the level of LitRec, collaborative filtering results degrade and the use of content in hybrid approaches becomes important.

Paper Nr: 80
Title:

A Bayesian Approach for Constructing Ensemble Neural Network

Authors:

Sai Hung Cheung, Yun Zhang and Zhiye Zhao

Abstract: Ensemble neural networks (ENNs) are commonly used in many engineering applications due to their better generalization properties compared with a single neural network (NN). As the NN architecture has a significant influence on the generalization ability of an NN, it is crucial to develop a proper algorithm to design the NN architecture. In this paper, an ENN which combines the component networks using a Bayesian approach and stochastic modelling is proposed. The cross-validation data set is used not only to stop the network training but also to determine the weights of the component networks. The proposed ENN first searches for the best structure of each component network and then employs the Bayesian approach as an automated design tool to determine the best combining weights of the ENN. A peak function is used to assess the accuracy of the proposed ensemble approach. The results show that the proposed ENN outperforms both an ENN obtained by simple averaging and a single NN.
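
The paper's exact Bayesian weighting rule is not reproduced here; as a hedged stand-in for the general idea, component networks can be weighted by a likelihood-like score computed from their cross-validation errors and combined by weighted averaging:

```python
import numpy as np

def ensemble_weights(val_sse):
    """Illustrative stand-in for Bayesian weighting: score each component
    network by exp(-SSE) on the validation set, normalised to sum to 1."""
    scores = np.exp(-np.asarray(val_sse, dtype=float))
    return scores / scores.sum()

def ensemble_predict(component_preds, weights):
    """Weighted average of the component networks' predictions
    (component_preds has shape components x samples)."""
    return np.asarray(component_preds, dtype=float).T @ weights
```

Networks that fit the validation data better thus dominate the combination, while poor components are smoothly down-weighted rather than discarded.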

Paper Nr: 86
Title:

Architecture for a Garbage-less and Fresh Content Search Engine

Authors:

Víctor M. Prieto, Manuel Álvarez, Rafael López García and Fidel Cacheda

Abstract: This paper presents the architecture of a Web search engine that integrates solutions for several state-of-the-art problems, such as Web Spam and Soft-404 detection, content updating and resource use. To this end, the system incorporates a Web Spam detection module based on techniques that have been presented in previous works and whose success has been assessed on well-known public datasets. For Soft-404 pages we propose some new techniques that improve on the ones described in the state of the art. Finally, a last module allows the search engine to detect when a page has changed by considering user interaction. The tests we have performed allow us to conclude that, with the proposed architecture, it is possible to achieve important improvements in the efficacy and efficiency of crawling systems. This has repercussions on the content that is provided to the users.

Paper Nr: 87
Title:

A New Compaction Algorithm for LCS Rules - Breast Cancer Dataset Case Study

Authors:

Faten Kharbat, Larry Bull and Mohammed Odeh

Abstract: This paper introduces a new compaction algorithm for the rules generated by learning classifier systems that overcomes the disadvantages of previous algorithms in complexity, compacted solution size, accuracy and usability. The algorithm is tested on the Wisconsin Breast Cancer (WBC) dataset, a well-known breast cancer dataset from the UCI Machine Learning Repository.

Paper Nr: 89
Title:

Classification of Datasets with Frequent Itemsets is Wild

Authors:

Natalia Vanetik

Abstract: The problem of dataset classification with frequent itemsets is defined as the problem of determining whether or not two different datasets have the same frequent itemsets without computing these itemsets explicitly. The reasoning behind this approach is the high computational cost of computing frequent itemsets. Finding well-defined and understandable normal forms for this classification task would be a breakthrough in the field of dataset classification. The paper proves that classification of datasets with frequent itemsets is a hopeless task, since canonical forms do not exist for this problem.

Paper Nr: 103
Title:

On the Generation of Dynamic Business Indicators

Authors:

Fábio Alexandre Pereira dos Santos, Rui César das Neves and Joaquim Belo Filipe

Abstract: While information is rapidly gaining relevance to organizations, the systems that help companies analyse that information need to improve their effectiveness at several layers. In our ongoing research work we address mainly the presentation layer and the business layer of software systems development. The aim of this position paper is to discuss how to develop a system that is as flexible and configurable as possible, allowing multiple methods of analysing and visualizing the data relevant to an organization and letting users define certain views on business data with appropriate graphics. One of the values of this system is mastery of the technology in a controlled environment: it helps solve a specific problem while remaining a generic, autonomous tool with a high degree of adaptability. Flexibility goes hand in hand with ensuring the consistency of the data model, i.e. the data being analysed by a user must respect syntactic and semantic constraints when related to each other according to the organization’s logic. This feature prevents the user from attempting to connect data that are not related. Based on the data types described in the metadata of the data model, the system provides users with a list of possible graphical representations for the selected information type. This list is filtered so that the user can select only graphical representation types that are appropriate to the selected data types. This is an innovative feature, in the sense that the system constrains the selection of visualization elements, thus avoiding potential conceptual errors.