KDIR 2011 Abstracts


Full Papers
Paper Nr: 14
Title:

A FAST ALGORITHM FOR MINING GRAPHS OF PRESCRIBED CONNECTIVITY

Authors:

Natalia Vanetik

Abstract: Many real-life data sets, such as social and biological networks and biochemical data, are naturally and easily modeled as large labeled graphs. Finding patterns of interest in these graphs is an important task; due to the nature of the data, not all of the patterns need to be taken into account. Intuitively, a pattern with high connectivity implies a strong connection between data items. In this paper, we present a novel algorithm for finding frequent graph patterns with prescribed connectivity in large single-graph data sets. We employ the Dinitz-Karzanov-Lomonosov cactus minimum-cut structure of a graph to perform the task efficiently. We also prove that the suggested algorithm generates no more candidate graphs than any other algorithm whose graph extension procedure we use at the first step.

Paper Nr: 30
Title:

CASCADE OF MULTI-LEVEL MULTI-INSTANCE CLASSIFIERS FOR IMAGE ANNOTATION

Authors:

Cam-Tu Nguyen, Ha Vu Le and Takeshi Tokuyama

Abstract: This paper introduces a new scheme for automatic image annotation based on cascading multi-level multi-instance classifiers (CMLMI). The proposed scheme employs a hierarchy for visual feature extraction, in which the feature set includes features extracted from the whole image at the coarsest level and from overlapping sub-regions at finer levels. Multi-instance learning (MIL) is used to learn the “weak classifiers” for these levels in a cascade manner. The underlying idea is that the coarse levels are suitable for background labels such as “forest” and “city”, while finer levels bring useful information about foreground objects like “tiger” and “car”. The cascade manner allows this scheme to incorporate “important” negative samples during the learning process, hence reducing the “weakly labeling” problem by excluding ambiguous background labels associated with the negative samples. Experiments show that CMLMI achieves significant improvements over baseline methods as well as existing MIL-based methods.

Paper Nr: 45
Title:

CONCEPT DISCOVERY FOR LANGUAGE UNDERSTANDING IN AN INFORMATION-QUERY DIALOGUE SYSTEM

Authors:

Nathalie Camelin, Boris Detienne, Stéphane Huet, Dominique Quadri and Fabrice Lefevre

Abstract: Most recent efficient statistical approaches for natural language understanding require a segmental annotation of training data. Such an annotation requires both determining the concepts in a sentence and linking them to their corresponding word segments. In this paper we propose a two-step alternative to the fully manual annotation of data: an initial unsupervised concept discovery, based on latent Dirichlet allocation, is followed by an automatic segmentation using integer linear optimisation. The relation between discovered topics and task-dependent concepts is evaluated on a spoken dialogue task for which a reference annotation is available. Topics and concepts are shown to be close enough to achieve a potential reduction of one half of the manual annotation cost.

Paper Nr: 49
Title:

A COMPARATIVE EVALUATION OF PROXIMITY MEASURES FOR SPECTRAL CLUSTERING

Authors:

Nadia Farhanaz Azam and Herna L. Viktor

Abstract: A cluster analysis algorithm is considered successful when the data is clustered into meaningful groups so that the objects in the same group are similar, and the objects residing in two different groups are different from one another. One such cluster analysis algorithm, the spectral clustering algorithm, has been deployed across numerous domains ranging from image processing to clustering protein sequences with a wide range of data types. The input, in this case, is a similarity matrix, constructed from the pair-wise similarity between the data objects. The pair-wise similarity between the objects is calculated by employing a proximity (similarity, dissimilarity or distance) measure. The success of a spectral clustering algorithm therefore heavily depends on the selection of the proximity measure. While the majority of prior research on the spectral clustering algorithm emphasizes algorithm-specific issues, little research has been performed on evaluating the performance of the proximity measures. To this end, we perform a comparative and exploratory analysis of several existing proximity measures to evaluate their suitability for the spectral clustering algorithm. Our results indicate that the commonly used Euclidean distance measure may not always be a good choice, especially in domains where the data is highly imbalanced and the correct clustering of the boundary objects is crucial. Furthermore, for numeric data, measures based on relative distances often yield better results than measures based on absolute distances, specifically when aiming to cluster boundary objects. When considering mixed data, the measure for numeric data has the highest impact on the final outcome and, again, the use of the Euclidean measure may be inappropriate.
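The pipeline this abstract evaluates (proximity measure → similarity matrix → spectral embedding) can be sketched as follows. This is a generic illustration rather than the authors' experimental setup; the Gaussian kernel, its width sigma and the toy data are assumptions standing in for whichever proximity measure is under evaluation.

```python
import numpy as np

def affinity_matrix(X, sigma=1.0):
    """Pairwise similarity via a Gaussian (RBF) kernel on Euclidean distance.
    Swapping this function is exactly where a different proximity measure
    would enter the spectral clustering pipeline."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def spectral_embedding(W, k=2):
    """Embed objects as rows of the k leading eigenvectors of the
    normalized affinity D^{-1/2} W D^{-1/2}; k-means on these rows
    would complete a standard spectral clustering."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    return vecs[:, -k:]                  # eigenvectors of the k largest

# Two well-separated 1-D "blobs" as illustrative data:
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
emb = spectral_embedding(affinity_matrix(X, sigma=0.5))
```

The embedding separates the two blobs cleanly; with an ill-suited proximity measure the separation in the embedded space degrades, which is the effect the paper measures.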

Paper Nr: 50
Title:

MULTI-OUTPUT RANKING FOR AUTOMATED REASONING

Authors:

Daniel Kühlwein, Josef Urban, Evgeni Tsivtsivadze, Herman Geuvers and Tom Heskes

Abstract: Premise selection and ranking is a pressing problem for applications of automated reasoning to large formal theories and knowledge bases. Smart selection of premises has a significant impact on the efficiency of automated proof assistant systems in large theories. Despite this, machine-learning methods for this domain are underdeveloped. In this paper we propose a general learning algorithm to address the premise selection problem. Our approach consists of simultaneous training of multiple predictors that learn to rank a set of premises in order to estimate their expected usefulness when proving a new conjecture. The proposed algorithm efficiently constructs prediction functions and can take correlations among multiple tasks into account. The experiments demonstrate that the proposed method significantly outperforms algorithms previously applied to the task.

Paper Nr: 62
Title:

TIME SERIES SEGMENTATION AS A DISCOVERY TOOL - A Case Study of the US and Japanese Financial Markets

Authors:

Jian Cheng Wong, Gladys Hui Ting Lee, Yiting Zhang, Woei Shyr Yim, Robert Paulo Fornia, Danny Yuan Xu, Jun Liang Kok and Siew Ann Cheong

Abstract: In this paper we explain how the dynamics of a complex system can be understood in terms of the low-dimensional manifolds (phases) it settles onto, described by slowly varying effective variables. We then explain how we can discover these phases by grouping the large number of microscopic time series or time series segments, based on their statistical similarities, into a small number of time series classes, each representing a distinct phase. We describe a specific recursive scheme for time series segmentation based on the Jensen-Shannon divergence, and check its performance against artificial time series data. We then apply the method to the high-frequency time series data of various US and Japanese financial market indices, where we find that the time series segments can be very naturally grouped into four to six classes, corresponding roughly to economic growth, economic crisis, market correction, and market crash. From a single time series, we can estimate the lifetimes of these macroeconomic phases, and also identify potential triggers for each phase transition. From a cross section of time series, we can further estimate the transition times, and also arrive at an unbiased and detailed picture of how financial markets react to internal or external stimuli.
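The divergence at the heart of the segmentation scheme has a compact closed form. The sketch below is a minimal stdlib implementation of the Jensen-Shannon divergence (base-2, so it is bounded in [0, 1]); the two example "segment statistics" are illustrative, not data from the paper.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in p if x > 0)

def jensen_shannon(p, q):
    """JSD(P, Q) = H((P+Q)/2) - (H(P) + H(Q)) / 2, in [0, 1] bit."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

# Divergence between the empirical symbol distributions of two candidate
# segments; a recursive segmenter would split where this score peaks.
left  = [0.7, 0.2, 0.1]
right = [0.1, 0.2, 0.7]
score = jensen_shannon(left, right)
```

Unlike the Kullback-Leibler divergence, JSD is symmetric and always finite, which is what makes it usable as a splitting criterion on arbitrary empirical distributions.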

Paper Nr: 74
Title:

CHARACTERIZING RELATIONSHIPS THROUGH CO-CLUSTERING - A Probabilistic Approach

Authors:

Nicola Barbieri, Gianni Costa, Giuseppe Manco and Ettore Ritacco

Abstract: In this paper we propose a probabilistic co-clustering approach for pattern discovery in collaborative filtering data. We extend the Block Mixture Model in order to learn about the structures and relationships within preference data. The resulting model can simultaneously cluster users into communities and items into categories. Besides its predictive capabilities, the model enables the discovery of significant knowledge patterns, such as the analysis of common trends and relationships between items and users within communities/categories. We reformulate the mathematical model and implement a parameter estimation technique. Next, we show how the model parameters enable pattern discovery tasks, namely: (i) inferring topics for each item category and characteristic items for each user community; (ii) modeling community interests and transitions among topics. Experiments on MovieLens data provide evidence of the effectiveness of the proposed approach.

Paper Nr: 84
Title:

DYNAMIC ANALYSIS OF MALWARE USING DECISION TREES

Authors:

Ravinder R. Ravula, Chien-Chung Chan and Kathy J. Liszka

Abstract: Detecting new and unknown malware is a major challenge in today's software security profession. Most existing works on malware detection are based on static features of malware. In this work, we applied a reverse engineering process to extract static and behavioural features from malware. Two data sets are created based on reversed features and API Call features. Essential features are identified by applying Weka's J48 decision tree classifier to 582 malware and 521 benign software samples collected from the Internet. The performance of the decision tree and Naïve Bayes classifiers is evaluated by 5-fold cross validation with 80-20 splits of the training sets. Experimental results show that the Naïve Bayes classifier has better performance on the smaller data set with 12 reversed features, while J48 has better performance on the data set created from the API Call data with 141 features.

Paper Nr: 85
Title:

A NOVEL QUERY EXPANSION TECHNIQUE BASED ON A MIXED GRAPH OF TERMS

Authors:

Fabio Clarizia, Francesco Colace, Massimo De Santo, Luca Greco and Paolo Napoletano

Abstract: It is well known that one way to improve the accuracy of a text retrieval system is to expand the original query with additional knowledge coded through topic-related terms. In the case of an interactive environment, the expansion, which is usually represented as a list of words, is extracted from documents whose relevance is known thanks to the feedback of the user. In this paper we argue that the accuracy of a text retrieval system can be improved if we employ a query expansion method based on a mixed Graph of Terms representation instead of a method based on a simple list of words. The graph, which is composed of a directed and an undirected subgraph, can be automatically extracted from a small set of relevant documents only (namely, the user feedback) using a method for term extraction based on the probabilistic Topic Model. The evaluation of the proposed method has been carried out by performing a comparison with two less complex structures: one represented as a set of pairs of words and another that is a simple list of words.

Paper Nr: 88
Title:

A WEAKLY SUPERVISED APPROACH FOR LARGE-SCALE RELATION EXTRACTION

Authors:

Ludovic Jean-Louis, Romaric Besançon, Olivier Ferret and Adrien Durand

Abstract: Standard Information Extraction (IE) systems are designed for a specific domain and a limited number of relations. Recent work has been undertaken to deal with large-scale IE systems. Such systems are characterized by a large number of relations and no restriction on the domain, which makes the definition of manual resources or the use of supervised techniques difficult. In this paper, we present a large-scale IE system based on a weakly supervised method of pattern learning. This method uses pairs of entities known to be in relation to automatically extract example sentences from which the patterns are learned. We present the results of this system on the data from the KBP task of the TAC 2010 evaluation campaign.

Paper Nr: 94
Title:

A TRANSACTIONAL APPROACH TO ASSOCIATIVE XML CLASSIFICATION BY CONTENT AND STRUCTURE

Authors:

Gianni Costa, Ettore Ritacco and Riccardo Ortale

Abstract: We propose XCCS, short for XML Classification by Content and Structure, a new approach for the induction of intelligible classification models for XML data, which provide valuable support for more effective and efficient XML search, retrieval and filtering. The idea behind XCCS is to represent each XML document as a transaction in a space of boolean features that are informative of its content and structure. Suitable algorithms are developed to learn associative classifiers from the transactional representation of the XML data. XCCS induces very compact classifiers whose effectiveness outperforms that of several established competitors.

Paper Nr: 107
Title:

GRAPHICAL MODELS FOR RELATIONS - Modeling Relational Context

Authors:

Volker Tresp, Yi Huang, Xueyan Jiang and Achim Rettinger

Abstract: We derive a multinomial sampling model for analyzing the relationships between two or more entities. The parameters in the multinomial model are derived from factorizing multi-way contingency tables. We show how contextual information can be included and propose a graphical representation of model dependencies. The graphical representation allows us to decompose a multivariate domain into interactions involving only a small number of variables. The approach formulates a probabilistic generative model for a single relation. By construction, the approach can easily deal with missing relations. We apply our approach to a social network domain where we predict the event that a user watches a movie. Our approach permits the integration of both information about the last movie watched by a user and a general temporal preference for a movie.

Paper Nr: 146
Title:

A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION

Authors:

Filippo Geraci and Marco Maggini

Abstract: Nowadays most Web pages are automatically assembled by content management systems or editing tools that apply a fixed template to give a uniform structure to all the documents belonging to the same site. The template usually contains side information, such as navigation bars, menus, banners and advertisements, that is aimed at improving the users' browsing experience but may hinder tools for the automatic processing of Web documents. In this paper, we present a novel template removal technique that exploits a sequence alignment algorithm from bioinformatics to automatically extract the template from a quite small sample of pages from the same site. The algorithm detects the common structure of HTML tags among pairs of pages and merges the partial hypotheses using a binary-tree consensus schema. The experimental results show that the algorithm attains good precision and recall in the retrieval of the real template structure exploiting just 16 sample pages from the site. Moreover, the positive impact of the template removal technique is shown on a Web page clustering task.

Short Papers
Paper Nr: 9
Title:

PROGRAMMING THE KDD PROCESS USING XQUERY

Authors:

Andrea Romei and Franco Turini

Abstract: XQuake is a language and system for programming data mining processes over native XML databases in the spirit of inductive databases. It extends XQuery to support KDD tasks. This paper focuses on the features required in the definition of the steps of the mining process. The main objective is to show the expressiveness of the language in handling mining operations as an extension of basic XQuery expressions. To this end, the paper offers an extended application to the analysis of web logs.

Paper Nr: 13
Title:

HASHMAX: A NEW METHOD FOR MINING MAXIMAL FREQUENT ITEMSETS

Authors:

Natalia Vanetik and Ehud Gudes

Abstract: Mining maximal frequent itemsets is a fundamental problem in many data mining applications, especially in the case of dense data, where the search space is exponential. We propose a top-down algorithm that employs hashing techniques, named HashMax, in order to generate maximal frequent itemsets efficiently. An empirical evaluation of our algorithm in comparison with the state-of-the-art maximal frequent itemset generation algorithm GenMax shows the advantage of HashMax in the case of dense datasets with a large number of maximal frequent itemsets.
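HashMax itself is not reproduced in the abstract, but the object it mines has a crisp definition: a frequent itemset none of whose proper supersets is frequent. The brute-force stdlib sketch below only illustrates that definition (it enumerates all candidates, which is exactly the exponential cost that algorithms like HashMax and GenMax avoid); the transactions and support threshold are illustrative.

```python
from itertools import combinations

def maximal_frequent_itemsets(transactions, minsup):
    """Reference implementation: return the frequent itemsets that have
    no frequent proper superset. Exponential in the number of items;
    for illustration only."""
    items = sorted({i for t in transactions for i in t})
    frequent = []
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = set(cand)
            if sum(s <= t for t in transactions) >= minsup:  # support count
                frequent.append(s)
    # keep only sets with no frequent proper superset
    return [f for f in frequent if not any(f < g for g in frequent)]

T = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
maximal = maximal_frequent_itemsets(T, minsup=2)   # {a,b}, {a,c}, {b,c}
```

Here {a, b, c} occurs in only one transaction, so each frequent pair is maximal; the maximal sets compactly summarize all seven frequent itemsets.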

Paper Nr: 21
Title:

SEMANTIC MINING OF DOCUMENTS IN A RELATIONAL DATABASE

Authors:

Kunal Mukerjee, Todd Porter and Sorin Gherman

Abstract: Automatically mining entities, relationships, and semantics from unstructured documents and storing these in relational tables greatly simplifies and unifies the work flows and user experiences of database products at the enterprise. This paper describes three linear-scale, incremental, and fully automatic semantic mining algorithms that are at the foundation of the new Semantic Platform being released in the next version of SQL Server. The target workload is large enterprise document corpora (10–100 million documents). At these scales, anything short of linear-scale and incremental is costly to deploy. These three algorithms give rise to three weighted physical indexes: the Tag Index (top keywords in each document); the Document Similarity Index (top closely related documents, given any document); and the Phrase Similarity Index (top semantically related phrases, given any phrase), which are then queryable through the SQL interface. The need for these three indexes was motivated by observing typical stages of document research, and by gap analysis of current tools and technology at the enterprise. We describe the mining algorithms and architecture, and outline some compelling user experiences that are enabled by these indexes.

Paper Nr: 27
Title:

WIKIPEDIA AS DOMAIN KNOWLEDGE NETWORKS - Domain Extraction and Statistical Measurement

Authors:

Zheng Fang, Jie Wang, Benyuan Liu and Weibo Gong

Abstract: This paper investigates knowledge networks of specific domains extracted from Wikipedia and performs statistical measurements on selected domains. In particular, we first present an efficient method to extract a specific domain knowledge network from Wikipedia. We then extract four domain networks on, respectively, mathematics, physics, biology, and chemistry. We compare the mathematics domain network extracted from Wikipedia with MathWorld, the web's most extensive mathematical resource, created and maintained by professional mathematicians, and show that they are statistically similar to each other. This indicates that MathWorld and Wikipedia's mathematics domain knowledge share a similar internal structure. Such information may be useful for investigating knowledge networks.

Paper Nr: 32
Title:

MULTI-CLASS DATA CLASSIFICATION FOR IMBALANCED DATA SET USING COMBINED SAMPLING APPROACHES

Authors:

Wanthanee Prachuabsupakij and Nuanwan Soonthornphisaj

Abstract: Two important challenges in machine learning are the imbalanced class problem and multi-class classification, because several real-world applications have imbalanced class distributions and involve the classification of data into multiple classes. The primary problem of classification in imbalanced data sets concerns the measurement of performance: standard learning algorithms tend to be biased towards the majority class and to ignore the minority class. This paper presents a new approach (KSAMPLING), which is a combination of k-means clustering and sampling methods. The k-means algorithm is used for splitting the dataset into two clusters. After that, we combine two types of sampling technique, over-sampling and under-sampling, to re-balance the class distribution. We have conducted experiments on five highly imbalanced datasets from the UCI repository. Decision trees are used to classify the data. The experimental results show that the prediction performance of KSAMPLING is better than that of state-of-the-art methods in terms of AUC, and the F-measure is also improved.
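KSAMPLING's distinctive step is the k-means split, which is not reproduced here; the stdlib sketch below only illustrates the subsequent re-balancing step it combines, i.e. under-sampling the majority class and over-sampling the minority class toward a common size. The class labels and target-size rule are illustrative assumptions.

```python
import random

def rebalance(samples, labels, seed=42):
    """Naive combined sampling: under-sample classes above the mean class
    size (without replacement) and over-sample classes below it (with
    replacement), so every class ends up the same size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = sum(len(v) for v in by_class.values()) // len(by_class)
    out = []
    for y, xs in by_class.items():
        if len(xs) >= target:                       # under-sample majority
            chosen = rng.sample(xs, target)
        else:                                       # over-sample minority
            chosen = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in chosen)
    return out

# 10 majority vs. 2 minority examples -> 6 of each after re-balancing:
data = rebalance(list(range(12)), ['maj'] * 10 + ['min'] * 2)
```

After re-balancing, a standard learner such as a decision tree no longer sees a skewed class prior, which is the effect the AUC and F-measure gains in the abstract are attributed to.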

Paper Nr: 39
Title:

UNSUPERVISED HANDWRITTEN GRAPHICAL SYMBOL LEARNING - Using Minimum Description Length Principle on Relational Graph

Authors:

Jinpeng Li, Christian Viard-Gaudin and Harold Mouchere

Abstract: Generally, the approaches encountered in the field of handwriting recognition require knowledge of the symbol set, and of as many ground-truthed samples as possible, so that machine learning based approaches can be implemented. In this work, we propose the discovery of the symbol set used in the context of a graphical language produced by on-line handwriting. We consider the case of a two-dimensional graphical language, such as mathematical expression composition, where not only left-to-right layouts have to be considered. Firstly, we select relevant graphemes using hierarchical clustering. Secondly, we build a relational graph between the strokes defining a handwritten expression. Thirdly, we extract the lexicon, a set of graph substructures, using the minimum description length principle. For the assessment of the extracted lexicon, a hierarchical segmentation task is introduced. From the experiments we conducted, a recall rate of 84.2% is reported on the test part of our database produced by 100 writers.

Paper Nr: 46
Title:

A TALE OF TWO (SIMILAR) CITIES - Inferring City Similarity through Geo-spatial Query Log Analysis

Authors:

Rohan Seth, Michele Covell, Deepak Ravichandran, D. Sivakumar and Shumeet Baluja

Abstract: Understanding the backgrounds and interests of the people who are consuming a piece of content, such as a news story, video, or music, is vital for the content producer as well as the advertisers who rely on the content to provide a channel on which to advertise. We extend traditional search-engine query log analysis, which has primarily concentrated on analyzing either single or small groups of queries or users, to examining the complete query stream of very large groups of users – the inhabitants of 13,377 cities across the United States. Query logs can be a good representation of the interests of the city's inhabitants and a useful characterization of the city itself. Further, we demonstrate how query logs can be effectively used to gather city-level statistics sufficient for providing insights into the similarities and differences between cities. Cities that are found to be similar through query analysis correspond well to the similar cities determined through other large-scale and time-consuming direct measurement studies, such as those undertaken by the Census Bureau.

Paper Nr: 51
Title:

FINDING THE RIGHT EXPERT - Discriminative Models for Expert Retrieval

Authors:

Philipp Sorg and Philipp Cimiano

Abstract: We tackle the problem of expert retrieval in Social Question Answering (SQA) sites. In particular, we consider the task of, given an information need in the form of a question posted in a SQA site, ranking potential experts according to the likelihood that they can answer the question. We propose a discriminative model (DM) that makes it possible to combine different sources of evidence in a single retrieval model using machine learning techniques. The features used as input for the discriminative model comprise features derived from language models, standard probabilistic retrieval functions and features quantifying the popularity of an expert in the category of the question. As input for the DM, we propose a novel feature design that makes it possible to exploit language models as features. We perform experiments and evaluate our approach on a dataset extracted from Yahoo! Answers, recently used as a benchmark in the CriES Workshop, and show that our proposed approach outperforms i) standard probabilistic retrieval models, ii) a state-of-the-art expert retrieval approach based on language models, as well as iii) an established learning to rank model.

Paper Nr: 60
Title:

COMBINING TWO LAZY LEARNING METHODS FOR CLASSIFICATION AND KNOWLEDGE DISCOVERY - A Case Study for Malignant Melanoma Diagnosis

Authors:

Eva Armengol and Susana Puig

Abstract: The goal of this paper is to construct a classifier for diagnosing malignant melanoma. We experimented with two lazy learning methods, k-NN and LID, and compared their results with the ones produced by decision trees. We performed this comparison because we are also interested in building a domain model that can serve as a basis for dermatologists to propose a good characterization of early melanomas. We show that lazy learning methods have better performance than decision trees in terms of sensitivity and specificity. We have seen that both lazy learning methods produce complementary results (k-NN has high specificity and LID has high sensitivity), suggesting that a combination of both could be a good classifier. We report experiments confirming this point. Concerning the construction of a domain model, we propose to use the explanations provided by the lazy learning methods, and we see that the resulting theory is as predictive and useful as the one obtained from decision trees.

Paper Nr: 66
Title:

LEARNING SIMILARITY FUNCTIONS FOR EVENT IDENTIFICATION USING SUPPORT VECTOR MACHINES

Authors:

Timo Reuter and Philipp Cimiano

Abstract: Every clustering algorithm requires a similarity measure, ideally optimized for the task in question. In this paper we are concerned with the task of identifying events in social media data and address the question of how a suitable similarity function can be learned from training data for this task. The task consists essentially in grouping social media documents by the event they belong to. In order to learn a similarity measure using machine learning techniques, we extract relevant events from last.fm and match the unique machine tags for these events to pictures uploaded to Flickr, thus obtaining a gold standard where each picture is assigned to its corresponding event. We evaluate the similarity measure with respect to accuracy on the task of assigning a picture to its correct event. We use SVMs to train an appropriate similarity measure and investigate the performance of different types of SVMs (Ranking SVMs vs. Standard SVMs), different strategies for creating training data, as well as the impact of the amount of training data and the kernel used. Our results show that a suitable similarity measure can be learned from a few examples only, given a suitable strategy for creating training data. We also show that i) Ranking SVMs can learn from fewer examples, ii) are more robust compared to standard SVMs in the sense that their performance does not vary significantly for different sizes and samples of training data, and iii) are not as prone to overfitting as standard SVMs.

Paper Nr: 68
Title:

MULTI-SCALE COMMUNITY DETECTION USING STABILITY AS OPTIMISATION CRITERION IN A GREEDY ALGORITHM

Authors:

Erwan Le Martelot and Chris Hankin

Abstract: Whether biological, social or technical, many real systems are represented as networks whose structure can be very informative regarding the original system's organisation. In this respect the field of community detection has received a lot of attention in the past decade. Most of the approaches rely on the notion of modularity to assess the quality of a partition and use this measure as an optimisation criterion. Recently stability was introduced as a new partition quality measure encompassing former partition quality measures such as modularity. The work presented here assesses stability as an optimisation criterion in a greedy approach similar to modularity optimisation techniques, and enables multi-scale analysis using Markov time as a resolution parameter. The method is validated and compared with other popular approaches on synthetic and various real data networks, and the results show that the method enables accurate multi-scale network analysis.
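Stability evaluated at Markov time 1 recovers the familiar modularity measure that most greedy community detection methods optimise. The stdlib sketch below computes only that modularity special case, Q = Σ_c [ l_c/m − (d_c/2m)² ] (l_c = intra-community edges, d_c = total degree of community c, m = total edges); the two-triangle example graph is an illustrative assumption, not data from the paper.

```python
def modularity(edges, communities):
    """Newman modularity of a partition of an undirected, unweighted graph
    given as an edge list: intra-community edge fraction minus the
    null-model expectation based on node degrees."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    two_m = 2 * len(edges)
    comm = {n: c for c, nodes in enumerate(communities) for n in nodes}
    q = sum(2.0 for u, v in edges if comm[u] == comm[v]) / two_m
    for nodes in communities:
        q -= sum(degree[n] for n in nodes) ** 2 / (two_m ** 2)
    return q

# Two triangles joined by a single bridge edge:
E = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q = modularity(E, [{0, 1, 2}, {3, 4, 5}])   # 6/7 - 1/2 = 5/14
```

A greedy optimiser repeatedly merges the pair of communities whose merge increases this score most; replacing Q with stability at Markov time t lets the same greedy loop sweep through coarser or finer partitions as t varies.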

Paper Nr: 69
Title:

HANDLING THE IMPACT OF LOW FREQUENCY EVENTS ON CO-OCCURRENCE BASED MEASURES OF WORD SIMILARITY - A Case Study of Pointwise Mutual Information

Authors:

François Role and Mohamed Nadif

Abstract: Statistical measures of word similarity are widely used in many areas of information retrieval and text mining. Among popular word co-occurrence based measures is Pointwise Mutual Information (PMI). Although widely used, PMI has a well-known tendency to give excessive scores of relatedness to word pairs that involve low frequency words. Many variants of it have therefore been proposed, which correct this bias empirically. In contrast to this empirical approach, we propose formulae and indicators that describe the behavior of these variants in a precise way, so that researchers and practitioners can make a more informed decision as to which measure to use in different scenarios.
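The low-frequency bias the abstract refers to is visible directly in the definition PMI(x, y) = log₂( p(x, y) / (p(x) p(y)) ). A minimal sketch from raw counts (the counts themselves are illustrative):

```python
from math import log2

def pmi(c_xy, c_x, c_y, n):
    """Pointwise Mutual Information from co-occurrence counts over n
    observation windows: log2( p(x,y) / (p(x) * p(y)) )."""
    return log2((c_xy / n) / ((c_x / n) * (c_y / n)))

# The bias: a pair seen once, made of words each seen only once,
# reaches the maximum possible score log2(n)...
rare_pair_score = pmi(1, 1, 1, 10000)       # log2(10000), about 13.3

# ...while a genuinely strong, frequent association scores far lower.
common_pair_score = pmi(50, 100, 100, 10000)   # log2(50), about 5.6
```

The variants the paper analyses (e.g. discounted or normalized PMI) rescale this quantity so that hapax pairs no longer dominate the ranking.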

Paper Nr: 77
Title:

SEMANTIC ENRICHMENT OF CONTEXTUAL ADVERTISING BY USING CONCEPTS

Authors:

Giuliano Armano, Alessandro Giuliani and Eloisa Vargiu

Abstract: This paper focuses on Contextual Advertising, which is devoted to displaying commercial ads within the content of third-party Web pages. In the literature, several approaches estimate the relevance of an ad based only on syntactic techniques. However, these approaches may lead to the choice of a remarkable number of irrelevant ads. To overcome these drawbacks, solutions that combine a semantic phase with a syntactic phase have been proposed. Framed within this line of work, we propose an approach that uses a semantic network able to supply commonsense knowledge. To this end, we developed and implemented a system that uses the ConceptNet 3 database. To the best of our knowledge, this is the first attempt to use information provided by ConceptNet in the field of Contextual Advertising. Several experiments have been performed aimed at comparing the proposed system with a state-of-the-art system. Preliminary results show that the proposed system performs better.

Paper Nr: 78
Title:

A NEW FREQUENT SIMILAR TREE ALGORITHM MOTIVATED BY DOM MINING - Using RTDM and its New Variant — SiSTeR

Authors:

Omer Barkol, Ruth Bergman and Shahar Golan

Abstract: The importance of recognizing repeating structures in web applications has generated a large body of work on algorithms for mining the HTML Document Object Model (DOM). A restricted tree edit distance metric, called the Restricted Top Down Metric (RTDM), is most suitable for DOM mining as well as less computationally expensive than the general tree edit distance. Given two trees with input sizes n1 and n2, the current methods take time O(n1 · n2) to compute RTDM. Consider, however, looking for patterns that form subtrees within a web page with n elements. The RTDM must be computed for all subtrees, and the running time becomes O(n4). This paper proposes a new algorithm which computes the distance between all the subtrees in a tree in time O(n2), which enables us to obtain better quality as well as better performance on a DOM mining task. In addition, we propose a new tree edit distance, SiSTeR (Similar Sibling Trees aware RTDM). This variant of RTDM allows considering the case where repetitious (very similar) subtrees appear in different numbers in two trees that should nonetheless be considered similar.

Paper Nr: 81
Title:

USERS INTEREST PREDICTION MODEL - Based on 2nd Markov Model and Inter-transaction Association Rules

Authors:

Yonggong Ren and Alma Leora Culén

Abstract: The 2nd Markov Model and inter-transaction association rules are both known as key technologies for building user interest prediction models. The use of these technologies potentially improves the users' surfing experience. The use of the 2nd Markov Model increases the accuracy of predictions, but it does not cover all the data. Therefore, in this paper we propose a dual strategy for a user interest prediction model that covers the entire data set and improves the accuracy of inter-transaction association rules. The foundation of our dual strategy is a new method of building a database based on the degree of user interest. In addition, we integrate the 2nd Markov Model and inter-transaction association rules for predicting future browsing patterns of users. Experimental results show that this method provides more accurate prediction results than previous similar research.

Paper Nr: 82
Title:

ESTIMATION OF IMPLICIT USER INFLUENCE FROM PROXY LOGS - An Empirical Study on the Effects of Time Difference and Popularity

Authors:

Tomonobu Ozaki and Minoru Etho

Abstract: In this paper, we propose a framework for estimating implicit user influence from proxy logs. For the estimation, we employ a vector representation of user interactions obtained from log data, taking into account the popularity of web pages and differences in access time to them. One of the key issues for successful estimation is how to model popularity and time difference. Since appropriate models depend on the application domain, we propose several models of each. We confirm the effectiveness of the proposed framework by conducting experiments on web page recommendation and community discovery on real proxy logs.

Paper Nr: 89
Title:

INFERRING THE SCOPE OF SPECULATION USING DEPENDENCY ANALYSIS

Authors:

Miguel Ballesteros, Virginia Francisco, Alberto Díaz, Jesús Herrera and Pablo Gervás

Abstract: In the last few years, speculation detection systems for biomedical texts have been developed successfully, most of them using machine-learning approaches. In this paper we present a system that finds the scope of speculation in English sentences by means of dependency syntactic analysis. It infers which words are affected by speculation by browsing dependency syntactic structures: first, an algorithm detects hedge cues; second, the scope of these hedge cues is computed. We tested the system on the Bioscope corpus, which is annotated with speculation, obtaining competitive results compared with state-of-the-art systems.

Paper Nr: 91
Title:

MEASURING TWITTER USER SIMILARITY AS A FUNCTION OF STRENGTH OF TIES

Authors:

John Conroy, Josephine Griffith and Colm O’Riordan

Abstract: Users of online social networks reside in social graphs, where any given user-pair may be connected or unconnected. These connections may be formal or inferred social links, and may be binary or weighted. We might expect that users who are connected by a social tie are more similar in what they write than unconnected users, and that more strongly connected pairs of users are even more similar than less strongly connected users, but this has never been formally tested. This work describes a method for calculating the similarity between Twitter social entities based on what they have written, before examining the similarity between Twitter user-pairs as a function of how tightly connected they are. We show that the similarity between pairs of Twitter users is indeed positively correlated with the strength of the tie between them.

Paper Nr: 95
Title:

USE OF DOMAIN KNOWLEDGE FOR DIMENSION REDUCTION - Application to Mining of Drug Side Effects

Authors:

Emmanuel Bresso, Sidahmed Benabderrahmane, Malika Smail-Tabbone, Gino Marchetti, Arnaud Sinan Karaboga, Michel Souchet, Amedeo Napoli and Marie-Dominique Devignes

Abstract: High dimensionality of datasets can impair the execution of most data mining programs and lead to the production of numerous and complex patterns, inappropriate for interpretation by the experts. Thus, dimension reduction of datasets constitutes an important research orientation in which the role of domain knowledge is essential. We present here a new approach for reducing dimensions in a dataset by exploiting semantic relationships between terms of an ontology structured as a rooted directed acyclic graph. Term clustering is performed thanks to the recently described IntelliGO similarity measure and the term clusters are then used as descriptors for data representation. The strategy reported here is applied to a set of drugs associated with their side effects collected from the SIDER database. Terms describing side effects belong to the MedDRA terminology. The hierarchical clustering of about 1,200 MedDRA terms into an optimal collection of 112 term clusters leads to a reduced data representation. Two data mining experiments are then conducted to illustrate the advantage of using this reduced representation.

Paper Nr: 101
Title:

BIO-INSPIRED BAGS-OF-FEATURES FOR IMAGE CLASSIFICATION

Authors:

Wafa Bel Haj Ali, Eric Debreuve, Pierre Kornprobst and Michel Barlaud

Abstract: The challenge of image classification rests on two key elements: the image representation and the classification algorithm. In this paper, we revisit the topic of image representation. Classical descriptors such as Bags-of-Features are usually based on SIFT. We propose here an alternative based on bio-inspired features. This approach is inspired by a model of the retina, which acts as an image filter to detect local contrasts. We show the promising results obtained in natural scene classification with the proposed bio-inspired image representation.

Paper Nr: 102
Title:

A FRAMEWORK FOR STRUCTURED KNOWLEDGE EXTRACTION AND REPRESENTATION FROM NATURAL LANGUAGE THROUGH DEEP SENTENCE ANALYSIS

Authors:

Stefania Costantini, Niva Florio and Alessio Paolucci

Abstract: We present a framework that allows knowledge to be extracted from natural language sentences using a deep analysis technique based on linguistic dependencies. The extracted knowledge is represented in OOLOT, an intermediate format inspired by the Language of Thought (LOT) and based on Answer Set Programming (ASP). OOLOT uses an ontology-oriented lexicon and syntax. Finally, the knowledge can be exported to OWL and native ASP.

Paper Nr: 103
Title:

A SEMANTIC CLUSTERING APPROACH FOR INDEXING DOCUMENTS

Authors:

Daniel Osuna-Ontiveros, Ivan Lopez-Arevalo and Victor Sosa-Sosa

Abstract: Information retrieval (IR) models process documents to prepare them for search by humans or computers. In early models, the general idea was to apply lexico-syntactic processing to documents, where the importance of the documents retrieved by a query is based on the frequency of its terms in each document. Another approach is to return predefined documents based on the type of query the user makes. Recently, some researchers have combined text mining techniques to enhance document retrieval. This paper proposes a semantic clustering approach to improve traditional information retrieval models by representing the topics associated with documents. The proposal combines text mining algorithms and natural language processing. The approach does not use a priori queries; instead, it clusters terms, where each cluster is a set of related words according to the content of the documents. As a result, a document-topic matrix representation is obtained, denoting the importance of topics inside documents. For query processing, each query is represented as a set of clusters according to its terms. Thus, a similarity measure (e.g. cosine similarity) can be applied to this array and the document matrix to retrieve the most relevant documents.

Paper Nr: 109
Title:

GENE ONTOLOGY BASED SIMULATION FOR FEATURE SELECTION

Authors:

Christopher E. Gillies, Mohammad-Reza Siadat, Nilesh V. Patel and George Wilson

Abstract: There is increasing interest among researchers in techniques that incorporate prior biological knowledge into gene expression profile classifiers. Specifically, researchers are interested in how classification is affected when prior knowledge is incorporated into a classifier rather than using only the statistical properties of the dataset. In this paper, we investigate this impact through simulation. Our simulation relies on an algorithm that generates gene expression data from the Gene Ontology. Experiments comparing two classifiers, one trained using only statistical properties and one trained with a combination of statistical properties and Gene Ontology knowledge, are discussed. Experimental results suggest that incorporating Gene Ontology information improves classifier performance. In addition, we discuss how the distance between the means of the class distributions and the training sample size affect classification accuracy.

Paper Nr: 110
Title:

IMPACT OF FEATURE SELECTION AND FEATURE TYPES ON FINANCIAL STOCK PRICE PREDICTION

Authors:

Michael Hagenau, Michael Liebmann and Dirk Neumann

Abstract: In this paper, we examine whether stock price effects can be automatically predicted by analyzing unstructured textual information in financial news. Accordingly, we enhance existing text mining methods to evaluate the information content of financial news as an instrument for investment decisions. The main contribution of this paper is the use of more expressive features to represent text, through the employment of market feedback as part of our word selection process. In a comprehensive benchmark, we show that robust feature selection, combined with complex feature types, lifts classification accuracies significantly above previous approaches. This is because our approach selects only semantically relevant features and thus reduces the problem of over-fitting when applying a machine learning approach. The methodology can be transferred to any other application area providing textual information and corresponding effect data.

Paper Nr: 111
Title:

AUTOMATIC ESTIMATION OF THE LSA DIMENSION

Authors:

Jorge Fernandes, Andreia Artífice and Manuel J. Fonseca

Abstract: Nowadays, collections of information have reached considerable sizes, making it hard to find and explore a particular subject. One way to address this problem is text classification, where a theme or category is assigned to a text based on an analysis of its content. However, existing approaches to text classification require some effort and a high level of knowledge of the subject from users, making them inaccessible to the common user. Another problem with current approaches is that they are optimized for a specific problem and cannot easily be adapted to another context. In particular, unsupervised methods based on the LSA algorithm require users to define the dimension to use in the algorithm. In this paper we describe an approach that makes text classification more accessible to common users by providing a formula to estimate the LSA dimension based on the number of texts used during the bootstrapping process. Experimental results show that our formula for estimating the LSA dimension allows us to create unsupervised solutions that achieve results similar to supervised approaches.

Paper Nr: 118
Title:

AN IMPROVED METHOD TO SELECT CANDIDATES ON METRIC INDEX VP-TREE

Authors:

Masami Shishibori, Samuel Sangkon Lee and Kenji Kita

Abstract: For multimedia databases, efficient indexing is an important technique for fast access. Metric indexing methods can be applied with various distance measures other than the Euclidean distance, and thus have higher flexibility than multi-dimensional indexing methods. We focus on the Vantage Point tree (VP-tree), one of the metric indexing methods. The VP-tree is an efficient metric space indexing method; however, the number of distance calculations at leaf nodes tends to increase. In this paper, we propose an efficient algorithm to reduce the number of distance calculations at the leaf nodes of the VP-tree. The conventional VP-tree uses the triangle inequality at the leaf node in order to reduce the number of distance calculations, with the vantage point serving as the reference point of the triangle inequality. The proposed algorithm instead uses the nearest neighbor (NN) point of the query as the reference point. With this method, the selection range given by the triangle inequality becomes smaller, and the number of distance calculations at leaf nodes can be cut down. Since the NN point cannot be specified in advance, the method regards the nearest point to the query in the result buffer as a temporary NN point; if a nearer point is found during retrieval, the temporary NN point is replaced with the new one. Evaluation experiments using 10,000 image data items showed that the proposed method can cut 5%–12% of the search time of the conventional VP-tree.
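The leaf-level pruning idea described in this abstract can be sketched as follows. This is an illustrative outline of triangle-inequality pruning with a reference point, not the authors' implementation; the function name and data layout are hypothetical.

```python
import math

def prune_leaf(query, points, ref, ref_dists, best):
    """Scan a leaf bucket, skipping points that the triangle inequality
    proves are farther from the query than the current best distance.

    points    : candidate points stored in the leaf
    ref       : reference point (the temporary NN in the proposed method)
    ref_dists : precomputed d(ref, p) for each p in points
    best      : (best_dist, best_point) found so far
    """
    d_q_ref = math.dist(query, ref)
    best_dist, best_point = best
    for p, d_ref_p in zip(points, ref_dists):
        # Lower bound from the triangle inequality:
        # d(q, p) >= |d(q, ref) - d(ref, p)|
        if abs(d_q_ref - d_ref_p) > best_dist:
            continue  # pruned without computing d(q, p)
        d = math.dist(query, p)
        if d < best_dist:
            best_dist, best_point = d, p
    return best_dist, best_point
```

The closer the reference point is to the true nearest neighbour, the tighter the lower bound, which is why replacing the vantage point with a temporary NN point shrinks the selection range.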

Paper Nr: 119
Title:

INTEGRATION OF PROFILE IN OLAP SYSTEMS

Authors:

Rezoug Nachida, Omar Boussaid and Fahima Nader

Abstract: OLAP systems facilitate analysis by providing a multidimensional data space that decision makers explore interactively through a succession of OLAP operations. However, these systems are developed for a group of decision makers or for a "subject-oriented" topic analysis, and these users are presumed to have identical needs, which makes the systems unsuitable for particular uses. Personalization aims to take the user better into account. This paper first presents a summary of the work undertaken in this direction, with a comparative study. Secondly, we develop a search algorithm for class association rules between query types and users, to deduce the profile of a particular user or of a set of users in the same category. These rules are extracted from the log file of the OLAP server, using a variant of prediction and explanation algorithms. The resulting profiles form a knowledge base, which is used to automatically generate a rule base (ACE) for assigning weights to data warehouse attributes according to query type and user preferences. It will also deduce the best contextual sequence of queries for eventual use in a recommender system.

Paper Nr: 120
Title:

PEOPLE RETRIEVAL LEVERAGING TEXTUAL AND SOCIAL DATA

Authors:

Amin Mantrach and Jean-Michel Renders

Abstract: The growing importance of social media and heterogeneous relational data points to the fundamental problem of combining different sources of evidence (or modes) efficiently. In this work, we consider the problem of people retrieval, where the requested information consists of persons rather than documents. The processed queries generally contain both textual keywords and social links, while the target collection consists of a set of documents with social metadata. Traditional approaches tackle this problem by early or late fusion, where, typically, a person is represented by two sets of features: a word profile and a contact/link profile. Inspired by cross-modal similarity measures initially designed to combine image and text, we propose in this paper new ways of combining social and content aspects for retrieving people from a collection of documents with social metadata. To this aim, we define a set of multimodal similarity measures between socially-labelled documents and queries, which can then be aggregated at the person level to provide a final relevance score for the general people retrieval problem. We then examine particular instances of this problem: author retrieval, recipient recommendation and alias detection. For this purpose, experiments have been conducted on the ENRON email collection, showing the benefits of our proposed approach with respect to more standard fusion and aggregation methods.

Paper Nr: 122
Title:

SEMANTIC RELATIONSHIPS BETWEEN MULTIMEDIA RESOURCES

Authors:

Mohamed Kharrat, Anis Jedidi and Faiez Gargouri

Abstract: Existing systems and architectures generally do not provide a way to localize sub-parts of multimedia objects (e.g. sub-regions of images, persons, events) that represent the semantics of resources. In this paper, we describe semantic relationships between resources, and how to represent and create them via a contextual schema. In our main system, we use languages such as XQuery to query XML resources. Using this contextual schema, a previously hidden query result can be reused to answer a subsequent query.

Paper Nr: 127
Title:

INFERENTIAL MINING FOR RECONSTRUCTION OF 3D CELL STRUCTURES IN ATOMIC FORCE MICROSCOPY IMAGING

Authors:

Mario D'Acunto, Stefano Berrettini, Serena Danti, Michele Lisanti, Mario Petrini, Andrea Pietrabissa and Ovidio Salvetti

Abstract: Atomic Force Microscopy (AFM) is a fundamental tool for the investigation of a wide range of mechanical properties on the nanoscale, owing to the contact interaction between the AFM tip and the sample surface. The focus of this paper is an algorithm for the reconstruction of 3D stem-differentiated cell structures extracted from typical 2D surface AFM images. AFM image resolution is limited by the tip-sample convolution due to the combined geometry of the probe tip and the pattern configuration of the sample, which in turn limits the accuracy of the corresponding 3D image. To remove unwanted effects, we adopt an inferential method that pre-processes a single-frame AFM image (a low-resolution image), building its super-resolution version. The 3D reconstruction is then performed on animal cells using a Markov Random Field approach for augmented voxels. The 3D reconstruction should improve the unambiguous identification of cell structures. The computational method is fast and can be applied to both multi- and single-frame images.

Paper Nr: 129
Title:

TO AGGREGATE OR NOT TO AGGREGATE: THAT IS THE QUESTION

Authors:

Eric Paquet, Herna L. Viktor and Hongyu Guo

Abstract: Consider a scenario where one aims to learn models from data characterized by very large fluctuations that are attributable neither to noise nor to outliers. This may be the case, for instance, when examining supermarket ketchup sales, predicting earthquakes or conducting financial data analysis. In such a situation, the standard central limit theorem does not apply, since the associated Gaussian distribution exponentially suppresses large fluctuations. In this paper, we argue that, in many cases, this incorrect assumption leads to misleading and incorrect data mining results. We illustrate this argument on synthetic data, and show some results on stock market data.

Paper Nr: 136
Title:

KNOWPATS: PATTERNS OF DECLARATIVE KNOWLEDGE - Searching Frequent Knowledge Patterns about Object-orientation

Authors:

Peter Hubwieser and Andreas Mühling

Abstract: In order to better understand the structure of students’ knowledge in computer science, we are trying to identify patterns, in the form of frequently occurring subgraphs, in concept maps. Concept maps are an externalization of a person’s declarative knowledge represented as a graph. We propose an algorithm that can be employed to identify frequently occurring subgraphs, based on existing algorithms in that field. We are currently working on a project that will gather concept maps from a large group of freshmen in the coming semesters, providing us with extensive material for information mining about the structures of knowledge in CS. We hope to gain a better understanding of the relationship between knowledge and competence.

Paper Nr: 140
Title:

CONFIDENCE MEASURE FOR AUTOMATIC FACE RECOGNITION

Authors:

Ladislav Lenc and Pavel Král

Abstract: This paper deals with the use of a confidence measure for Automatic Face Recognition (AFR). AFR is realized by an adapted Kepenecki face recognition approach based on the Gabor wavelet transform. This work is motivated by the fact that the recognition rate obtained on the real-world corpus is only about 50%, which is not sufficient for our application, a system for automatic labelling of the photographs in a large database. The main goal of this work is thus to propose a post-processing of the classification result in order to remove “incorrectly” classified face images. We show that using a confidence measure to filter out incorrectly recognized faces is beneficial. Two confidence measures are proposed and evaluated on the Czech News Agency (ČTK) corpus. Experimental results confirm the benefit of using a confidence measure for the automatic face recognition task.

Posters
Paper Nr: 10
Title:

THE SPANISH WEB IN NUMBERS - Main Features of the Spanish Hidden Web

Authors:

Manuel Álvarez, Fidel Cacheda, Rafael López-García and Víctor M. Prieto

Abstract: This article presents a study of the web sites of the “.es” domains, focusing on the level of use of technologies that hinder the traversal of the Web by crawling systems. The study is centred on HTML scripts and forms, since they are two well-known entry points to the “Hidden Web”. In the case of scripts, it pays special attention to redirection and the dynamic construction of URLs. The article concludes that a crawler should process these technologies in order to obtain most of the documents of the Web.

Paper Nr: 11
Title:

DETECTING CORRELATIONS BETWEEN HOT DAYS IN NEWS FEEDS

Authors:

Raghvendra Mall, Nahil Jain and Vikram Pudi

Abstract: We use text mining mechanisms to analyze Hot days in news feeds. We build upon earlier work used to detect Hot topics and assume that we have already obtained the Hot days. In this paper we identify the most relevant documents of a topic on a Hot day, and construct a similarity-based technique for identifying and ranking these documents. Our aim is to automatically detect chains of hot correlated events over time. We develop a scheme using similarity measures such as cosine similarity and KL-divergence to find correlations between these Hot days. For the ‘U.S. Presidential Elections’, the presidential debates, which spanned over a week, were one such event.
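The two similarity measures this abstract names can be sketched for bag-of-words term counts as follows. This is a generic illustration, not the authors' code; the smoothing constant is an assumption.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) between term distributions derived from two
    Counters, with a small additive smoothing term (an assumption)."""
    vocab = set(p) | set(q)
    tp, tq = sum(p.values()), sum(q.values())
    return sum(
        (p.get(t, 0) / tp + eps)
        * math.log((p.get(t, 0) / tp + eps) / (q.get(t, 0) / tq + eps))
        for t in vocab
    )
```

Cosine similarity is symmetric and bounded in [0, 1] for count vectors, while KL-divergence is asymmetric, which matters when deciding which Hot day's distribution plays the role of the reference.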

Paper Nr: 20
Title:

APPLICATION OF AN ANT COLONY ALGORITHM - For Song Categorising using Metadata

Authors:

Nadia Lachetar and Halima Bahi

Abstract: Despite the expansion of information retrieval systems, the music information retrieval domain remains an open one. One of the promising areas in this context is the indexing of audio databases. This paper addresses the problem of indexing a database containing songs to enable its effective exploitation. Since we are interested in song databases, it is necessary to exploit the specific structure of a song, in which each part plays a specific role. We propose to use the title and artist particularities (in fact, each artist tends to compose or sing a specific genre of music). In this article, we present our experiments in automated song categorisation, where we suggest the use of an ant colony algorithm. A naive Bayes algorithm is used as a baseline in our tests.

Paper Nr: 24
Title:

XHITS: LEARNING TO RANK IN A HYPERLINKED STRUCTURE

Authors:

Francisco Benjamim Filho, Raúl Pierre Renteria and Ruy Luiz Milidiú

Abstract: The explosive growth and widespread accessibility of the Web have led to a surge of research activity in the area of information retrieval on the WWW. This is a huge and rich environment in which web pages can be viewed as a large community of elements connected through links for a variety of reasons. The HITS approach introduces two basic concepts, hubs and authorities, which reveal some hidden semantic information from the links. In this paper, we review XHITS, a generalization of HITS that expands the model from two to several concepts, and present a new machine learning algorithm to calibrate an XHITS model. The new learning algorithm uses latent feature concepts. Furthermore, we provide some illustrative examples and empirical tests. Our findings indicate that the new learning approach provides a more accurate XHITS model.

Paper Nr: 25
Title:

DISCOVERING RELATIONSHIP ASSOCIATIONS FROM THE LITERATURE RELATED TO RESEARCH PROJECTS IN SUSTAINABILITY SCIENCE USING ONTOLOGY AND INFERENCE

Authors:

Weisen Guo and Steven B. Kraines

Abstract: Research projects addressing issues related to sustainability often need knowledge from research papers in a wide range of disciplines. A method is developed and assessed for using ontology-based inference to automatically discover knowledge in semantic statements of research papers related to specific research projects in sustainability science. The semantic statements have been constructed using a semi-automatic authoring process to represent the knowledge content of the research papers. The discovered knowledge is expressed in the form of relationship associations extracted from semantic statements, where relationship associations are transitive associations between two binary, semantically typed relationships that share a connecting entity and co-occur frequently in the set of semantic statements. An algorithm is presented for finding interesting relationship associations that are extracted from research papers and related to a given research project. The method is evaluated on a set of 104 semantic statements describing research papers and 24 semantic statements describing research projects.

Paper Nr: 31
Title:

AN HYBRID APPROACH TO FEATURE SELECTION FOR MIXED CATEGORICAL AND CONTINUOUS DATA

Authors:

Gauthier Doquire and Michel Verleysen

Abstract: This paper proposes an algorithm for feature selection in the case of mixed data. It consists in ranking the categorical and the continuous features independently, before recombining them according to the accuracy of a classifier. The popular mutual information criterion is used in both ranking procedures. The proposed algorithm thus avoids the use of any similarity measure between samples described by both continuous and categorical attributes, which can be ill-suited to many real-world problems. It is able to effectively detect the most useful features of each type, and its effectiveness is experimentally demonstrated on four real-world data sets.
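The mutual-information ranking step for the categorical side can be sketched as follows. This is a minimal illustration using plug-in estimates on discrete data, not the authors' estimator (which must also handle continuous features); the function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) between two discrete
    sequences of equal length, in nats."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def rank_features(features, target):
    """Rank categorical features by mutual information with the class
    labels, highest first (the criterion used for ranking)."""
    scores = {name: mutual_information(col, target)
              for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A feature that perfectly predicts a balanced binary target scores ln 2 nats, while a constant feature scores zero, so the ranking places informative features first.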

Paper Nr: 35
Title:

QUERYING AND MINING SPATIOTEMPORAL ASSOCIATION RULES

Authors:

Hana Alouaoui, Sami Yassine Turki and Sami Faiz

Abstract: This paper presents an approach for mining spatiotemporal association rules. The proposed method is based on the computation of neighborhood relationships between geographic objects during a time interval. This kind of information is extracted from a spatiotemporal database by means of special mining queries enriched with time management parameters. The resulting spatiotemporal predicates are then processed by classical data mining tools in order to generate spatiotemporal association rules.

Paper Nr: 37
Title:

CLASSIFICATION OF DIALOGUE ACTS IN URDU MULTI-PARTY DISCOURSE

Authors:

Samira Shaikh, Tomek Strzalkowski and Nick Webb

Abstract: The classification of dialogue acts constitutes an integral part of various natural language processing applications. In this paper, we present an application of this task to Urdu-language online multi-party discourse. With language-specific modifications to established techniques, such as the permutation of word order in detected n-grams and variation of n-gram location, we developed an approach that is novel for this language. Preliminary performance results, when compared to the baseline, are very encouraging for this approach.

Paper Nr: 38
Title:

AN INNOVATIVE PROTOCOL FOR COMPARING PROTEIN BINDING SITES VIA ATOMIC GRID MAPS

Authors:

M. Bicego, A. D. Favia, P. Bisignano, A. Cavalli and V. Murino

Abstract: This paper deals with a novel computational approach that aims to measure the similarity of protein binding sites through the comparison of atomic grid maps. The assessment of structural similarity between proteins is a longstanding goal in biology and in structure-based drug design. Instead of focusing on standard structural alignment techniques, mostly based on the superposition of common structural elements, the proposed approach starts from a physicochemical description of the proteins’ binding sites, which we call atomic grid maps. These maps are preprocessed to reduce the dimensionality of the data while retaining the relevant information. Then, we devise an alignment-based similarity measure built on a rigid registration algorithm (the Iterative Closest Point, ICP). The proposed approach, tested on a real dataset of 22 proteins, has shown encouraging results in comparison with standard procedures.

Paper Nr: 42
Title:

METHODS FOR DISCOVERING AND ANALYSIS OF REGULARITIES SYSTEMS - Approach based on Optimal Partitioning of Explanatory Variables Space

Authors:

Senko Oleg and Kuznetsova Anna

Abstract: The goal of the discussed Optimal Valid Partitioning (OVP) method is to discover regularities describing the effect of explanatory variables on an outcome value. The OVP method is based on searching for partitions of the explanatory variables space with the best possible separation of objects with different levels of the outcome variable. Optimal partitions are searched for inside several previously defined families using empirical (training) datasets. Random permutation tests are used to assess statistical validity and to optimize the complexity of the models used. Additional mathematical tools aimed at improving the performance of the OVP approach are discussed, including methods for evaluating the structure of the found systems of regularities and for estimating the importance of explanatory variables. The paper also presents a variant of the OVP technique that allows comparing the effects of explanatory variables on the outcome in different groups of objects.

Paper Nr: 52
Title:

AN APPROACH FOR COMBINING SEMANTIC INFORMATION AND PROXIMITY INFORMATION FOR TEXT SUMMARIZATION

Authors:

Hogyeong Jeong and Yeogirl Yun

Abstract: This paper develops and evaluates an approach for combining semantic information with proximity information for text summarization. The approach is based on the proximity language model (PLM), which incorporates proximity information into the unigram language model. This paper extends the proximity language model to also incorporate semantic information using latent semantic analysis (LSA). We argue that this approach achieves a good balance between syntactic and semantic information. We evaluate the approach using ROUGE scores on the Text Analysis Conference (TAC) 2009 Summarization task, and find that incorporating LSA into the PLM gives improvements over the baseline models.

Paper Nr: 57
Title:

AUTOMATED APPROACH FOR WHOLE BRAIN INFARCTION CORE DELINEATION - Using Non-contrast and Computed Tomography Angiography

Authors:

Petr Maule, Jana Klečková and Vladimír Rohan

Abstract: This article proposes an automated approach for whole brain infarction core delineation using only non-contrast computed tomography and computed tomography angiography. The main aim is to provide additional information by measuring the infarction core volume, since exceeding a certain level is a contraindication to early recanalization. The process of generating Perfusion Blood Volume maps is described first, followed by a description of the infarction core delineation process. Verification of correctness is based on comparison against follow-up examinations. The discussion and future work summarize the weaknesses of the method and steps for improvement.

Paper Nr: 67
Title:

COMPARISON OF GREEDY ALGORITHMS FOR DECISION TREE CONSTRUCTION

Authors:

Abdulaziz Alkhalid, Igor Chikalov and Mikhail Moshkov

Abstract: The paper compares different heuristics used by greedy algorithms for constructing decision trees. An exact learning problem with all-discrete attributes is considered, which assumes the absence of contradictions in the decision table. Reference decision tables are based on 24 data sets from the UCI Machine Learning Repository (Frank and Asuncion, 2010). The complexity of decision trees is estimated relative to several cost functions: depth, average depth, and number of nodes. The costs of trees built by greedy algorithms are compared with exact minimums calculated by an algorithm based on dynamic programming. The results associate with each cost function a set of potentially good heuristics that minimize it.

Paper Nr: 72
Title:

NETWORK CLUSTERING BY ADVANCED LABEL PROPAGATION ALGORITHM

Authors:

Krista Rizman Žalik and Borut Žalik

Abstract: Real-time community detection is enabled by the recently proposed linear-time (O(m) on a network with m edges) label propagation algorithm (LPA). LPA finds only local maxima in the modularity space. To escape local maxima, we propose LPA*, which propagates the label of the neighbour node having the most common neighbours when multiple neighbour labels are equally frequent, and uses a multistep retry of propagating each neighbour label when multiple neighbour labels remain equally frequent in two successive iterations. Experiments show that LPA* detects communities with high modularity values. LPA* propagation is more stable and improves the detection of natural communities while retaining the high scalability and simplicity of label propagation.
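The baseline label propagation step that LPA* builds on can be sketched as follows. This shows basic LPA with random tie-breaking (the point LPA* refines with its most-common-neighbours rule), and is not the authors' code; the adjacency-dict layout is an assumption.

```python
import random
from collections import Counter

def label_propagation(adj, max_iter=100, seed=0):
    """Basic label propagation (LPA): each node repeatedly adopts the
    most frequent label among its neighbours until labels stabilise.

    adj : dict mapping node -> list of neighbour nodes
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}  # every node starts in its own community
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)        # asynchronous updates in random order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            freq = Counter(labels[u] for u in adj[v])
            best = max(freq.values())
            # Ties broken at random in basic LPA; LPA* instead prefers
            # the label of the neighbour sharing the most neighbours.
            candidates = [lab for lab, c in freq.items() if c == best]
            new = rng.choice(candidates)
            if new != labels[v]:
                labels[v] = new
                changed = True
        if not changed:
            break
    return labels
```

Each sweep costs O(m), which is where the linear running time quoted in the abstract comes from; the random tie-breaking is exactly the source of the instability that LPA* addresses.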

Paper Nr: 75
Title:

SUITABILITY OF A GENETIC ALGORITHM FOR ROAD TRAFFIC NETWORK DIVISION

Authors:

Tomas Potuzak

Abstract: In this paper, the suitability of a genetic algorithm as part of a method for dividing road traffic networks is discussed. The division of a traffic network is necessary when adapting a road traffic simulation for a distributed computing environment. Such an environment makes it possible to perform detailed simulations of large traffic networks (e.g., entire cities and larger) in a reasonable time. Genetic algorithms are considered since they are often employed in both graph partitioning and multi-objective optimization problems, which are closely associated with the problem of road traffic network division.

Paper Nr: 76
Title:

LEARNING NEIGHBOURHOOD-BASED COLLABORATIVE FILTERING PARAMETERS

Authors:

J. Griffith, C. O'Riordan and H. Sorensen

Abstract: The work outlined in this paper uses a genetic algorithm to learn the optimal set of parameters for a neighbourhood-based collaborative filtering approach. The motivation is firstly to re-assess whether the default parameter values often used are valid and secondly to assess whether different datasets require different parameter settings. Three datasets are considered in this initial investigation into the approach: Movielens, Bookcrossing and Lastfm.

Paper Nr: 83
Title:

DISCOVERY OF MEETING-PARTICLE LINKS AND THEIR APPLICATION TO MEETING RECOLLECTION SUPPORT

Authors:

Ishitoya Kentaro, Ohira Shigeki and Nagao Katashi

Abstract: To make regularly held meetings more efficient, it is important to consider past discussions and recall them in the current discussion. We previously developed a meeting recording system that creates discussion content from casual meetings on the basis of digital whiteboards. In this paper, we describe a discussion tool that enables the content of past discussions to be structured, retrieved, and reused in future meetings. We developed an application system to visualize the context relating different meetings by discovering links between meeting particles, which are fragments of meeting content (e.g., text, images, and sketches). This visualization enables participants to recollect past discussions in the current discussion.

Paper Nr: 90
Title:

LINGUISTIC ENGINEERING AND ITS APPLICABILITY TO BUSINESS INTELLIGENCE - Towards an Integrated Framework

Authors:

S. F. J. Otten and M. R. Spruit

Abstract: This paper investigates how linguistic techniques applied to unstructured text data can contribute to business intelligence processes. Through a literature study covering 99 relevant papers, we identified key business intelligence techniques such as text mining, social mining and opinion mining. The Linguistic Engineering for Business Intelligence (LEBI) framework incorporates these techniques and can be used as a guide or reference for combining techniques on unstructured and structured data.

Paper Nr: 92
Title:

FEATURE DISCRETIZATION AND SELECTION IN MICROARRAY DATA

Authors:

Artur Ferreira and Mário Figueiredo

Abstract: Tumor and cancer detection from microarray data are important bioinformatics problems. These problems are quite challenging for machine learning methods, since microarray datasets typically have a very large number of features and a small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. This paper proposes unsupervised feature discretization and selection methods suited for microarray data. The reported experimental results, conducted on public domain microarray datasets, show that the proposed discretization and selection techniques yield results that are competitive and promising compared with the best previous approaches. Moreover, the proposed methods efficiently handle multi-class microarray data.

Paper Nr: 108
Title:

EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD

Authors:

Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert and Gholamreza Nakhaeizadeh

Abstract: Extracting the main content of web documents with high accuracy is an important challenge for researchers working on the web. In this paper, we present a novel language-independent method for extracting the main content of web pages. Our method, called DANAg, achieves high performance in terms of both effectiveness and efficiency in comparison with other main content extraction approaches. The extraction process of DANAg is divided into four phases. In the first phase, we calculate the length of content and code in fixed segments of an HTML file. The second phase applies a naive smoothing method to highlight the segments forming the main content. After that, we use a simple algorithm to recognize the boundary of the main content in the HTML file. Finally, we feed the selected main content area to our parser in order to extract the main content of the targeted web page.
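The four phases can be sketched on a toy page. The segment length, smoothing window, and threshold below are illustrative choices rather than DANAg's actual parameters, and the tag detection is deliberately naive.

```python
def content_lengths(html, seg_len=50):
    """Phase 1: per-segment count of content characters, i.e. characters
    outside HTML tags (a simplified content/code profile)."""
    inside, flags = False, []
    for ch in html:
        if ch == '<':
            inside = True
        flags.append(not inside)
        if ch == '>':
            inside = False
    return [sum(flags[i:i + seg_len]) for i in range(0, len(flags), seg_len)]

def smooth(xs, k=1):
    """Phase 2: naive moving-average smoothing with window 2k+1."""
    return [sum(xs[max(0, i - k):i + k + 1]) / len(xs[max(0, i - k):i + k + 1])
            for i in range(len(xs))]

def main_span(xs, thresh):
    """Phase 3: boundary detection as the longest contiguous run of
    segments at or above the threshold."""
    best, start = (0, 0), None
    for i, v in enumerate(xs + [float('-inf')]):   # sentinel closes last run
        if v >= thresh:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

html = ("<div><a href='x'>n</a></div>" + "Lorem ipsum " * 30
        + "<div><a href='y'>m</a></div>")
profile = smooth(content_lengths(html))
a, b = main_span(profile, max(profile) / 2)
main = html[a * 50:b * 50]   # phase 4 would hand this region to a parser
```

The navigation blocks at both ends score low on the content/code profile, so the detected span covers the text body.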

Paper Nr: 126
Title:

GENETIC PROGRAMMING WITH EMBEDDED FEATURES OF SYMBOLIC COMPUTATIONS

Authors:

Yaroslav V. Borcheninov and Yuri S. Okulovsky

Abstract: Genetic programming is a methodology widely used in data mining for obtaining an analytic form that describes a given experimental data set. In some cases, genetic programming is complemented by symbolic computations that simplify the found expressions. We propose to unify the induction of genetic programming with the deduction of symbolic computations in one genetic algorithm. Our approach was implemented as a .NET library and successfully tested on various data mining problems: function approximation, invariant finding and classification.

Paper Nr: 128
Title:

ENHANCING CLUSTERING NETWORK PLANNING ALGORITHM IN THE PRESENCE OF OBSTACLES

Authors:

Lamia Fattouh Ibrahim

Abstract: Clustering in spatial data mining groups similar objects based on their distance, connectivity, or relative density in space. In the real world, there exist many physical obstacles such as rivers, lakes, highways and mountains, and their presence may affect the result of clustering substantially. With existing telephone networks nearing saturation and demand for wired and wireless services continuing to grow, telecommunication engineers are looking at technologies that can deliver sites satisfying the required demand and grade-of-service constraints while achieving the minimum possible cost. In this paper, we study the problem of clustering in the presence of obstacles in order to solve the network planning problem. The COD-DBSCAN algorithm (Clustering with Obstructed Distance - Density-Based Spatial Clustering of Applications with Noise) is developed in the spirit of the DBSCAN clustering algorithms. We also study the problem of determining the placement of Multi Service Access Nodes (MSAN) in the presence of obstacles, in areas with many mountains such as Saudi Arabia. The algorithm is a density-based clustering algorithm that uses a BSP-tree and a visibility graph to calculate obstructed distances. Experimental results and analysis indicate that the COD-DBSCAN algorithm is both efficient and effective.
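Once a visibility graph is available, the obstructed distance between two points is a shortest path over that graph; a minimal sketch, assuming the graph is already built (constructing it from the BSP-tree and obstacle polygons is the expensive part and is omitted here):

```python
import heapq
import math

def obstructed_distance(vis_edges, src, dst):
    """Dijkstra over a visibility graph. vis_edges maps each point to
    {neighbour: straight-line length} for point pairs whose connecting
    segment is not blocked by an obstacle."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue                     # stale queue entry
        for v, w in vis_edges.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return math.inf                      # dst unreachable

# A and B see only the obstacle corner C, so the path detours through it
vis = {'A': {'C': 5.0}, 'C': {'A': 5.0, 'B': 5.0}, 'B': {'C': 5.0}}
d_ab = obstructed_distance(vis, 'A', 'B')
```

DBSCAN's neighbourhood queries then use this obstructed distance in place of the straight-line distance.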

Paper Nr: 133
Title:

INFORMATION RETRIEVAL IN THE SERVICE OF GENERATING NARRATIVE EXPLANATION - What we Want from GALLURA

Authors:

Ephraim Nissan and Yaakov HaCohen-Kerner

Abstract: Information retrieval (IR) and, all the more so, knowledge discovery (KD), do not exist in isolation: it is necessary to consider the architectural context in which they are invoked in order to fulfil given kinds of tasks. This paper discusses a retrieval-intensive context of use, whose intended output is the generation of narrative explanations in a non-bona-fide, entertainment mode subject to heavy intertextuality and strictly constrained by culture-bound poetic conventions. The GALLURA project, now in the design phase, has a multiagent architecture whose modules thoroughly require IR in order to solve specialist subtasks. By their very nature, such subtasks are best subserved by efficient IR as well as mining capabilities within large textual corpora, or networks of signifiers and lexical concepts, as well as databases of narrative themes, motifs and tale types. The state of the art in AI, NLP, story-generation, computational humour, along with IR and KD, as well as the lessons of the DARSHAN project in a domain closely related to GALLURA’s, make the latter’s goals feasible in principle.

Paper Nr: 134
Title:

CONTEXT DIMENSIONALITY REDUCTION FOR MOBILE PERSONAL INFORMATION ACCESS

Authors:

Andreas Komninos, Athanasios Plessas, Vassilios Stefanis and John Garofalakis

Abstract: We propose an application of the Fastmap algorithm that could provide a breakthrough in the efforts to present mobile personal information to the user in context, and describe our vision for context-driven interfaces generated by this method that will support the richness of data stored in personal devices.
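The core Fastmap step projects objects onto an axis defined by two pivot objects using pairwise distances only. A minimal sketch of that step (pivot selection and the residual-distance recursion that yields further axes are omitted):

```python
import math

def fastmap_coordinate(objects, dist, a, b):
    """Coordinate of each object on the line through pivots a and b,
    computed from distances alone (a and b must be distinct objects)."""
    d_ab = dist(a, b)
    return {o: (dist(a, o) ** 2 + d_ab ** 2 - dist(b, o) ** 2) / (2 * d_ab)
            for o in objects}

# toy example: 2-D points with Euclidean distance
points = {'a': (0, 0), 'b': (4, 0), 'p': (2, 0), 'q': (0, 3)}
euclid = lambda u, v: math.dist(points[u], points[v])
coords = fastmap_coordinate(points, euclid, 'a', 'b')
```

Because the formula needs only a distance function, the same step applies to contact lists, call logs, or other personal-information items once a similarity measure between them is defined.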

Paper Nr: 139
Title:

CLUSTERING OF HETEROGENEOUSLY TYPED DATA WITH SOFT COMPUTING

Authors:

Angel Kuri-Morales, Luis Enrique Cortes-Berrueco and Daniel Trejo-Baños

Abstract: The problem of finding clusters in arbitrary sets of data has been attempted using different approaches. In most cases, the use of metrics to determine the adequacy of the said clusters is assumed. That is, the criteria yielding a measure of quality of the clusters depend on the distance between the elements of each cluster. Typically, one considers a cluster to be adequately characterized if the elements within a cluster are close to one another while, simultaneously, they appear to be far from those of different clusters. This intuitive approach fails if the variables of the elements of a cluster are not amenable to distance measurements, i.e., if the vectors of such elements cannot be quantified. This case arises frequently in real-world applications where several variables correspond to categories. The usual tendency is to assign arbitrary numbers to every category: to encode the categories. This, however, may result in spurious patterns: relationships between the variables which are not really there at the outset. It is evident that there is no truly valid assignment which may ensure a universally valid numerical value for this kind of variable. But there is a strategy which guarantees that the encoding will, in general, not bias the results. In this paper we explore such a strategy. We discuss the theoretical foundations of our approach and prove that it is the best strategy in terms of the statistical behaviour of the sampled data. We also show that, when applied to a complex real-world problem, it allows us to generalize soft computing methods to find the number and characteristics of a set of clusters.

Paper Nr: 141
Title:

HANDLING TABBING AND BACKWARD REFERENCES FOR PREDICTIVE WEB USAGE MINING

Authors:

Geoffray Bonnin, Armelle Brun and Anne Boyer

Abstract: Among the many functionalities offered by current web browsers, two are critical when modeling users' behavior: the back button and tabs. Because usual logs do not contain explicit information about the actions performed with these functionalities, retrieving which actions users actually performed is a real challenge, and accurately modeling users' behavior on the web is becoming very difficult. In this paper, we propose a strategy that uses radicals to take into account tabs as well as the back button in predictive web usage modeling.

Paper Nr: 142
Title:

A STUDY OF PROFIT MINING

Authors:

Yuh-Long Hsieh and Don-Lin Yang

Abstract: In the past decade, association rule mining has been used extensively to discover interesting rules from large databases. However, most of the produced results do not satisfy investors in the financial market. The reason is that association rule mining simply uses confidence and support to select interesting patterns, while investors are more interested in the result: trading at high profit and low risk. We propose a novel approach called Profit Mining, which provides investors with trading rules that include information about profit, risk, and win rate. To show the feasibility and usefulness of our proposal, we use a simple trading model in an inter-day trading simulation. This mining approach works well not only in the stock market, but also in futures and other markets.

Paper Nr: 144
Title:

SEMANTIC GRAPHS AND ARC CONSISTENCY CHECKING - The Renewal of an Old Approach for Information Extraction from Images

Authors:

Aline Deruyver and Yann Hodé

Abstract: The aim of this paper is to show that symbolic computation based on constraint satisfaction can be useful for information extraction from images. It presents how some limitations of this approach have been overcome by the development of new conceptual tools: arc-consistency with bilevel constraints, weak arc-consistency, a system of complex qualitative spatial relations. The application of these tools to images of various domains (medical images, high resolution satellite images) shows its effectivity.