KDIR 2013 Abstracts


Full Papers
Paper Nr: 2
Title:

Mapping Text Mining Taxonomies

Authors:

Katja Pfeifer and Eric Peukert

Abstract: Huge amounts of textual information relevant for market analysis, trend analysis or product monitoring can be found on the Web. To make use of that information, a number of text mining services have been proposed that extract and categorize entities from given text. Such services have individual strengths and weaknesses, so merging results from multiple services can improve quality. To merge results, mappings between service taxonomies are needed, since different taxonomies are used for categorizing extracted information. The mappings can potentially be computed by ontology matching systems. However, the metadata available within most taxonomies is weak, so ontology matching systems currently return insufficient results. In this paper we propose a novel approach to enrich service taxonomies with instance information, which is crucial for finding mappings. Based on the found instances, we present a novel instance-based matching technique and metric that allow us to automatically identify equal, hierarchical and associative mappings. These mappings can be used for merging the results of multiple extraction services. We broadly evaluate our matching approach on real-world service taxonomies and compare it to state-of-the-art approaches.

Paper Nr: 24
Title:

Data Clustering Validation using Constraints

Authors:

João M. N. Duarte, Ana L. N. Fred and F. Jorge F. Duarte

Abstract: Much attention has been given to the incorporation of constraints into data clustering, mainly expressed in the form of must-link and cannot-link constraints between pairs of domain objects. However, their inclusion in the important clustering validation process has so far been disregarded. In this work, we integrate the use of constraints into clustering validation. We propose three approaches to accomplish this: producing a weighted validity score that combines a traditional validity index with the constraint satisfaction ratio; learning a new distance function or feature space representation that better suits the constraints, and using it with a validation index; and a combination of the previous two. Experimental results on 14 synthetic and real data sets show that including the information provided by the constraints improves the performance of the clustering validation process in selecting the best number of clusters.
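The first of the three approaches above, blending a traditional validity index with the constraint satisfaction ratio, can be sketched as follows; the linear weighting scheme and the choice of index are illustrative assumptions, not the paper's exact formulation:

```python
def constraint_satisfaction_ratio(labels, must_link, cannot_link):
    """Fraction of pairwise constraints satisfied by a clustering.

    `labels[i]` is the cluster assigned to object i; `must_link` and
    `cannot_link` are lists of index pairs.
    """
    satisfied = sum(1 for i, j in must_link if labels[i] == labels[j])
    satisfied += sum(1 for i, j in cannot_link if labels[i] != labels[j])
    total = len(must_link) + len(cannot_link)
    return satisfied / total if total else 1.0


def weighted_validity(index_score, labels, must_link, cannot_link, w=0.5):
    """Blend the value of any traditional validity index (e.g. a
    silhouette score) with the constraint satisfaction ratio."""
    csr = constraint_satisfaction_ratio(labels, must_link, cannot_link)
    return w * index_score + (1 - w) * csr
```

Sweeping the candidate numbers of clusters and keeping the one with the best `weighted_validity` value is then the model-selection loop the abstract describes.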

Paper Nr: 33
Title:

Using Conditional Random Fields with Constraints to Train Support Vector Machines - Locating and Parsing Bibliographic References

Authors:

Sebastian Lindner

Abstract: This paper shows how bibliographic references can be located in HTML documents and then separated into fields. First, it is demonstrated how Conditional Random Fields (CRFs) with constraints and prior knowledge about the bibliographic domain can be used to split bibliographic references into fields, e.g. authors and title, when only a few labeled training instances are available. For this purpose, an algorithm for automatic keyword extraction and a unique set of features and constraints are introduced. The features and output of this Conditional Random Field (CRF) for tagging bibliographic references, Part-Of-Speech (POS) analysis and Named Entity Recognition (NER) are then used to find the bibliographic reference section in an article. First, the HTML document is separated into blocks of consecutive inline elements. Then we compare a machine learning approach using a Support Vector Machine (SVM) with one using a CRF for the reference locating process. In contrast to other reference locating approaches, our method can even cope with single reference entries in a document or with multiple reference sections. We show that our reference location process achieves very good results, while the reference tagging approach is able to compete with other state-of-the-art approaches and sometimes even outperforms them.

Paper Nr: 34
Title:

Semantic-based Knowledge Discovery in Biomedical Literature

Authors:

Fatiha Boubekeur, Sabrina Cherdioui and Yassine Djouadi

Abstract: .

Paper Nr: 37
Title:

Semi-supervised Clustering with Example Clusters

Authors:

Celine Vens, Bart Verstrynge and Hendrik Blockeel

Abstract: We consider the following problem: Given a set of data and one or more examples of clusters, find a clustering of the whole data set that is consistent with the given clusters. This is essentially a semi-supervised clustering problem, but different from those that have been studied until now. We argue that it occurs frequently in practice, but despite this, none of the existing methods can handle it well. We present a new method that specifically targets this type of problem. We show that the method works better than standard methods and identify opportunities for further improvement.

Paper Nr: 43
Title:

Semantic Collaborative Filtering for Learning Objects Recommendation

Authors:

Lamia Berkani and Omar Nouali

Abstract: This paper proposes a personalized recommendation approach for learning objects (LOs) within an online Community of Practice (CoP). Three recommendation strategies have been proposed: (1) semantic filtering (SemF) based on members’ interests; (2) collaborative filtering (CF) based on the member’s expertise level; and (3) semantic collaborative filtering combining the two approaches in different ways. The expertise level of a member is calculated over all of their domains of expertise using the domain knowledge ontology (DKOnto). A similarity measure is proposed based on a set of rules which cover all possible cases for the relative positions of two domains in DKOnto. To illustrate our work, some preliminary experimental results are presented.

Short Papers
Paper Nr: 4
Title:

Knowledge Presentation based on Multi-dimension Model for Measuring Planning in Digital Manufacturing

Authors:

Xiaoqing Tang and Zhehan Chen

Abstract: Digital measurement technology has been widely employed in product manufacturing processes. In general, a measuring process is planned based on human knowledge of planning strategies, measuring regulations, devices and instruments, measuring operations and historical data. Knowledge-supported measurement planning gives the process a well-defined structure and enables manufacturers to improve product quality and reduce manufacturing cost. Therefore, accumulating, representing and modeling the measuring data, information and expert knowledge from engineering sectors, which provide a foundation for discovering and reusing measuring process knowledge, are crucial for planning and optimizing a measurement plan. In order to improve measurement plans based on expert knowledge, a general measurement space (GMS) model of the measuring process is proposed. The model uses attributes in three dimensions to describe and classify multi-source and heterogeneous knowledge in the measuring process. The methodology for integrating and expressing measuring process knowledge is then discussed, in order to support the storage, management and analysis of structured knowledge data by programs. Finally, the GMS characteristics matrix is constructed, providing a feasible way to evaluate measurement plans based on measuring process knowledge.

Paper Nr: 15
Title:

Personalized Recommendation and Explanation by using Keyphrases Automatically extracted from Scientific Literature

Authors:

Dario De Nart, Carlo Tasso and Felice Ferrara

Abstract: Recommender systems are commonly used for discovering potentially relevant papers in huge collections of scientific documents. In this paper we propose a concept-based recommender system where relevant concepts are automatically extracted from scientific resources in order to both model user interests and generate recommendations. Unlike other work in the literature, our concept-based recommender system does not depend on specific domain ontologies; instead, it is based on an unsupervised, domain-independent keyphrase extraction algorithm that identifies the relevant concepts included in a scientific paper. This semantic-oriented approach allows users to easily inspect and modify their user model and effectively justifies the proposed recommendations by showing the main concepts included in the suggested papers.

Paper Nr: 16
Title:

Using Domain Knowledge in Association Rules Mining - Case Study

Authors:

Jan Rauch and Milan Šimůnek

Abstract: A case study of an approach to applying domain knowledge in association rule mining is presented. Association rules are understood as general relations between two general Boolean attributes derived from columns of an analysed data matrix. Interesting items of domain knowledge are expressed in an intuitive form distinct from association rules. Each particular pattern of domain knowledge is mapped onto the set of all association rules which can be considered its consequences. These sets are used when interpreting the results of the data mining procedure. Deduction rules concerning association rules are applied.

Paper Nr: 17
Title:

The Disulfide Connectivity Prediction with Support Vector Machine and Behavior Knowledge Space

Authors:

Hong-Yu Chen, Chang-Biau Yang, Kuo-Tsung Tseng and Chiou-Yi Hor

Abstract: A disulfide bond, formed by two oxidized cysteines, plays an important role in protein folding and structure stability, and it may regulate protein functions. The disulfide connectivity prediction problem is to reveal the correct disulfide connectivity in a target protein. It is difficult because the number of possible patterns grows rapidly with the number of cysteines. In this paper, we discover some rules for discriminating patterns with high accuracy across various methods. Then, we propose pattern-wise and pair-wise BKS (behavior knowledge space) methods to fuse multiple classifiers constructed by SVM (support vector machine) methods. Furthermore, we combine the CSP (cysteine separation profile) method to form our hybrid method. The prediction accuracy of our hybrid method on the SP39 dataset with 4-fold cross-validation increases to 69.1%, which is better than the best previous result of 65.9%.
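The BKS fusion idea mentioned above can be illustrated generically: a behavior knowledge space is a lookup table mapping each combination of base-classifier outputs to the most frequent true label observed for that combination during training. This is a textbook BKS sketch, not the authors' pattern-wise or pair-wise variants:

```python
from collections import Counter, defaultdict


def train_bks(classifier_outputs, truths):
    """Build a Behavior Knowledge Space table: for every observed
    combination of base-classifier outputs, store the most frequent
    true label seen for that combination during training."""
    cells = defaultdict(Counter)
    for outputs, truth in zip(classifier_outputs, truths):
        cells[tuple(outputs)][truth] += 1
    return {cell: counts.most_common(1)[0][0] for cell, counts in cells.items()}


def bks_fuse(table, outputs, default=None):
    """Fuse the base classifiers' outputs via the BKS table, falling
    back to a default label for combinations never seen in training."""
    return table.get(tuple(outputs), default)
```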

Paper Nr: 25
Title:

A Multiple Instance Learning Approach to Image Annotation with Saliency Map

Authors:

Tran Phuong Nhung, Cam-Tu Nguyen, Jinhee Chun, Ha Vu Le and Takeshi Tokuyama

Abstract: This paper presents a novel approach to image annotation based on multi-instance learning (MIL) and saliency maps. Image annotation is the automatic process of assigning labels to images so as to enable semantic retrieval of images. This problem is often ambiguous, as a label is given to the whole image while it may correspond only to a small region of the image. As a result, MIL methods are suitable for resolving these ambiguities during learning. On the other hand, saliency detection aims at detecting foreground/background regions in images. Once we obtain this information, labels and image regions can be aligned better, i.e., foreground labels (background labels) are more sensitive to foreground areas (background areas). Our proposed method, which is based on an ensemble of MIL classifiers from two views (background/foreground), improves annotation performance in comparison to baseline methods that do not exploit saliency information.

Paper Nr: 48
Title:

Automatic Image Annotation with Low-level Features and Conditional Random Fields

Authors:

Alexandra Melnichenko and Andrey Bronevich

Abstract: This work is devoted to the problem of automatic image annotation by analyzing image low-level characteristics. The problem consists in assigning words of a natural language to an arbitrary image by analyzing its low-level characteristics, without any additional information. Automatic image annotation can be useful for extracting high-level semantic information from images, organizing huge image collections and performing search by text query. We propose a general annotation scheme consisting of three stages. First, we extract several types of low-level features from the images. After that, we compute secondary features using a clustering technique. The annotation is finally produced by applying a Conditional Random Field to the secondary features.

Paper Nr: 53
Title:

Text Simplification for Enhanced Readability

Authors:

Siddhartha Banerjee, Nitin Kumar and C. E. Veni Madhavan

Abstract: Our goal is to perform automatic simplification of a piece of text to enhance readability. We combine the two processes of summarization and simplification on the input text to effect improvement. We mimic the human acts of incremental learning and progressive refinement. The steps are based on the following: (i) phrasal units in the parse tree yield clues (handles) for paraphrasing at a local word/phrase level for simplification; (ii) phrasal units also provide the means for extracting segments of a prototype summary; (iii) dependency lists provide the coherence structures for refining a prototypical summary. Validation and evaluation of a paraphrased text can be carried out by two methodologies: (a) standardized systems of readability, precision and recall measures, and (b) human assessments. Our position is that combined paraphrasing as above, both at the lexical (word or phrase) level and at a linguistic-semantic (parse tree, dependencies) level, leads to better readability scores than either approach performed separately.

Posters
Paper Nr: 7
Title:

Feature Selection by Rank Aggregation and Genetic Algorithms

Authors:

Waad Bouaguel, Afef Ben Brahim and Mohamed Limam

Abstract: Feature selection consists in selecting relevant features in order to focus the learning search. A simple and efficient setting for feature selection is to rank the features with respect to their relevance. When several rankers are applied to the same data set, their outputs often differ. Combining the preference lists of these individual rankers into a single, better ranking is known as rank aggregation. In this study, we develop a method to combine a set of ordered lists of features based on an optimization function and a genetic algorithm. We compare the performance of the proposed approach to that of well-known methods. Experiments show that our algorithm improves prediction accuracy compared to single feature selection algorithms and traditional rank aggregation techniques.
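As a point of reference for the aggregation step, a classic baseline is Borda-count aggregation of the individual rankers' lists; the paper instead searches for the aggregate with a genetic algorithm over an optimization function, which this sketch deliberately omits:

```python
def borda_aggregate(rankings):
    """Combine several ranked feature lists (best first) into one
    consensus ranking: each feature earns (n - position) points in
    each list of length n, and features are sorted by total score."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, feature in enumerate(ranking):
            scores[feature] = scores.get(feature, 0) + (n - pos)
    return sorted(scores, key=scores.get, reverse=True)
```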

Paper Nr: 8
Title:

An Approach based on Adaptive Decision Tree for Land Cover Change Prediction in Satellite Images

Authors:

Ahlem Ferchichi, Wadii Boulila and Riadh Farah

Abstract: Decision tree (DT) prediction algorithms have significant potential for remote sensing data prediction. This paper presents an advanced approach for land-cover change prediction in remote-sensing imagery. Several methods for decision tree change prediction have been considered: probabilistic DT, belief DT, fuzzy DT, and possibilistic DT. The aim of this study is to provide an approach based on an adaptive DT to predict land cover changes while taking into account several types of imperfection in satellite images, such as uncertainty, imprecision, vagueness, conflict and ambiguity. The proposed approach applies an artificial neural network (ANN) model to choose the appropriate gain formula to apply at each DT node. The approach is validated using satellite images of the Saint-Paul region, a commune of Reunion Island. Results show the good performance of the proposed framework in predicting change for the urban zone.

Paper Nr: 14
Title:

Toward a Neural Aggregated Search Model for Semi-structured Documents

Authors:

F. Z. Bessai-Mechmache

Abstract: One of the main issues in aggregated search for XML documents is selecting the elements relevant to an information need. Our objective is to gather into the same aggregate relevant elements that may belong to different parts of an XML document and that are semantically related. To do this, we propose a neural aggregated search model using Kohonen self-organizing maps. The Kohonen self-organizing map enables the classification of XML elements, producing the density maps that form the foundation of our model.

Paper Nr: 18
Title:

A Cognitive Reference based Model for Learning Compositional Hierarchies with Whole-composite Tags

Authors:

Anshuman Saxena, Ashish Bindal and Alain Wegmann

Abstract: A compositional hierarchy is the default organization of knowledge acquired for the purpose of specifying the design requirements of a service. Existing methods for learning compositional hierarchies from natural language text interpret composition as an exclusively propositional form of part-whole relations. However, the lexico-syntactic patterns used to identify occurrences of part-whole relations fail to decode the experientially grounded information that is very often embedded in various acts of natural language expression, e.g. construction and delivery. The basic idea is to take a situated view of conceptualization and model composition as the cognitive act of invoking one category to refer to another. Mutually interdependent sets of categories are considered conceptually inseparable and assigned an independent level of abstraction in the hierarchy. The presence of such levels in the compositional hierarchy highlights the need to model these categories as a unified whole, wherein they can only be characterized in the context of the behavior of the set as a whole. We adopt an object-oriented representation approach that models categories as entities and relations as cognitive references inferred from syntactic dependencies. The resulting digraph is then analyzed for cyclic references, which are resolved by introducing an additional level of abstraction for each cycle.

Paper Nr: 19
Title:

Relevant Information Discovery in Microblogs - Combining Post’s Features and Author’s Features to Improve Search Results

Authors:

Soumaya Cherichi and Rim Faiz

Abstract: The rapid growth of online data due to the internet and the widespread use of large databases have resulted in increased demand for knowledge discovery methodologies. However, while the development of information technology has made it possible to store larger data volumes at lower cost, the quantity and quality of information provided to users has, in turn, changed little. Companies need more than ever to transform data directly into valuable knowledge. The colossal size of the data and user demand challenge the scientific community to offer effective tools for information retrieval and knowledge discovery. Several works have proposed criteria for tweet search, but this area is still not well exploited, and consequently search results are often irrelevant. In this paper, we propose new features such as audience and RetweetRank. We investigate the impact of these criteria on search results for relevant information. Finally, we propose a new metric to improve the results of searches in microblogs. More precisely, we propose a retrieval model that combines content relevance, tweet relevance and author relevance. Each type of relevance is characterized by a set of criteria, such as audience to assess the relevance of the author and Out-Of-Vocabulary rate to measure the relevance of content. To evaluate our model, we built a knowledge management system and used a corpus of subjective tweets about Tunisian current events in 2012.

Paper Nr: 20
Title:

Extraction Student Dropout Patterns with Data Mining Techniques in Undergraduate Programs

Authors:

Ricardo Timarán Pereira, Andrés Calderón Romero and Javier Jiménez Toledo

Abstract: The first results are presented of a research project that aims to identify patterns of student dropout from the socioeconomic, academic, disciplinary and institutional data of students in undergraduate programs at the University of Nariño in Pasto (Colombia), using data mining techniques. A data repository was built with the records of students who were admitted between the first semester of 2004 and the second semester of 2006. Three complete cohorts were analyzed, with an observation period of six years, until 2011. Socioeconomic and academic student dropout profiles were discovered using a classification technique based on decision trees. The knowledge generated will support effective decision-making by university staff in developing policies and strategies for the student retention programs that are currently in place.

Paper Nr: 23
Title:

Trajectory Pattern Mining in Practice - Algorithms for Mining Flock Patterns from Trajectories

Authors:

Xiaoliang Geng, Takeaki Uno and Hiroki Arimura

Abstract: In this paper, we implement recent theoretical progress on depth-first algorithms for mining flock patterns (Arimura et al., 2013) based on the depth-first frequent itemset mining approach, such as Eclat (Zaki, 2000) or LCM (Uno et al., 2004). Flock patterns are a class of spatio-temporal patterns that represent groups of moving objects that stay close to each other during a given time segment (Gudmundsson and van Kreveld, Proc. ACM GIS’06; Benkert, Gudmundsson, Hubner, Wolle, Computational Geometry, 41:11, 2008). We implemented two extensions of a basic algorithm: one for a class of closed patterns, called rightward length-maximal flock patterns, and the other with a speed-up technique using geometric indexes. To evaluate these extensions, we ran experiments on synthetic datasets. The experiments demonstrate that the modified algorithms with the above extensions are several orders of magnitude faster than the original algorithm in most parameter settings.

Paper Nr: 29
Title:

Clustering and Classifying Text Documents - A Revisit to Tagging Integration Methods

Authors:

Elisabete Cunha, Álvaro Figueira and Óscar Mealha

Abstract: In this paper we analyze and discuss two methods that are based on traditional k-means for document clustering and that integrate social tags into the process. The first allows the integration of tags directly into a Vector Space Model, and the second proposes the integration of tags in order to select the initial seeds. We created a predictive model for the impact of tag integration in both methods, and compared the two methods using the traditional k-means++ and the novel k-C algorithm. To compare the results, we propose a new internal measure that allows the computation of cluster compactness. The experimental results indicate that the careful selection of seeds in the k-C algorithm yields better results than those obtained with k-means++, with and without the integration of tags.

Paper Nr: 36
Title:

Cross-domain Sentiment Classification using an Adapted Naïve Bayes Approach and Features Derived from Syntax Trees

Authors:

Srilaxmi Cheeti, Ana Stanescu and Doina Caragea

Abstract: Online product reviews contain information that can assist in the decision making process of new customers looking for various products. To assist customers, supervised learning algorithms can be used to categorize the reviews as either positive or negative, if large amounts of labeled data are available. However, some domains have few or no labeled instances (i.e., reviews), yet a large number of unlabeled instances. Therefore, domain adaptation algorithms that can leverage the knowledge from a source domain to label reviews from a target domain are needed. We address the problem of classifying product reviews using domain adaptation algorithms, in particular, an Adapted Naïve Bayes classifier, and features derived from syntax trees. Our experiments on several cross-domain product review datasets show that this approach produces accurate domain adaptation classifiers for the sentiment classification task.

Paper Nr: 38
Title:

Analysis of Mexican Research Production - Exploring a Scientifical Database

Authors:

Silvia B. González Brambila, Mihaela Juganaru-Mathieu and Claudia N. González-Brambila

Abstract: This paper presents an exploratory analysis of the research activity of a country using the ISI Web of Science collection. We decided to focus the work on Mexican research in computer science. The aim of this text mining work is to extract the main directions in this scientific field. The focal exploratory axis is clustering. We performed a two-fold analysis: the first on the frequency representation of the extracted terms, and the second, much larger and more difficult, on mining the document representations with the aim of finding clusters of documents, using the most frequent terms in the titles. The clustering algorithms applied were hierarchical clustering, k-means, DIANA, SOM, SOTA, PAM, AGNES and model-based clustering. Experiments with different numbers of terms and with the complete dataset were carried out, but the results were not satisfactory. We conclude that the best approach for this type of analysis is model-based clustering, because it gives a better classification, but it still needs better-performing algorithms. The results show that very few areas are developed by Mexican researchers.

Paper Nr: 42
Title:

Recommending the Right Activities based on the Needs of each Student

Authors:

Elias de Oliveira, Márcia Gonçalves Oliveira and Patrick Marques Ciarelli

Abstract: Personalization is, more than ever, a must-have requirement for today’s systems. In many areas one can find online systems whose goal is to provide each individual user with what they potentially need or desire. To achieve this goal, these systems need to rely on a good recommendation system. Recommendation systems work under the assumption that what one user needs could also apply to someone else with similar desires, tastes, or necessities. In this paper, we present a system for recommending extra activities to students according to their individual needs. An additional assumption is that a prompt reply and tailored guidance at each step of the learning process improve students’ chances of success. We propose the use of the kNN algorithm to assign activities to students as similarly as possible to how an expert would. The results are promising, as we are able to mimic human decisions 90.0% of the time.
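A minimal version of the kNN assignment step might look like the following sketch; the feature vectors describing students and the activity labels are hypothetical placeholders, not the paper's actual representation:

```python
import math
from collections import Counter


def knn_recommend(student, labeled_students, k=3):
    """Recommend the activity most common among the k labeled students
    whose (hypothetical) feature vectors are closest to the target
    student under Euclidean distance.

    `labeled_students` is a list of (feature_vector, activity) pairs.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(labeled_students, key=lambda s: dist(student, s[0]))[:k]
    return Counter(activity for _, activity in nearest).most_common(1)[0][0]
```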

Paper Nr: 47
Title:

SmartNews: Bringing Order into Comments Chaos

Authors:

Marina Litvak and Leon Matz

Abstract: Various news sites exist today where an internet audience can read the most recent news and see what other people think about it. Most sites do not organize comments well and do not filter irrelevant content. Due to this limitation, readers who are interested in other people’s opinions on a specific topic have to follow relevant comments manually, reading and filtering a lot of irrelevant text. In this work, we introduce a new approach for retrieving and ranking the comments relevant to a given paragraph of a news article, and vice versa. We use Topic-Sensitive PageRank for ranking the comments/paragraphs relevant to a user-specified paragraph/comment. The browser extension implementing our approach (called SmartNews) for Yahoo! News is publicly available.
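The ranking step can be illustrated with a plain power-iteration version of personalized (topic-sensitive) PageRank over a comment/paragraph graph; how the edges are built from text similarity is an assumption left out of this sketch:

```python
def personalized_pagerank(adj, personalization, damping=0.85, iters=100):
    """Power-iteration personalized PageRank. `adj` maps each node to
    a list of out-neighbors; `personalization` concentrates the
    teleport mass on the user-selected paragraph or comment. Dangling
    nodes simply leak their rank mass, which is acceptable for a sketch."""
    nodes = list(adj)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        fresh = {n: (1 - damping) * personalization.get(n, 0.0) for n in nodes}
        for n in nodes:
            if adj[n]:
                share = damping * rank[n] / len(adj[n])
                for m in adj[n]:
                    fresh[m] += share
        rank = fresh
    return rank
```

Sorting the comment nodes by their resulting rank then gives the relevance ordering for the selected paragraph.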

Paper Nr: 54
Title:

On the Extension of k-Means for Overlapping Clustering - Average or Sum of Clusters’ Representatives?

Authors:

Chiheb-Eddine Ben N'Cir and Nadia Essoussi

Abstract: Clustering is an unsupervised learning technique which aims to fit structures to unlabeled data sets. Identifying non-disjoint groups is an important issue in clustering. This issue arises naturally because many real-life applications need to assign each observation to one or several clusters. To deal with this problem, recently proposed methods are based on theoretical, rather than heuristic, models and introduce overlaps into their optimized criteria. In order to model overlaps between clusters, some of these methods use the average of the clusters’ prototypes while other methods are based on the sum of the clusters’ prototypes. The use of SUM or AVERAGE can have a significant impact on the theoretical validity of the method and affects the induced patterns. Therefore, in this paper we study the patterns induced by these approaches through a comparison of the patterns induced by the Overlapping k-means (OKM) and Alternating Least Squares (ALS) methods, which generalize k-means for overlapping clustering and are based on the AVERAGE and SUM approaches respectively.
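The two modelling choices can be made concrete with a toy function computing the representative of an observation assigned to several clusters; this is a simplified sketch of the AVERAGE (OKM-style) versus SUM (ALS-style) combination of prototypes, not either method in full:

```python
def combine_prototypes(prototypes, mode="average"):
    """Representative of an observation assigned to several clusters:
    the SUM of the assigned clusters' prototypes (ALS-style) or their
    AVERAGE (OKM-style). `prototypes` is a non-empty list of
    equal-length numeric vectors."""
    dims = len(prototypes[0])
    total = [sum(p[d] for p in prototypes) for d in range(dims)]
    if mode == "sum":
        return total
    return [t / len(prototypes) for t in total]
```

With SUM, the representative can leave the convex hull of the prototypes, which is one reason the choice affects the induced patterns.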

Paper Nr: 56
Title:

i-SLOD: Towards an Infrastructure for Enabling the Dissemination and Analysis of Sentiment Data

Authors:

Rafael Berlanga, Dolores Mª Llidó, Lisette García, Victoria Nebot, María José Aramburu and Ismael Sanz

Abstract: This paper proposes a new data infrastructure for massive opinion analysis, called i-SLOD, from a Business Intelligence (BI) perspective. This infrastructure aims to allow analysts to re-use the existing review data about products and services publicly available on the Web. It should also take advantage of the external relationships of i-SLOD data in order to perform new exploratory analyses that are currently unfeasible with traditional BI tools. We consider the adoption of Linked Open Data (LOD) technology to build this infrastructure. In this way, i-SLOD data will be published as distributed linked open data using the RDF and OWL formats. Moreover, we propose to apply automatic semantic annotation to perform the basic tasks in i-SLOD, mainly the extraction of opinion facts from raw text and the linking of opinion data to i-SLOD and other related LOD datasets.