W7: Data Mining

Organised by Jessica Spate, Eibe Frank, Karina Gibert, Xindong Wu, and Joaquim Comas


Title: Data Mining as a Tool for Environmental Scientists

Authors: JM Spate, K Gibert, M Sànchez-Marrè, E Frank, J Comas, and I Athanasiadis

Abstract: Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than more classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques such as Artificial Neural Networks, Clustering, Case-Based Reasoning and more recently Bayesian Decision Networks have found application in environmental modelling while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data mining techniques could be usefully applied to difficult problems in the field. This paper introduces several data mining concepts and briefly discusses their application to environmental modelling, where data may be sparse, incomplete, or heterogenous.

Title: Data mining and image segmentation approaches for classifying defoliation in aerial forest imagery

Authors: Kyoko Fukuda, Phillip Pearson

Abstract: Maintaining the health of forest ecosystems by monitoring and managing defoliation is important for the protection of our natural resources and economy. The purpose of this study is to investigate and develop a method for classifying defoliation in aerial imagery using data mining (WEKA, J48 [2]) that requires only a minimal amount of prior knowledge, but is comparable to aerial surveys using human observers. The following approach is taken: 1) using a small training data set, 2) histograms are taken of red, green, blue and greyscale colour channels over relatively large (20x20-pixel) image tiles, and 3) data points consist of only four values: the peaks of the four histograms. The approach is compared with an image segmentation method that detected the border of defoliation precisely and provided contours of defoliation levels, and also tested on smaller tiles (10x10 pixels). The method was tested on imagery of defoliation caused by the mountain pine beetle [1]. Damaged regions are classified into severe (S), moderate (M), and light (L), by colour and proportion of trees killed. Other regions are classified into vegetation (V), ground surface (Surface), and non-attacked (Non). Two to five training data points were randomly selected for each classification. Results using two training data points were found to be encouraging (accuracy of S: 68%; M: 100%; L: 50%; Non: 20%; V: 69%; and Surface: 100%), and three training data points were more comparable with manual defoliation classifications given in [1], and accuracy was improved for L (75%) and Non (50%). Further investigations on patterns in colour channel values improved classifications (S: 59%; M: 100%; L: 100%; Non: 48%; V: 100%; and Surface: 100%). Smaller tiles provided better accuracy for non-attack classification (68%). The low classification accuracy of non-attacked regions may be due to lack of strong identification criteria and confusion with old attacked (grey) regions. 
The non-attacked region is found to be more heterogeneous, including light green and yellow-green colours. The use of data mining helps reduce the time and knowledge required to design a classifier. The approach investigated here compares satisfactorily with the image segmentation method and with aerial surveys using human observers, although the classification accuracy requires further improvement.

[1] B.C. MFCFS (2000) Overview Aerial Survey Standards for British Columbia and the Yukon; http://srmwww.gov.bc.ca/risc/pubs/teveg/foresthealth/assets/aerial-2.jpg
[2] Witten, I.H. and Frank, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann Publishers, San Francisco.
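The feature extraction described in this abstract (one histogram peak per colour channel for each image tile) can be illustrated with a minimal Python sketch. The function names, the tie-breaking rule and the greyscale conversion (a plain channel mean) are assumptions made for illustration; the abstract does not specify them, and the subsequent J48 classification step is not shown.

```python
from collections import Counter


def channel_peak(values):
    """Return the most frequent intensity (the histogram peak) of one channel.

    Ties are broken in favour of the lowest intensity (an assumption).
    """
    counts = Counter(values)
    highest = max(counts.values())
    return min(v for v, c in counts.items() if c == highest)


def tile_features(pixels):
    """Map a list of (r, g, b) pixels to (peak_r, peak_g, peak_b, peak_grey)."""
    reds = [p[0] for p in pixels]
    greens = [p[1] for p in pixels]
    blues = [p[2] for p in pixels]
    # Greyscale taken as the integer channel mean; the conversion is assumed.
    greys = [(r + g + b) // 3 for r, g, b in pixels]
    return tuple(channel_peak(c) for c in (reds, greens, blues, greys))


def split_into_tiles(image, size=20):
    """Yield (top, left, pixels) for every full size x size tile of the image.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            pixels = [image[y][x]
                      for y in range(top, top + size)
                      for x in range(left, left + size)]
            yield top, left, pixels


# Synthetic 20x40 image: a reddish tile next to a greenish tile.
image = [[(200, 60, 40)] * 20 + [(40, 160, 50)] * 20 for _ in range(20)]
features = [tile_features(pixels) for _, _, pixels in split_into_tiles(image)]
print(features)  # [(200, 60, 40, 100), (40, 160, 50, 83)]
```

Each four-value tuple would then serve as one data point for a decision-tree learner, together with its class label (S, M, L, Non, V or Surface) for the training tiles.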

Title: Finding relevant features for the characterization of the ecological status of human altered streams using a constrained mixture model

Authors: Alfredo Vellido, Joaquim Comas, Raul Cruz, Eugenia Marti

Abstract: The integration of machine learning and statistical techniques may prove to be a useful approach to data-based environmental modelling, by increasing the robustness and reliability of the resulting models. In this study, we use an extension of Generative Topographic Mapping (GTM), a model for data clustering and visualization. It can rightly be characterized as both a statistical constrained mixture of distributions and a probabilistic alternative to Self-Organizing Maps (SOM), a machine learning technique. The extension of the standard GTM consists of a method for the unsupervised estimation of relative feature relevance in terms of the data cluster structure. This method will be applied to the empirical data forming the knowledge base of an environmental decision support system developed for the European project STREAMES. These data, which come from several low-order streams located throughout Europe and Israel, with emphasis on the Mediterranean region, consist of several types of features, including physical, chemical and biological parameters. In this study, we first aim to find which of those features are most relevant to understanding and explaining the cluster structure of the data. Ecological status is defined in accordance with the current European Water Framework Directive policy and, although more commonly evaluated by means of communities of organisms or habitat descriptors, is here described in terms of stream nutrient retention (a functional ecosystem attribute). Given the high dimensionality of the data, any description of its cluster structure in terms of only the most relevant features should ease the interpretation of the results and facilitate the task of water managers acting on them. A second goal of the study is to find out which variables most influence the ecological status of the streams.
Again, this knowledge should help water managers to focus efforts and resources on strategies that minimize negative human impacts on vulnerable low-order streams.

Title: Application of Data Mining Techniques to Obtain Qualitative Models for Agricultural Contaminants in Ground Waters

Authors: Javier Aroba, M. Luisa De la Torre

Abstract: The main objective of the present study is to test the conceptual model proposed by Grande et al. (1996) [1] regarding the behaviour of nitrate and other contaminant concentrations in a detrital aquifer undergoing overexploitation and intensive cropping of strawberries and citrus trees. Specifically, the authors intend to test this model, which applies classical statistical techniques, such as factor and correlation analysis, to a data set resulting from the sampling and analysis of a network of 54 wells distributed across the system’s recharge zone. That model established a close dependency between nitrate ion concentrations in the saturated zone of the study area and the presence of strawberry crops. Data in [1] put the natural average recharge to the aquifer at about 100 hm³/year. However, this estimate is probably too high considering the prolonged drought that has affected the region in the last few years, and the continuous increase in agriculture. The qualitative model proposed here, to be compared with that of [1], has been obtained by applying fuzzy logic and data mining techniques to the same data set used to obtain the model described in [1]. The resulting fuzzy model gives researchers unskilled in data mining techniques an easy and intuitive interpretation that allows immediate qualitative analysis of the data. As a result, it can be concluded that, for the studied sector, nitrate contamination is an almost direct consequence of the development of strawberry crops, while orange trees hardly contribute to the increase in nitrate concentration in the saturated zone, as already proposed in the model being tested. Keywords: Pollution, Nitrate, Aquifer, Fuzzy logic, Data mining.

[1] Grande et al. (1996) “Application of factor analysis to the study of contamination in the aquifer system of Ayamonte-Huelva (Spain)”. Ground Water 34(1): 155-161.

Title: A Data Mining Approach to Enhance Knowledge Extraction in Environmental Databases

Authors: Xavier Flores Alsina, Joaquim Comas, Karina Gibert, Miquel Sanchez-Marre, Ignasi Rodriguez Roda

Abstract: An Intelligent Environmental Decision Support System (IEDSS) can be defined as an intelligent information system that reduces the time in which decisions are made in an environmental domain, and improves the consistency and quality of those decisions. The full success of an IEDSS mainly depends on the knowledge it embodies, which provides the system with enhanced abilities to reason about the environmental system in a more reliable way. A data mining approach to enhance the extraction of knowledge from environmental databases is presented in this paper. This knowledge will be used in the IEDSS to make decisions. The approach combines clustering and tree induction methods to deal with unknown, unlabelled and ill-defined databases, such as environmental databases. First, characteristic situations are identified by finding groups of homogenous objects; then a set of rules that can predict the classes found previously is discovered. The usefulness of the proposed approach is demonstrated with an environmental data set taken from the simulation results of the IWA/COST simulation benchmark plant. Both water quality and operational parameters from different spatial locations are mined by means of clustering and tree induction methods. Different abnormal situations (rain, storm,…) are identified, with the number of classes determined using the k-means algorithm, and their main causes are recognized by inducing a classification tree and a set of rules by means of the J48 algorithm. Thus, the system is provided with valuable information that could be used to set up an initial library for CBRs, to build a set of heuristic rules for problem diagnosis, to design control strategies, and to identify the causes of abnormal situations.
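The two-stage idea in this abstract (cluster unlabelled records, then induce rules that predict the clusters) can be sketched in plain Python. This is only an illustration under stated assumptions: the data values and variable meanings (flow, ammonia) are invented, k-means uses a naive deterministic initialisation, and a one-level threshold search stands in for the J48 tree induction used by the authors.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def best_stump(points, labels):
    """Return (errors, feature, threshold) for the best single-threshold rule.

    A one-level stand-in for decision-tree induction: each side of the split
    predicts its majority cluster, and errors counts the minority points.
    """
    best = None
    for f in range(len(points[0])):
        values = sorted(set(p[f] for p in points))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for p, lab in zip(points, labels) if p[f] <= t]
            right = [lab for p, lab in zip(points, labels) if p[f] > t]
            errors = sum(len(side) - max(side.count(c) for c in set(side))
                         for side in (left, right))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best


# Hypothetical (flow, ammonia) records: three dry-weather, three rain.
records = [(1.0, 30.0), (1.1, 29.0), (0.9, 31.0),
           (3.0, 12.0), (3.2, 11.0), (2.9, 13.0)]
labels, _ = kmeans(records, k=2)
errors, feature, threshold = best_stump(records, labels)
print(labels)                      # [0, 0, 0, 1, 1, 1]
print(errors, feature, threshold)  # rule: feature 0 (flow) above the threshold -> cluster 1
```

The resulting rule reads as "if flow exceeds the threshold, the record belongs to the second cluster", which is the kind of human-readable description that tree induction contributes on top of clustering.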

Title: Comparison of linear and non-linear PLS methods for soft-sensing of an SBR for nutrient removal

Authors: Kris Villez, Dae Sung Lee, Christian Rosen, Peter Vanrolleghem

Abstract: Increasing demands on the performance of wastewater treatment plants (WWTPs) lead to a search for advanced control strategies, often relying on high-level sensors and actuators. Despite promising results in research, advanced strategies fail to gain trust in practice. Because the biological processes are sensitive to influent and operational disturbances, and effective forms of real-time on-line monitoring are lacking, operators may be unable to find the causes of faults. A suitable strategy for on-line monitoring is therefore essential to enhance biological process control. In this work, a suitable multivariate soft-sensor is sought to form part of an integrated fault detection and control system for a pilot-scale SBR system. By means of this soft-sensor the effluent quality can be estimated well before off-line analysis is finished. For this purpose, several multivariate methods are available, including (linear) PLS, Neural Network Partial Least Squares (NNPLS) and Kernel Partial Least Squares (KPLS). Non-linear extensions of PLS such as NNPLS require the fitting of non-linear functions; KPLS is the exception, providing non-linear PLS modelling without any non-linear optimisation. Similar to Kernel Principal Component Analysis (KPCA), the method is based on a non-linear transformation of the process data. If chosen well, the transformation leads to linearised characteristics of the process data. A linear PLS model is then fitted between the transformed process data and the predicted variables. Linear PLS, NNPLS and KPLS were compared with each other regarding their ability to predict effluent quality data and their computational requirements. While (linear) PLS and NNPLS lead to acceptable prediction, KPLS results in overfitting to the data. This indicates that the main non-linear correlations are to be found between process data and effluent quality data, rather than within the process data. Moreover, the computational requirements of KPLS were large compared to PLS and NNPLS. When comparing PLS and NNPLS to each other, it was found that NNPLS leads to the best possible prediction while the extra computational requirements are minimal.

Title: On the prediction of the ecological status of human-altered streams and its rule-based interpretation

Authors: Terence A. Etchells, Alfredo Vellido, Eugenia Marti, Paulo J.G. Lisboa, Joaquim Comas

Abstract: This study concerns the analysis of the empirical data from the knowledge base of an environmental decision support system developed for the European project STREAMES. These data, which come from several low-order streams located throughout Europe and Israel, with emphasis on the Mediterranean region, consist of several types of features, including physical, chemical and biological parameters. More specifically, we aim to classify these data according to the ecological status of the streams they correspond to. Ecological status is here defined in accordance with the current European Water Framework Directive policy and is described in terms of stream nutrient retention (a functional ecosystem attribute). In its simplest form, this could be understood as a supervised classification problem. Such an approach, though, does not fully account for the cluster structure of the data. This structure should not be ignored, especially when analyzing small data samples, such as the one available for this study. We first aim to explore what modifications in the cluster structure of these data are induced by the addition of information on the ecological status of the corresponding streams. A semi-supervised extension of Generative Topographic Mapping (GTM), a model for data clustering and visualization, is used for this purpose: GTM can be understood as both a statistical constrained mixture of distributions and a probabilistic alternative to the neural network-inspired Self-Organizing Maps (SOM). Secondly, we focus on the problem of classification itself, using supervised neural networks. The interpretability of the obtained classification results is likely to be improved by their description in terms of simple, actionable rules. This is accomplished through the application of Orthogonal Search-based Rule Extraction (OSRE), a novel overlapping rule extraction method.
All the newly acquired knowledge should help water managers to focus their efforts on strategies that minimize the negative human impacts on vulnerable low-order streams.

Title: Data mining approaches to explaining aerosol formation

Authors: Saara Hyvonen, Heikki Junninen, Lauri Laakso, Miikka Dal Maso, Tiia Gronholm, Boris Bonn, Petri Keronen, Pasi Aalto, Veijo Hiltunen, Toivo Pohja, Samuli Launiainen, Pertti Hari, Heikki Mannila, Markku

Abstract: Atmospheric aerosol particle formation is frequently observed in various environments. Yet, despite numerous studies, the processes behind these so-called nucleation events remain unclear. In this work data mining techniques are used to detect factors influencing particle formation. These techniques are applied to a dataset of 80 variables collected over 8 years at the boreal forest station (SMEAR II) in Southern Finland, including air pollutant, weather, gas and particle measurements. In order to understand what causes nucleation we use classification methods together with feature selection. Each day is classified as an event day, when a nucleation event occurs, or as a non-event day, and looking at which features are selected gives us information on which factors are important for the aerosol formation process. In this way we are able to identify two key variables, relative humidity and pre-existing aerosol particle surface (condensation sink), capable of explaining 88% of the nucleation events. Nucleation only occurs with low relative humidity and condensation sink values. Different classification methods perform slightly differently, but these two key variables remain unchanged, while including further parameters does not improve the results notably. Using these two variables it is possible to derive a nucleation probability function. This nucleation probability function has been tested on data collected from other sites. The two key variables are related to mechanisms that prevent nucleation from starting and particles from growing. One reason for the domination of this preventive mechanism could be the existence of more than one mechanism causing nucleation. Another intriguing phenomenon, possibly related to this, is the temporal variation of nucleation events. We have investigated these problems by using classification methods together with clustering and sliding window approaches. We discuss some aspects of these methods and present results obtained by them.

Title: Neural identification of fuzzy anomalies in pressurized water systems

Authors: Joaquin Izquierdo, Rafael Perez, P. Amparo Lopez, Pedro L. Iglesias

Abstract: The objective of a Water Supply System (WSS) is to convey treated water to consumers through a pressurized network of pipes. A number of meters and gauges are used to take continuous or periodic measurements that are sent via a telemetry system to the control and operation centre and used to monitor the network. Using this typically limited number of measurements together with demand predictions, the state of the system must be assessed. Suitable state estimation is of paramount importance in diagnosing leaks and other anomalies in WSS. But this task can be really cumbersome, if not unattainable, for human operators. The aim of this paper is to explore the possibility for a neural network to perform such a task. On the one hand, state estimation of a network is performed by using optimization techniques that minimize the discrepancies between the measurements taken by telemetry and the values produced by the mathematical model of the network, which tries to reconcile all the available information. On the other hand, although the model can be completely accurate, the estimation is based on data containing non-negligible levels of uncertainty, which definitely influences the precision of the estimated states. The quantification of the uncertainty of the input data (telemetry measurements and demand predictions) can be achieved by means of robust state estimation. By making use of the mathematical model of the network, estimated states together with uncertainty levels, that is to say, fuzzy estimated states, can be obtained for different anomalous states of the network. A description of the anomaly associated with each such fuzzy state must also be stored. The final aim is to train a neural network capable of assessing WSS anomalies associated with particular sets of measurements received by telemetry and demand predictions.

Title: Optimal Modularization of Learning Models in Forecasting Natural Phenomena

Authors: Dimitri Solomatine

Abstract: One of the ways to increase the accuracy of data mining (machine learning) algorithms is to build committee machines. One type of such model is a modular model, which comprises a set of specialized (local) models. Each of them is trained on a subset of the training set and becomes responsible for a particular region of the input space. Many algorithms for allocating such regions to local models do this in an automatic fashion. In forecasting natural processes, however, domain experts want to bring more knowledge into such allocation, and to have some control over the choice of models. This paper presents a number of approaches to building modular models based on various ways of fragmenting the training set and combining the models’ outputs (hard splits, statistically and deterministically driven soft combinations of models, etc.). The issue of including a domain expert in the modelling process is also discussed, and new algorithms in the class of model trees (piece-wise linear modular regression models) are presented. The presented algorithms show higher accuracy and transparency than the more traditional “global” learning models. A case study of river flow forecasting is considered.
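The simplest modular scheme mentioned above, a hard split of the input space with one local linear model per region, can be sketched as follows. The class name, the single-input setting, the expert-chosen threshold and the data are all hypothetical; this is not the author's algorithm, only an illustration of why local models can beat one global model on a regime-switching process. The sketch assumes both regions contain training data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


class HardSplitModel:
    """Two local linear models separated by a fixed threshold on the input."""

    def __init__(self, threshold):
        # The threshold would come from a domain expert (hypothetical value).
        self.threshold = threshold
        self.low = None
        self.high = None

    def fit(self, xs, ys):
        low = [(x, y) for x, y in zip(xs, ys) if x <= self.threshold]
        high = [(x, y) for x, y in zip(xs, ys) if x > self.threshold]
        self.low = fit_line(*zip(*low))
        self.high = fit_line(*zip(*high))
        return self

    def predict(self, x):
        slope, intercept = self.low if x <= self.threshold else self.high
        return slope * x + intercept


def squared_error(predict, xs, ys):
    """Sum of squared residuals of a prediction function on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical rainfall -> flow data with a regime change above x = 10.
xs = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
ys = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 5.0, 9.0, 13.0, 17.0]

modular = HardSplitModel(threshold=10).fit(xs, ys)
g_slope, g_intercept = fit_line(xs, ys)

print(squared_error(modular.predict, xs, ys))                       # close to zero
print(squared_error(lambda x: g_slope * x + g_intercept, xs, ys))   # clearly larger
```

A soft combination would replace the if/else in `predict` with a weighted average of the two local predictions near the threshold.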

Title: A framework for spatio-temporal data analysis and hypothesis exploration

Authors: Alexander Campbell, Binh Pham, Yu-Chu Tian

Abstract: We present a general framework for pattern discovery and hypothesis exploration in spatio-temporal data sets that is based on delay-embedding. This is a remarkable method of nonlinear time-series analysis that allows the full phase-space behaviour of a system to be reconstructed from only a single observable (accessible variable). Recent extensions to the theory that focus on a probabilistic interpretation extend its scope and allow practical application to noisy, uncertain and high-dimensional systems. The framework uses these extensions to aid alignment of spatio-temporal sub-models (hypotheses) to empirical data - for example satellite images plus remote-sensing - and to explore modifications consistent with this alignment. The novel aspect of the work is a mechanism for linking global and local dynamics using a holistic spatio-temporal feedback loop. An example framework is devised for an urban based application, transit centric developments, and its utility is demonstrated with real data.
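Delay embedding, on which the framework is based, reconstructs state vectors from lagged copies of a single observed series. A minimal Python sketch is given below; the function name and the forward-lag convention (x[t], x[t + delay], ...) are assumptions, and the probabilistic extensions mentioned in the abstract are not covered.

```python
def delay_embed(series, dimension, delay):
    """Return the delay vectors of a scalar time series.

    Each vector is (x[t], x[t + delay], ..., x[t + (dimension - 1) * delay]).
    """
    span = (dimension - 1) * delay
    return [
        tuple(series[t + i * delay] for i in range(dimension))
        for t in range(len(series) - span)
    ]


# Embed a short series in 3 dimensions with a delay of 2 samples.
print(delay_embed([0, 1, 2, 3, 4, 5], dimension=3, delay=2))
# [(0, 2, 4), (1, 3, 5)]
```

Each tuple is one point in the reconstructed phase space; a series shorter than the embedding span simply yields an empty list.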

Title: Data mining approaches for monitoring and modeling the invasion of honeylocust tree (Melia azedarach) in El Palmar Nacional Park

Authors: Priscilla Minotti, Ana Scopel, Fernando Ruiz Selmo

Abstract: We explored the use of data mining approaches to develop models that will help El Palmar Nacional Park (Entre Rios, Argentina) managers to control and reduce the woody encroachment of the exotic Melia azedarach tree into the palm savannah. Detecting early stages of Melia invasion using standard image processing techniques has not been successful because the spectral signal of young trees is masked within the surrounding vegetation, and spatial autocorrelation and non-linear relations between spectral attributes prevent the use of traditional statistical techniques such as discriminant analysis. We compared the use of machine learning algorithms to relate GPS point data on invasion condition (not invaded, early stage, established) with 54 spectral attributes derived from winter and summer Landsat ETM images. We chose Naive Bayes (NB), a three-layered backpropagation perceptron (ANN), IBk nearest neighbour (KNN), J48 decision tree (Tree) and JRIP (Rules) using WEKA 3.4, as approaches which in terms of methods and results could be rather easily explained to park managers. Data were balanced to reduce the representation of non-invaded points. Attributes were selected using a combination of methods. KNN and Tree proved to be the best overall methods to classify image spectral attributes into invasion stages based on global measures of error. Considering performance indicators for individual classes, such as area under the ROC curve, TP rate and FP rate, the most promising algorithms for detecting early stages of Melia were KNN and ANN. Confusion matrices for all methods showed that early stages are easily confused with non-invaded places but established trees can be classified with great accuracy. We are currently evaluating the operational use of KNN for monitoring the current stage of other exotic species and for predicting the stage of invasion backwards using historic Landsat imagery.
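The nearest-neighbour classification and confusion-matrix evaluation described above can be sketched in plain Python. The study itself used WEKA's IBk on 54 spectral attributes; the two-attribute data, the class names as strings and the function names below are hypothetical and serve only to show the mechanics.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to the query.

    `train` is a list of (features, label) pairs; distance is squared Euclidean.
    """
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def confusion_matrix(pairs, classes):
    """Count (actual, predicted) pairs into a nested dict matrix[actual][predicted]."""
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for actual, predicted in pairs:
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical two-attribute spectral signatures for two invasion classes.
train = [
    ((0.10, 0.80), "not invaded"), ((0.20, 0.90), "not invaded"),
    ((0.15, 0.85), "not invaded"),
    ((0.80, 0.20), "established"), ((0.90, 0.10), "established"),
    ((0.85, 0.15), "established"),
]
test = [((0.12, 0.82), "not invaded"), ((0.88, 0.12), "established")]

pairs = [(label, knn_predict(train, features)) for features, label in test]
print(confusion_matrix(pairs, ["not invaded", "established"]))
# {'not invaded': {'not invaded': 1, 'established': 0},
#  'established': {'not invaded': 0, 'established': 1}}
```

Per-class TP and FP rates follow directly from the rows and columns of this matrix, which is how confusion between early-stage and non-invaded points would show up.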