**Authors:** Roma Tauler, Pentti Paatero and Philip Hopke with the assistance of Ronald C. Henry, Cliff Spiegelman, Eun Sug Park, and Richard L. Poirot

**Abstract:** Current approaches, recent developments, and software related to multivariate factor analysis and related methods in the analysis of environmental data for the identification, resolution, and apportionment of contamination sources are discussed and compared.

**Authors:** Cliff Spiegelman, Eun Sug Park, Byron Gajewski

**Abstract:** Air pollution monitoring stations frequently collect many pollutants. Using multivariate pollution measurements, receptor modelers estimate the major pollution source profiles and their contributions to pollution at the receptor. The modeling is nonlinear and uses partially confirmatory factor analysis approaches. Receptor modeling is an accepted and increasingly used technique at the US EPA. This talk discusses jackknife approaches to assess the accuracy of receptor models, including new data-based estimates of bias and standard errors.
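The delete-one jackknife underlying such bias and standard-error estimates can be sketched as follows; the statistic and the data here are invented for illustration, not taken from the talk.

```python
import math
from statistics import mean

def jackknife(data, stat):
    """Delete-one jackknife estimates of bias and standard error for `stat`."""
    n = len(data)
    theta_hat = stat(data)
    # Recompute the statistic with each observation left out in turn.
    thetas = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    theta_bar = mean(thetas)
    bias = (n - 1) * (theta_bar - theta_hat)
    se = math.sqrt((n - 1) / n * sum((t - theta_bar) ** 2 for t in thetas))
    return bias, se

# Hypothetical daily mass contributions (ug/m3) apportioned to one source.
contributions = [3.1, 4.7, 2.9, 5.2, 4.1, 3.8, 4.4, 3.5]
bias, se = jackknife(contributions, mean)
```

For the sample mean the jackknife bias is exactly zero and the standard error reduces to the usual s/√n; for the nonlinear estimates produced by receptor models, both quantities become informative.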

**Authors:** Mar Viana, Jon Zabalza, Xavier Querol, Andres Alastuey, Jesus Miguel Santamaria, Jose Inaki Gil, Marina Menendez, Philip K. Hopke

**Abstract:** The effective reduction of ambient levels of atmospheric particulate matter (PM) requires the precise identification and characterization of its emission sources. A number of source apportionment techniques are currently available (PMF, ME, UNMIX, PCA-MLRA), but relatively few studies have evaluated the differences and similarities between the methods. Our study presents a comparative analysis of the results obtained from the application of PMF (Positive Matrix Factorization) and PCA-MLRA (Principal Component Analysis coupled with Multi-Linear Regression) to one dataset containing the chemical composition of PM2.5 at an industrial site in Northern Spain. Six sources of PM2.5 were identified by PCA whereas seven were obtained by PMF; sources 1-5 coincided between the two methods (sea-spray, mineral, steel manufacture, pigment manufacture and regional-scale circulations). Source 6 was identified as traffic with PCA, whereas with PMF it was defined as "local" because it includes not only traffic but also other local-scale anthropogenic sources. Source 7 (secondary sulfate and nitrate) was only observed with PMF. The correlation between observed and predicted mass was high in both cases (R2=0.87 for PCA-MLRA, 0.85 for PMF). Mean annual contributions from the different sources were quite similar between the two methods, showing only minor differences of 1-3% for each source. The correlation between the contributions resolved by the two methods for sources 1-5 ranged from R2=0.71 to 0.91. The unidentified fraction was smaller with PMF than with PCA-MLRA (2% vs. 9%), owing to the identification of an additional source.
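The agreement metric quoted above (R2 between the contributions the two methods resolve for the same source) can be sketched as the squared Pearson correlation of the two time series; the daily values below are hypothetical.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two contribution time series."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical daily contributions (ug/m3) for one source from each method.
pmf = [2.0, 3.5, 1.2, 4.8, 2.9, 3.1]
pca_mlra = [2.2, 3.3, 1.5, 4.4, 3.0, 2.8]
r2 = r_squared(pmf, pca_mlra)
```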

**Authors:** Richard Poirot, Rudolf Husar

**Abstract:** Ensemble backward trajectory techniques have been employed over the past several decades to identify regional origins of air pollutants subject to synoptic-scale atmospheric transport. The Combined Aerosol Trajectory Tools (CATT) is a recently developed web-based application employing "datafed" architecture (http://datafed.net), which allows users to conduct a wide range of single-site or multi-site ensemble trajectory analyses with minimal time and effort. CATT is currently functional for all aerosol chemistry data (about 35 species) from each of more than 350 ambient monitoring sites in the urban US EPA STN (Speciation Trends Network) and the rural IMPROVE (Interagency Monitoring of Protected Visual Environments) networks. These aerosol data and associated daily ATAD model 5-day back trajectories cover the entire period of record from the late 1980s at some sites through the end of 2004, and are periodically updated as new data become available. CATT users may also submit their own measured or modeled data files to datafed.net for subsequent CATT analyses, if those data are for similar locations and time periods. For example, any results from the Positive Matrix Factorization (PMF) or Unmix receptor models obtained from analyzing any IMPROVE or STN data can be submitted to CATT for ensemble trajectory analyses. This paper will summarize the currently available CATT analysis tools, with applications both to the raw species data and to selected PMF and Unmix model results.

**Authors:** Roma Tauler, Emma Pera-Trepat

**Abstract:** Analysis of samples obtained in environmental monitoring programs by means of modern instrumental techniques produces large amounts of physical parameters and chemical concentration values spread across multiple geographical sites and time periods. Moreover, the content of these chemicals may also be estimated in different environmental compartments (e.g. air, water, sediments, biota). These large data sets can be organized in data structures of different complexity, ranging from two-way structures (data tables or data matrices) to more complex three-way and multiway structures. These data structures may then be analysed by means of bilinear, trilinear or multilinear model-based methods. In particular, in this communication, we will present the application of Multivariate Curve Resolution Alternating Least Squares (MCR-ALS) to the analysis of environmental monitoring data sets for the purpose of identification, resolution and apportionment of the main contamination sources operating over a particular geographical region and during a time period (geographical and temporal source distributions). Special attention will be paid to topics such as: a) the effect of different data pretreatment methods; b) rotation ambiguities in MCR-ALS results; and c) the effect of errors in MCR-ALS estimations.
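At its core, MCR-ALS fits the bilinear model D ≈ C Sᵀ (contributions times profiles) by alternating least squares under constraints such as non-negativity. A minimal sketch, with non-negativity imposed by simple clipping and simulated data rather than the monitoring sets discussed in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated bilinear data: 60 samples x 10 variables from 3 non-negative sources.
C_true = rng.random((60, 3))   # source contributions per sample
S_true = rng.random((10, 3))   # source profiles per variable
D = C_true @ S_true.T

def mcr_als(D, k, n_iter=200):
    """Alternating least squares for D ~ C @ S.T with non-negativity (clipping)."""
    S = rng.random((D.shape[1], k))
    for _ in range(n_iter):
        # Fix S, solve S @ C.T ~ D.T for C, then clip negative values.
        C = np.clip(np.linalg.lstsq(S, D.T, rcond=None)[0].T, 0, None)
        # Fix C, solve C @ S.T ~ D for S, then clip negative values.
        S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0].T, 0, None)
    return C, S

C, S = mcr_als(D, 3)
lack_of_fit = np.linalg.norm(D - C @ S.T) / np.linalg.norm(D)
```

Note that even a perfect fit does not pin down C and S uniquely: any invertible transformation T with C T⁻¹ and S Tᵀ still feasible yields the same D, which is the rotation ambiguity the talk addresses.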

**Authors:** Ronald Henry

**Abstract:** Like all forms of regression, nonparametric regression (also known as kernel smoothing) estimates the expected value of a dependent variable given one or more predictor variables. Unlike other forms of regression, however, no assumptions are made about the functional form of the relationship between the dependent and predictor variables or the statistical distribution of the error in the variables. The method works by taking a weighted moving average of the dependent variable over the range of the predictor variables. The weighting function, or kernel, is usually a Gaussian function or a simple quadratic function. A method is given for source apportionment of local sources of air pollution by nonparametric regression of the concentration of a pollutant on wind speed and direction. It involves finding the empirical joint probability distribution of wind speed and direction by kernel density estimation, which is very closely related to nonparametric regression. Some other uses of nonparametric regression for exploratory data analysis are also described. Examples are given of nonparametric regression source apportionment applied to data from Hong Kong and southern California.
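The kernel-weighted moving average described above is the Nadaraya-Watson estimator. A minimal sketch with a Gaussian kernel, using invented wind-direction data (a real implementation would treat direction as circular and wrap angles):

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u)

def kernel_regression(x0, xs, ys, h):
    """Nadaraya-Watson estimate of E[y | x = x0] with bandwidth h.

    A weighted moving average of y over x; h controls the smoothing:
    small h tracks local structure, large h averages broadly.
    """
    weights = [gaussian_kernel((x0 - x) / h) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Hypothetical observations: wind direction (degrees) and pollutant level (ug/m3);
# the elevated levels near 90-130 degrees suggest a source in that sector.
directions = [10, 40, 90, 130, 180, 220, 270, 310]
levels = [5.0, 6.1, 14.2, 13.5, 4.8, 5.5, 5.1, 4.9]
smoothed = kernel_regression(110, directions, levels, h=30)
```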

**Authors:** Pentti Paatero, Shelly Eberly, Philip K. Hopke

**Abstract:** Recently, the presence of noisy variables in factor analysis of multivariate data sets was analyzed by Paatero and Hopke (Analytica Chimica Acta 490 (2003) 277-289). It was demonstrated that exceptionally noisy variables make the overall results worse unless these variables are downweighted from their default weights. It was also shown that the default method of performing PCA suffers especially badly from such noisy columns in the matrix. This talk examines the noisy-variables problem in more detail. Rules of thumb are derived for downweighting or omitting variables (columns) that are noisier than the majority of variables. These rules also apply to downweighting or omitting low-concentration rows of the matrix. Special downweighting rules are discussed for the situation in which the source of prime interest is known not to occur in certain samples, e.g. because of wind directions. This talk also examines the problem of rotational ambiguity in the light of new results: there is less ambiguity if the time series factors (columns of the G factor matrix) contain a wide spread of large and small values. Both aspects are important for cost-conscious planning of experiments. One should not pay for obtaining concentrations of elements (silver, say) that rarely exceed their detection limits (DL). If a trade-off is possible, one should rather obtain a small number of high-quality measurements than a large number of near-DL or below-DL measurements. In order to maximize information content and minimize rotational ambiguity, stratified sampling (well known in the social sciences) should be considered. Or, if sampling is low-cost and continuous, one might optimize cost-efficiency by analyzing only a stratified subset of all collected samples.
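The quantity behind such rules of thumb is the per-variable signal-to-noise ratio of Paatero and Hopke (2003), which compares total signal energy with total error energy. A sketch with an illustrative three-way classification; the thresholds and the actual downweighting factors are the subject of the talk, and the values used here are placeholders only:

```python
import math

def signal_to_noise(values, uncertainties):
    """S/N of one variable in the spirit of Paatero & Hopke (2003):
    excess of signal energy over error energy, relative to error energy."""
    sig2 = sum(x * x for x in values)
    err2 = sum(s * s for s in uncertainties)
    return math.sqrt(max(sig2 - err2, 0.0) / err2)

def classify(sn, weak=0.2, strong=2.0):
    """Placeholder rule: 'bad' variables are candidates for omission,
    'weak' ones for downweighting; thresholds here are illustrative."""
    if sn < weak:
        return "bad"
    if sn < strong:
        return "weak"
    return "strong"

# Silver-like variable: concentrations mostly below their detection limits.
ag_sn = signal_to_noise([0.01, 0.02, 0.01, 0.015], [0.05, 0.05, 0.05, 0.05])
# Sulfur-like variable: concentrations far above their uncertainties.
s_sn = signal_to_noise([2.1, 3.4, 1.8, 2.9], [0.1, 0.1, 0.1, 0.1])
```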

**Authors:** Philip K. Hopke, Pentti Paatero, Shelly Eberly

**Abstract:** Positive matrix factorization (PMF) has become widely used in the analysis of environmental data, with an emphasis on airborne particulate matter (PM) composition. This focus supports the environmental planning process that state and local air quality agencies in the United States perform in order to improve air quality. It is necessary to identify the sources of airborne PM and apportion the mass to those sources. To make PMF more widely available, the EPA has undertaken the development of a GUI-based program implementing this explicit least-squares approach to factor analysis. As part of this development, an approach was needed to provide estimates of the uncertainty in the estimated parameters. In factor analysis, error estimation is complicated by the rotational ambiguity that exists in most solutions, as well as by the need to propagate the measurement errors. Bootstrapping is an effective way of estimating the effect of the measurement uncertainty and has previously been implemented in the alternative factor analysis model, Unmix. In order to reflect some of the rotational ambiguity, random variations in the elements of one of the derived matrices are forced into the solution and retained if they do not unduly increase the objective function for the problem. The details of this approach will be discussed in more detail and explored with both simulated and real data.
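The bootstrap idea can be sketched at its simplest as resampling samples (rows) with replacement and refitting; here the refitted "model" is just a mean source contribution with a percentile interval, to keep the sketch self-contained, whereas the EPA tool refits the full factorization on each resample.

```python
import random
from statistics import mean

random.seed(42)

def bootstrap(data, stat, n_boot=2000):
    """Percentile bootstrap: resample rows with replacement, recompute `stat`,
    and return an approximate 95% interval from the replicate distribution."""
    reps = sorted(
        stat([random.choice(data) for _ in data]) for _ in range(n_boot)
    )
    return reps[int(0.025 * n_boot)], reps[int(0.975 * n_boot)]

# Hypothetical daily contributions (ug/m3) apportioned to one source.
contributions = [3.1, 4.7, 2.9, 5.2, 4.1, 3.8, 4.4, 3.5]
lo, hi = bootstrap(contributions, mean)
```

Capturing the rotational part of the uncertainty requires the additional perturbation step described in the abstract; plain resampling, as above, reflects only the measurement variability.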