July 24, 2016

Applying Big Data Technology To Remote Sensing For Species Identification

Understanding the processes governing ecological systems from local to global scales is crucial to determining how they will respond to and influence environmental, economic, and geopolitical issues such as climate change, invasive species, fire hazards, and land-use change. To collect the data necessary to model ecological processes across scales, construction of the National Ecological Observatory Network (NEON) began in 2012 to conduct intensive monitoring and measurement across the United States. Hundreds of ecological and environmental data products, ranging from small local samples to large-scale remote sensing from aircraft, will be collected across more than 81 observatory sites. The volume, velocity, and variety of data generated by this effort are far greater than anything currently being collected or analyzed by ecologists. Maximizing the knowledge gained from these data will therefore require bridging the gap between disciplines including ecology, computer science, statistics, and data science. To help develop interdisciplinary approaches to working with and understanding these data, we propose an applied, multidisciplinary, multi-modal big data challenge to the NIST Data Science Evaluation (DSE) series as a stepping stone, with an initial focus on combining airborne remote sensing data and field measurements of forests to characterize the structure of the plant community at large scales.

NEON sites across the United States

The tasks included in the pilot consist of:
Crown delineation: determining the size and location of trees from remote sensing data.
Alignment: relating different representations of the same object in different data sources.
Classification: determining possible values for an unknown variable based on known variables.

For each task, we define a specific set of input/output requirements along with performance evaluation metrics. Below is an example of applying data science techniques to the classification task; the approach draws on data sources from various origins to enhance classification accuracy.
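As an illustration of the kind of performance evaluation metric involved, here is a minimal sketch of intersection-over-union (IoU) scoring for the crown delineation task. The axis-aligned box representation of a crown and the 0.5 matching threshold are our assumptions for illustration, not the official DSE metric:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def crown_delineation_score(predicted, truth, threshold=0.5):
    """Fraction of ground-truth crowns matched by some prediction with IoU >= threshold."""
    matched = sum(1 for t in truth if any(iou(p, t) >= threshold for p in predicted))
    return matched / len(truth)
```

A score of 1.0 would mean every ground-truth crown was recovered; a stricter metric would also penalize spurious predictions.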

State-of-the-art species classification techniques use remote sensing data such as hyperspectral imagery and LiDAR, along with limited field or lab data where available. Field data collection is a costly and time-consuming process, so data-driven approaches suffer from issues such as the curse of dimensionality. To work around this, rare species are often eliminated or grouped together. Where data-driven approaches fail, expert ecologists have a better chance of identifying species, thanks to years of experience and study. In this work we propose a framework to capture expert knowledge in the form of continuous probabilistic first-order logic rules and use it to enhance species identification. We show that using simple knowledge about elevation alone can increase accuracy by about 8%. In addition, we provide an interpretation of how this knowledge relates to the actual data. This is done through a novel continuous probabilistic first-order logic model.
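The core idea of enhancing a classifier with an expert rule can be sketched as follows. This is a simplified illustration, not the paper's actual model: the per-pixel SVM probabilities are reweighted by the expert's probability that the species occurs above or below a height threshold, and we make the simplifying assumption that the rule says nothing about the other classes (their likelihoods are left uniform). The function and parameter names are ours:

```python
import numpy as np

def apply_elevation_rule(svm_probs, species, heights, rule_species="ACRU",
                         threshold=29.0, p_above=0.4):
    """Reweight per-sample SVM class probabilities with an expert elevation rule.

    svm_probs: (n_samples, n_classes) probabilities from a multi-class SVM.
    species:   list of class labels, e.g. ["ACRU", "OTHER", ...].
    heights:   (n_samples,) canopy heights in metres (e.g. from LiDAR).
    The rule encodes P(height > threshold | rule_species) = p_above.
    """
    idx = species.index(rule_species)
    probs = svm_probs.copy()
    above = heights > threshold
    # Multiply the constrained class by the rule's likelihood for the observed
    # height band; other classes are (by assumption) unaffected by the rule.
    probs[above, idx] *= p_above
    probs[~above, idx] *= (1.0 - p_above)
    # Renormalise each row so the class probabilities sum to one again.
    return probs / probs.sum(axis=1, keepdims=True)
```

For a tall tree (above the threshold), the rule pulls probability mass away from ACRU and toward the competing classes, mirroring the expert's intuition.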

The initial expert rules were provided by ecologist Dr. Stephanie Ann Bohlman. They were stated in plain text as below and later converted to our pipeline notation, which is consistent with Markov Logic Network (MLN) notation for extensibility. As an example, consider the rule provided for Acer rubrum (ACRU). It was stated that ''ACRU is somewhat unlikely to be above 29 m'', with a probability of about 40% given as a ballpark estimate of how likely that height is to be exceeded. We convert this rule into a format compatible with our MLN framework as follows:

0.4  ∀p ∈ P, ∀s ∈ S : species(p, ACRU) ⇒ elevation(p) > 29
0.6  ∀p ∈ P, ∀s ∈ S : species(p, ACRU) ⇒ elevation(p) ≤ 29
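A weighted rule pair like this can be represented programmatically. The following is a minimal sketch of one possible representation (the data structure and names are ours, not the pipeline's actual data model), encoding the 40%/60% split from the ACRU rule above:

```python
from dataclasses import dataclass

@dataclass
class WeightedRule:
    weight: float     # probability mass the expert assigns to this branch
    species: str      # species code the rule constrains, e.g. "ACRU"
    threshold: float  # height/elevation threshold in metres
    above: bool       # True: branch fires when the value exceeds the threshold

    def satisfied(self, value):
        """Does an observed height/elevation satisfy this branch's condition?"""
        return value > self.threshold if self.above else value <= self.threshold

# The ACRU rule pair from the text: ~40% above 29 m, ~60% at or below.
acru_rules = [WeightedRule(0.4, "ACRU", 29.0, above=True),
              WeightedRule(0.6, "ACRU", 29.0, above=False)]

def rule_likelihood(rules, value):
    """Sum of weights of branches whose condition the observation satisfies."""
    return sum(r.weight for r in rules if r.satisfied(value))
```

Because the two branches partition the height axis, exactly one branch fires for any observation, and `rule_likelihood` returns the expert's probability for the observed height band.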

The main reason we do not follow the standard MLN technique is that MLNs apply to discrete domains, where each variable takes one of a small set of values and the state space is sampled in a random-walk fashion by flipping variables until either a fixed number of iterations has passed or a target accuracy is reached. In our case the data are continuous, and we do not have the freedom of letting the global distribution of random variables set the weight of each rule. Instead, we work with three components: the per-class probabilities produced by the global multi-class SVM, the set of probabilistic expert rules, and the global objective function to be optimized (multi-class classification accuracy). Inference is performed with Bayesian models that relate the spectral data, the LiDAR data, and the SVM probabilities. This design maintains the interpretability of the results for ecologists while achieving high accuracy; otherwise we could simply use standard classifier-fusion approaches to merge the outputs of multiple classifiers.
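The Bayesian relation between the SVM probabilities and a LiDAR-derived measurement can be sketched as a simple posterior update: treat the SVM output as a prior over species and a per-class height model as the likelihood. This is an illustration of the general idea only; the Gaussian height models, species codes, and function names here are our assumptions, not the actual models in the pipeline:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bayes_fuse(svm_probs, height, height_models):
    """Fuse SVM class probabilities with a LiDAR height observation.

    svm_probs:     {class: prior probability} from the multi-class SVM.
    height:        observed LiDAR canopy height in metres.
    height_models: {class: (mean, std)} Gaussian height model per class.
    Returns the normalised posterior over classes.
    """
    # Posterior ∝ prior (SVM) × likelihood (height model for that class).
    post = {c: p * gaussian_pdf(height, *height_models[c])
            for c, p in svm_probs.items()}
    z = sum(post.values())
    return {c: v / z for c, v in post.items()}
```

With two equally likely classes whose height models differ, an observed height near one class's mean shifts the posterior toward that class, which is the interpretable behaviour the expert rules are meant to provide.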
