Figure: NEON sites across the United States.
The tasks included in the pilot consist of:
Crown Delineation: determining the size and location of trees from remote sensing data.
Alignment: relating different representations of the same object in different data sources.
Classification: determining possible values for an unknown variable based on known variables.
For each task, we define a specific set of input and output requirements along with performance evaluation metrics. Below, we present an example of applying data science techniques to the classification task. This approach combines data sources of various origins to enhance classification accuracy.
State-of-the-art species classification techniques use remote sensing data such as hyperspectral imagery and LiDAR, along with limited field or lab data when available. Because field data collection is costly and time consuming, data-driven approaches suffer from issues such as the curse of dimensionality; to circumvent this, rare species are typically eliminated or grouped together. In cases where data-driven approaches fail, expert ecologists have a better chance of identifying species thanks to years of experience and study. In this paper we propose a framework that captures expert knowledge in the form of continuous probabilistic first-order logic rules and uses it to enhance species identification. We show that simple knowledge about elevation alone can increase accuracy by about 8%. In addition, we provide interpretability of how this knowledge relates to the actual data. This process is performed via a novel continuous probabilistic first-order logic model.
The initial expert rules were provided by ecologist Dr. Stephanie Ann Bohlman. They were stated in plain text and later converted to our pipeline notation, which is consistent with Markov Logic Network (MLN) notation for extensibility. As an example, consider the rule provided for Acer rubrum (ACRU). It states that ''ACRU is somewhat unlikely to be above 29 m'', with a probability of about 40% given as a ballpark guess of how well the rule holds. We convert this rule into a format compatible with our MLN framework as follows:
0.4 ∀p ∈ P : species(p, ACRU) → elevation(p) ≤ 29
0.6 ∀p ∈ P : species(p, ACRU) → elevation(p) > 29
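The weighted rule pair above can be sketched as a soft constraint that assigns a likelihood to any (species, elevation) pair. The species code, the 29 m threshold, and the 0.4/0.6 weights come from the rule above; the `ElevationRule` class and its method names are illustrative assumptions, not our actual pipeline code.

```python
from dataclasses import dataclass

@dataclass
class ElevationRule:
    species: str      # species code the rule constrains, e.g. "ACRU"
    threshold: float  # elevation threshold in meters
    p_below: float    # expert's probability that the species occurs at or below the threshold

    def likelihood(self, species: str, elevation: float) -> float:
        """Soft weight of this rule for a (species, elevation) pair."""
        if species != self.species:
            return 1.0  # the rule does not constrain other species
        return self.p_below if elevation <= self.threshold else 1.0 - self.p_below

# The ACRU rule: weight 0.4 for elevation <= 29 m, weight 0.6 otherwise.
acru_rule = ElevationRule(species="ACRU", threshold=29.0, p_below=0.4)

print(acru_rule.likelihood("ACRU", 25.0))  # 0.4
print(acru_rule.likelihood("ACRU", 32.0))  # 0.6
```

Because the weight is a continuous function of elevation rather than a hard truth value, the rule can be multiplied directly into a probabilistic objective.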
The main reason we do not follow the standard MLN inference technique is that MLNs apply to discrete domains, where each variable takes one of a small set of values and the state space is sampled in a random-walk fashion by flipping variables until either a fixed number of iterations has passed or a threshold accuracy is reached. In our case we are dealing with continuous data, so we do not have the freedom of letting the global distribution of random variables set the weight of each rule. Instead, we have the inputs of the global multi-class SVM classifier, the set of probabilistic rules, and the global objective function to be optimized (multi-class classification accuracy). Inference is performed by Bayesian models that relate the spectral data, LiDAR data, and SVM probabilities. This design preserves the interpretability of the results for ecologists while achieving high accuracy; otherwise we could simply use other classifier-fusion approaches to merge the results of multiple classifiers.
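The fusion step described above can be sketched as a simple Bayesian update: the SVM's per-class probabilities are multiplied by each class's rule likelihood at the pixel's elevation and then renormalized. The species codes, probabilities, and likelihood values below are illustrative assumptions; only the ACRU elevation weight (0.4 below 29 m) comes from the stated rule.

```python
def fuse(svm_probs: dict, rule_likelihoods: dict) -> dict:
    """Bayesian update of SVM class probabilities with per-class rule likelihoods."""
    # Classes without an applicable rule keep a neutral likelihood of 1.0.
    posterior = {sp: p * rule_likelihoods.get(sp, 1.0) for sp, p in svm_probs.items()}
    z = sum(posterior.values())
    return {sp: p / z for sp, p in posterior.items()}

# Hypothetical SVM output for one tree crown at 25 m elevation.
svm_probs = {"ACRU": 0.5, "PIPA": 0.3, "QULA": 0.2}
# Rule likelihoods at that elevation: the ACRU rule contributes 0.4 (elevation <= 29 m);
# the other species are unconstrained here.
rule_likelihoods = {"ACRU": 0.4}

posterior = fuse(svm_probs, rule_likelihoods)
```

Here the elevation rule lowers ACRU's posterior relative to the unconstrained classes, which is exactly the kind of per-rule effect that remains inspectable by an ecologist.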