What is Spatial Cross-Validation?
Spatial Cross-Validation (SCV) is an essential technique in spatial analysis, species distribution modeling, landscape ecology, and other fields that work with geospatial data. Its main goal is to assess the performance of predictive models while mitigating the issue of spatial autocorrelation, the tendency of observations that are close in space to be more similar to each other than to those farther away.
Why is Spatial Cross-Validation Important?
Predictive models built on spatial data can be affected by autocorrelation, leading to an overestimation of model performance. Traditional cross-validation methods, such as random k-fold, can inflate estimates of predictive ability because the training and test sets may share the same spatial patterns. To prevent this, spatial cross-validation divides the data into spatial blocks or clusters, so that the evaluation better reflects the model's ability to generalize to new regions.
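To see why this matters, here is a small self-contained toy sketch in R (everything in it is simulated and the names are invented for the illustration): the response follows a smooth spatial trend, so the error estimated with random folds typically comes out lower, i.e. more optimistic, than the error estimated with crude spatial blocks.

```r
# Toy illustration: random k-fold vs. crude spatial blocking on
# spatially structured, simulated data
set.seed(42)
n   <- 500
toy <- data.frame(x = runif(n), y = runif(n))
toy$response <- sin(4 * toy$x) + cos(4 * toy$y) + rnorm(n, sd = 0.3)

mse_by_fold <- function(fold_id) {
  mean(sapply(sort(unique(fold_id)), function(i) {
    fit  <- lm(response ~ x + y, data = toy[fold_id != i, ])
    pred <- predict(fit, toy[fold_id == i, ])
    mean((toy$response[fold_id == i] - pred)^2)
  }))
}

random_folds <- sample(rep(1:5, length.out = n))  # random fold assignment
block_folds  <- cut(toy$x, 5, labels = FALSE)     # crude spatial blocks along x
mse_by_fold(random_folds)  # usually the lower (more optimistic) estimate
mse_by_fold(block_folds)   # usually higher: closer to performance in new regions
```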
Types of Spatial Cross-Validation
There are several approaches to implementing spatial cross-validation:
- Block Cross-Validation: Data is divided into non-overlapping spatial blocks. The model is trained on some blocks and tested on the remaining ones. This approach is practical when data is spatially clustered and has a high degree of autocorrelation. Example in R:
```r
library(blockCV)
library(sf)

# Load example data
data <- st_read("your_spatial_data.shp")

# Define spatial blocks; theRange is the block size in metres
# (example value; adjust it to the autocorrelation range of your data)
blocks <- spatialBlock(speciesData = data,
                       species = "species_column",
                       theRange = 50000,
                       k = 5)

# Cross-validate a simple linear model across the 5 spatial folds
results <- lapply(1:5, function(i) {
  train_data <- data[blocks$foldID != i, ]
  test_data  <- data[blocks$foldID == i, ]
  model <- lm(response_variable ~ predictor, data = train_data)
  pred  <- predict(model, test_data)
  mean((test_data$response_variable - pred)^2)  # fold MSE
})

# Average MSE across the spatial folds
mean(unlist(results))
```
For the moment, I am using this one, which has worked well for me in some extrapolation analyses: https://github.com/VanBejA/Spatial_Cross_Validation
- Leave-One-Out Spatial Cross-Validation (LOO-SCV): A single sampling point or a small region is left out at a time to evaluate the model. This approach is most appropriate for small or sparse datasets and for assessing the influence of individual observations on the model. Example in R:
```r
library(caret)
library(sf)

set.seed(123)

# Plain leave-one-out resampling in caret; drop the sf geometry column so
# only the attribute table reaches the model. A spatially buffered LOO
# would also exclude points near each held-out observation (see the
# buffering example below).
loo_control <- trainControl(method = "LOOCV")
model_loo <- train(response_variable ~ ., data = st_drop_geometry(data),
                   method = "lm", trControl = loo_control)
model_loo$results
```
- Spatial k-fold Cross-Validation: This method is similar to traditional k-fold but ensures that the data partitions are spatially independent. It is recommended when you need to balance subset size with spatial independence without excessively fragmenting the data. Example in R:
```r
library(caret)

set.seed(123)

# caret's method = "cv" on its own creates random folds; to make them
# spatial, pass pre-computed fold memberships through 'index' (the
# training-row indices for each fold), here taken from the 'blocks'
# object created in the block cross-validation example above
folds <- lapply(1:5, function(i) which(blocks$foldID != i))
spatial_kfold_control <- trainControl(method = "cv", number = 5, index = folds)

model_kfold <- train(response_variable ~ ., data = st_drop_geometry(data),
                     method = "lm", trControl = spatial_kfold_control)
model_kfold$results
```
- Buffer Cross-Validation: Excludes observations within a buffer around each test point from the training set, preventing overestimation due to spatial proximity. It is recommended when there is substantial spatial dependence between nearby observations and stricter validation is needed. Example in R:
```r
library(blockCV)

# One fold per observation: each held-out point is tested with a training
# set that excludes all points within the buffer around it; theRange is
# the buffer radius in metres
buffer_cv <- buffering(speciesData = data,
                       species = "species_column",
                       theRange = 1000)
buffer_folds <- buffer_cv$folds
```
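As a rough usage sketch (assuming the same data and column names as in the block example above, and that each element of the folds list stores the training indices first and the test indices second, as in blockCV fold objects):

```r
# Fit and evaluate the model once per buffered fold: f[[1]] holds the
# training rows, f[[2]] the held-out row(s)
errors <- sapply(buffer_folds, function(f) {
  model <- lm(response_variable ~ predictor, data = data[f[[1]], ])
  pred  <- predict(model, data[f[[2]], , drop = FALSE])
  mean((data$response_variable[f[[2]]] - pred)^2)
})
sqrt(mean(errors))  # RMSE across buffered folds
```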
Criteria for Selecting a Spatial Cross-Validation Approach
The choice of spatial cross-validation method depends on several factors, including:
- Level of spatial autocorrelation: Block or buffer validation is recommended if autocorrelation is high.
- Data density: For sparse data, buffer or LOO validation may be more suitable, while for dense data, spatial k-fold is a better option.
- Dataset size: For small datasets, LOO can maximize the use of available information, while for large datasets, block or buffer validation may be more feasible.
- Study objective: If the goal is to predict in unsampled regions, block validation is usually the best choice, while buffer or LOO may be better for evaluating local variable influence.
Steps to Implement Spatial Cross-Validation
- Explore Spatial Data: Analyze the data’s spatial distribution and quantify the level of autocorrelation (e.g., using Moran’s I; see the sketch after this list).
- Define a Spatial Partitioning Method: Based on the study’s nature (blocks, buffer, leave-one-out, etc.), select the appropriate strategy.
- Split Data into Spatial Subsets: Implement data separation, ensuring that training and validation sets are spatially independent.
- Train and Evaluate the Model: Build the model using training data and assess its performance on validation data.
- Repeat the Process and Compute Evaluation Metrics: Obtain error statistics such as RMSE, AUC, or R² in each iteration to evaluate model stability.
- Compare with Traditional Cross-Validation: Assess whether SCV reduces bias in estimating model predictive performance.
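For the first step, here is a minimal sketch of an autocorrelation check, assuming the point data and response column used in the examples above (the choice of k = 8 neighbours is arbitrary and should be adapted to your sampling design):

```r
library(spdep)
library(sf)

# Build a neighbourhood and row-standardised spatial weights from the
# point coordinates, then test the response for spatial autocorrelation
coords <- st_coordinates(st_centroid(st_geometry(data)))
nb     <- knn2nb(knearneigh(coords, k = 8))
lw     <- nb2listw(nb, style = "W")

moran.test(data$response_variable, lw)  # significant positive I indicates autocorrelation
```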
Conclusion
Spatial cross-validation is a crucial tool for improving the robustness and reliability of predictive models in ecology and other spatial sciences. Its application helps mitigate the effects of spatial autocorrelation and provides a more realistic evaluation of the model’s ability to make predictions in unsampled regions. Proper implementation of SCV is key to ensuring valid and useful inferences in studies relying on spatial data.
References
- Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., … & Dormann, C. F. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8), 913-929.
- Valavi, R., Elith, J., Lahoz-Monfort, J. J., & Guillera-Arroita, G. (2019). BlockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods in Ecology and Evolution, 10(2), 225-232.
- Hijmans, R. J. (2012). Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null model. Ecology, 93(3), 679-688.
I hope this guide helps you understand and apply spatial cross-validation in your scientific projects!