Development of an integrated machine learning model to improve the secondary inorganic aerosol simulation over the Beijing–Tianjin–Hebei region

Ding, Ning / Tang, Xiao / Wu, Huangjian / Kong, Lei / Dao, Xu / Wang, Zifa / Zhu, Jiang

Abstract

Secondary inorganic aerosols (sulfate, nitrate, and ammonium; SNA) are key components of PM2.5 in China. Accurate and seamless SNA concentration data are therefore important for pollution control and environmental health risk studies. However, the spatiotemporal characteristics and changes of SNA remain difficult to estimate accurately, owing to the shortage of SNA datasets with high accuracy and resolution and to the uncertainties in SNA simulations from numerical air-quality models. In this study, we developed an integrated model to improve hourly SNA simulations at 5 km spatial resolution over the Beijing–Tianjin–Hebei (BTH) region by combining random forest (RF) and Light Gradient Boosting Machine (LGBM) models using stacked generalization. Our model fuses three-dimensional (3D) numerical simulations from the Weather Research and Forecasting (WRF) model and the Nested Air Quality Prediction Model System (NAQPMS) with observations from 16 sites of the BTH monitoring network. Three months of data, from January to March 2020, were used to evaluate model performance by cross-validation (CV). The results showed that the integrated model provides more accurate simulations of SNA than the 3D numerical model, with the root mean square error (RMSE) decreasing by 33%, 45%, and 35%; the correlation coefficient (R) increasing by 61%, 28%, and 34%; and the Taylor skill score (TSS) increasing by 331%, 85%, and 65% for sulfate, nitrate, and ammonium, respectively. Moreover, the integrated model achieved better evaluation scores and more accurate spatiotemporal characteristics than a single machine learning (ML) model, especially in heavily polluted areas. This study provides a new approach to improving SNA simulations and reveals the potential of ML models for improving aerosol modeling when observational data are scarce.
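The three evaluation criteria named in the abstract (RMSE, R, and TSS) can be illustrated with a short sketch. The abstract does not give the formulas, so the Taylor skill score below assumes the common form from Taylor (2001), S = 4(1 + R) / [(σ + 1/σ)² (1 + R0)], where σ is the ratio of simulated to observed standard deviation and R0 (the maximum attainable correlation) is assumed to be 1. The function names are hypothetical and the definition used in the paper may differ.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    # Population standard deviation.
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def rmse(sim, obs):
    """Root mean square error between simulated and observed values."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs))


def pearson_r(sim, obs):
    """Pearson correlation coefficient between simulation and observation."""
    ms, mo = _mean(sim), _mean(obs)
    cov = sum((s - ms) * (o - mo) for s, o in zip(sim, obs))
    var_s = sum((s - ms) ** 2 for s in sim)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / math.sqrt(var_s * var_o)


def taylor_skill_score(sim, obs, r0=1.0):
    """Taylor skill score in the assumed Taylor (2001) form; r0 = 1 is an assumption."""
    sigma_ratio = _std(sim) / _std(obs)
    r = pearson_r(sim, obs)
    return 4.0 * (1.0 + r) / ((sigma_ratio + 1.0 / sigma_ratio) ** 2 * (1.0 + r0))


def relative_change_percent(baseline, improved):
    """Percentage change of a metric relative to its baseline value."""
    return 100.0 * (improved - baseline) / baseline
```

Under these definitions, a simulation identical to the observations gives an RMSE of 0 and both R and TSS of 1. The percentage changes quoted in the abstract correspond to `relative_change_percent` applied to each metric, with the 3D numerical model as the baseline.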

Highlights

• An integrated machine learning model was developed by combining RF and LGBM using stacked generalization.
• Meteorological and pollutant simulations, emission and topographic data, and surface observations were used in the integrated model.
• The model improves the hourly SNA simulations over the Beijing–Tianjin–Hebei (BTH) region.
• The integrated model performed better than a single model in improving SNA simulation.