Pest recognition is a key task in smart agriculture, but it remains challenging in real-world environments. Pest species are highly diverse and often visually similar, and collecting high-quality field data is difficult. These factors limit the performance of traditional vision-based approaches in practical applications.
In a study submitted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026), a research team led by Prof. ZHANG Jie and Prof. XIE Chengjun from the Hefei Institutes of Physical Science of the Chinese Academy of Sciences proposed PestVL-Net, a multimodal pest recognition framework.
PestVL-Net combines visual and textual information for pest identification. On the visual side, the framework focuses on key regions of an image, which helps it capture subtle differences in shape and texture. On the language side, it introduces structured pest descriptions derived from agricultural expertise and multimodal language models. Combined with the visual features, these descriptions help distinguish pest species that differ only subtly in appearance.
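As a rough illustration of why a text description can separate look-alike species, the sketch below scores an image against each species using both a visual prototype and a description embedding, then blends the two scores. It is not the PestVL-Net implementation: the shared embedding space, the function names (`cosine`, `classify`), the `text_weight` parameter, the species labels, and all vectors are assumptions made up for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(image_vec, visual_prototypes, text_vecs, text_weight=0.5):
    """Blend visual and text similarity per species (hypothetical fusion rule)."""
    scores = {}
    for species in visual_prototypes:
        visual_score = cosine(image_vec, visual_prototypes[species])
        text_score = cosine(image_vec, text_vecs[species])
        scores[species] = (1 - text_weight) * visual_score + text_weight * text_score
    best = max(scores, key=scores.get)
    return best, scores

# Toy data: two species whose visual prototypes are identical,
# so visual similarity alone cannot tell them apart.
visual_prototypes = {
    "species_a": [1.0, 0.5, 0.5],
    "species_b": [1.0, 0.5, 0.5],
}
# Made-up description embeddings that emphasize different traits.
text_vecs = {
    "species_a": [0.0, 1.0, 0.0],
    "species_b": [0.0, 0.0, 1.0],
}
image_vec = [1.0, 0.2, 0.8]

best, scores = classify(image_vec, visual_prototypes, text_vecs)
print(best, scores)
```

In this toy setup the visual scores tie, so the description embeddings decide the prediction. A real system would learn both kinds of features rather than use hand-written vectors.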
Experiments on multiple datasets, including newly constructed pest datasets and public benchmarks, showed that PestVL-Net consistently outperformed existing approaches, with accuracy reaching around 88% to 90%. The experiments also confirmed the contribution of each component of the framework.
"We are not only 'seeing' them, but also 'describing' them," said ZHANG Jie, one of the study's authors. The framework provides a practical approach for smart farming, precision agriculture, and crop protection.

Feature Map Visualization of Different Modules (Image by ZHANG Jie)