Agentic Vision
Agentic vision refers to AI systems that go beyond passive image understanding to autonomously plan and execute multi-step tasks involving visual perception. A standard VLM answers questions about an image; an agentic vision system can look at a user-interface screenshot, decide which button to click, observe the result, and keep navigating until the task is complete. The shift is from "describe what you see" to "act on what you see."
Agentic vision systems combine VLMs with planning, memory, and tool use. Vision-Language-Action models (VLAs) such as Google's RT-2 and NVIDIA's GR00T directly output robot actions from visual input. GUI agents such as CogAgent and SeeClick use VLMs to navigate computer interfaces by interpreting screenshots. The general architecture is a perception-reasoning-action loop: the VLM perceives the visual scene, a reasoning module (often the same model) plans the next step, and an action module executes it (clicking, typing, moving a robot arm). Memory enables multi-step tasks in which the agent must remember what it has already done.
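The loop can be sketched as follows. This is a minimal illustrative skeleton, not any specific system's implementation: `perceive`, `plan`, and `act` are hypothetical stand-ins for a real VLM, reasoning module, and actuator, and `ToyGUI` is an invented toy environment.

```python
from dataclasses import dataclass, field

@dataclass
class VisionAgent:
    goal: str
    memory: list = field(default_factory=list)  # record of past (observation, action) steps

    def perceive(self, screenshot):
        # Stand-in for a VLM describing the current visual scene.
        return f"screen shows: {screenshot}"

    def plan(self, observation):
        # Stand-in for the reasoning module: choose the next action
        # given the goal, the current observation, and memory.
        if "login" in observation:
            return "click:login_button"
        return "done"

    def act(self, action, env):
        # Stand-in for the action module (clicking, typing, ...).
        return env.step(action)

    def run(self, env, max_steps=5):
        screenshot = env.reset()
        for _ in range(max_steps):
            obs = self.perceive(screenshot)
            action = self.plan(obs)
            self.memory.append((obs, action))  # remember what was done
            if action == "done":
                break
            screenshot = self.act(action, env)
        return self.memory

class ToyGUI:
    # Toy environment: a login page that becomes a home page once clicked.
    def reset(self):
        return "login page"

    def step(self, action):
        return "home page" if action == "click:login_button" else "login page"

agent = VisionAgent(goal="log in")
history = agent.run(ToyGUI())
```

The agent perceives the login page, clicks, perceives the home page, and stops; `history` holds both steps, showing how memory accumulates across the loop.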
Applications are emerging in robotic manipulation (warehouse picking, assembly), autonomous GUI testing (navigating apps by reading the screen), visual inspection workflows (an agent that decides which areas to examine more closely and adjusts camera settings accordingly), and document-processing pipelines in which the agent navigates multi-page documents. CVPR 2026 has a dedicated workshop on visual agents, signaling the field's rapid growth.


