VISTA - Visual Intelligence State-Tree Architecture

ROS 2 VLM-to-Behavior-Tree robotics middleware for local perception and reactive control.

Status: Work in progress

VISTA is a ROS 2 framework for connecting on-device Vision-Language Model perception with deterministic Behavior Tree control. The repository describes a decoupled two-loop architecture: a slower perception loop analyzes camera frames through a local Gemma VLM served by llama.cpp, while a faster Behavior Tree loop consumes structured state and selects robot behaviors without blocking on model inference.

The project is ideated and maintained by Andrea Ricciardi. My contribution is limited to the Behavior Tree side of the system: the decision structure, custom BT nodes, blackboard integration, and the runtime link between VLM state and BT control logic.

Technical Stack

ROS 2 Jazzy, C++17, BehaviorTree.CPP v4, rclcpp components, OpenCV, cv_bridge, Gemma VLM, llama.cpp, YAML

System Architecture

VISTA runs two ROS 2 composable nodes in a multi-threaded component container. GemmaPerceptionNode subscribes to a camera image topic, encodes frames as JPEG, calls a local OpenAI-compatible llama.cpp chat-completions endpoint, parses JSON perception output, and publishes a custom /vlm_data message. BehaviorTreeNode subscribes to that topic, updates a BehaviorTree.CPP blackboard, and ticks the tree at the configured control rate.

Camera Image: Configurable ROS 2 image topic.
GemmaPerceptionNode: JPEG encoding, prompt request, JSON parsing.
/vlm_data: Scene state, hazard level, target visibility, bbox, confidence.
BT Blackboard: Latest VLM state copied into BehaviorTree.CPP memory.
BehaviorTreeNode: Fast periodic ticks of the configured BT XML.
Robot Behaviors: Recovery, target pursuit, or default behavior.
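
The hand-off between the slow and the fast loop can be pictured as a single-slot mailbox. The sketch below is a plain C++17 illustration and not the repository's code: PerceptionState and LatestState are hypothetical names, the field types are assumed, and in VISTA itself the hand-off goes through the /vlm_data topic and the BT blackboard.

```cpp
#include <mutex>
#include <optional>
#include <string>

// Hypothetical subset of the perception result; field types are assumed.
struct PerceptionState {
  std::string scene_state;
  int hazard_level = 0;
  bool target_visible = false;
};

// Single-slot mailbox: the slow loop overwrites, the fast loop copies.
class LatestState {
 public:
  // Called by the perception loop whenever a new result arrives.
  void publish(const PerceptionState& state) {
    std::lock_guard<std::mutex> lock(mutex_);
    latest_ = state;
  }

  // Called by the control loop on every tick. Returns std::nullopt
  // until the first result has been published.
  std::optional<PerceptionState> snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return latest_;
  }

 private:
  mutable std::mutex mutex_;
  std::optional<PerceptionState> latest_;
};
```

The lock is held only for the duration of a copy, so a reader waits at most for that copy and never for model inference. Between two perception results the fast loop simply keeps seeing the previous state.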

Runtime parameters live in config/vista_params.yaml. The same file configures the VLM host and timeouts, camera and output topics, JPEG quality, the system prompt, the BT tick rate, and the BT XML used by the control node.

Behavior Tree Contribution

The current Behavior Tree uses a root Fallback node to encode priority-based decision logic. The highest-priority branch checks whether the VLM-reported hazard_level meets the configured threshold and, if so, runs RecoveryBehavior. If the hazard branch fails, the target branch checks target_visible and, if it is set, runs PursueTarget with the VLM-provided bounding box. If neither branch succeeds, DefaultBehavior runs.
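
The priority logic can be modelled in a few lines of plain C++. This is a simplified stand-in and not the BehaviorTree.CPP API: Status, Node, TreeInputs, and select_behavior are hypothetical names, hazard_level is assumed to be an integer, "meets the threshold" is read as greater than or equal, and the RUNNING status of real BT nodes is left out.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

enum class Status { Success, Failure };
using Node = std::function<Status()>;

// Sequence: stops at the first failing child, succeeds if all succeed.
Node sequence(std::vector<Node> children) {
  return [children = std::move(children)]() {
    for (const auto& child : children) {
      if (child() == Status::Failure) return Status::Failure;
    }
    return Status::Success;
  };
}

// Fallback: stops at the first succeeding child, fails if all fail.
Node fallback(std::vector<Node> children) {
  return [children = std::move(children)]() {
    for (const auto& child : children) {
      if (child() == Status::Success) return Status::Success;
    }
    return Status::Failure;
  };
}

// Hypothetical subset of the VLM state read by the conditions.
struct TreeInputs {
  int hazard_level = 0;
  bool target_visible = false;
};

// Builds the priority tree, ticks it once, and returns the name of the
// action leaf that ran.
std::string select_behavior(const TreeInputs& in, int hazard_threshold) {
  std::string selected;
  auto condition = [](std::function<bool()> pred) {
    return Node([pred] { return pred() ? Status::Success : Status::Failure; });
  };
  auto action = [&selected](std::string name) {
    return Node([&selected, name] {
      selected = name;
      return Status::Success;
    });
  };

  Node root = fallback({
      sequence({condition([&] { return in.hazard_level >= hazard_threshold; }),
                action("RecoveryBehavior")}),
      sequence({condition([&] { return in.target_visible; }),
                action("PursueTarget")}),
      action("DefaultBehavior"),
  });
  root();
  return selected;
}
```

Because the Fallback tries its children in order, a hazard always wins over a visible target, and DefaultBehavior runs only when both condition checks fail.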

Custom Conditions

IsHazardous and IsTargetVisible read structured VLM state from BT input ports and decide whether higher-priority branches should execute.

Custom Actions

RecoveryBehavior, PursueTarget, and DefaultBehavior define the current action leaves. The repository implementation logs the selected behavior; it does not provide a full robot actuation stack.

Blackboard Bridge

The ROS subscriber callback writes scene_state, hazard_level, target_visible, target_bbox, confidence, and description into the BT blackboard.
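
As a rough illustration of this bridge, the sketch below copies each message field into a string-keyed store. It is a plain C++17 stand-in: VlmData, Value, Blackboard, and write_to_blackboard are hypothetical names, and the field types and bounding-box layout are assumptions. In BehaviorTree.CPP the analogous write would go through the blackboard's set method rather than a std::map.

```cpp
#include <array>
#include <map>
#include <string>
#include <variant>

// Hypothetical mirror of the /vlm_data message; field types are assumed.
struct VlmData {
  std::string scene_state;
  int hazard_level = 0;
  bool target_visible = false;
  std::array<int, 4> target_bbox{};  // assumed layout: x, y, width, height
  double confidence = 0.0;
  std::string description;
};

// Stand-in for the BT blackboard: a typed key-value store.
using Value =
    std::variant<std::string, int, bool, std::array<int, 4>, double>;
using Blackboard = std::map<std::string, Value>;

// What a subscriber callback might do: copy every field under the key
// that the tree's input ports refer to.
void write_to_blackboard(const VlmData& msg, Blackboard& bb) {
  bb["scene_state"] = msg.scene_state;
  bb["hazard_level"] = msg.hazard_level;
  bb["target_visible"] = msg.target_visible;
  bb["target_bbox"] = msg.target_bbox;
  bb["confidence"] = msg.confidence;
  bb["description"] = msg.description;
}
```

Keeping the keys identical to the message field names means the tree's ports can refer to the same names the perception prompt asks the model to produce.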

Configurable XML

The BT structure is read from the YAML configuration as BehaviorTree.CPP format 4 XML, so the decision tree can be adjusted from configuration.
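
A tree of the shape described above might look like the following format 4 XML, shown here as a C++ raw string. Only the node names come from the text; the tree ID, the port names, the blackboard keys in braces, and the threshold value are assumptions made for illustration, and kTreeXml is a hypothetical name.

```cpp
#include <string>

// Hypothetical tree definition. Node IDs follow the names used in the
// text; tree ID, port names, and the threshold value are assumed.
const std::string kTreeXml = R"(
<root BTCPP_format="4">
  <BehaviorTree ID="MainTree">
    <Fallback>
      <Sequence>
        <IsHazardous hazard_level="{hazard_level}" threshold="2"/>
        <RecoveryBehavior/>
      </Sequence>
      <Sequence>
        <IsTargetVisible target_visible="{target_visible}"/>
        <PursueTarget target_bbox="{target_bbox}"/>
      </Sequence>
      <DefaultBehavior/>
    </Fallback>
  </BehaviorTree>
</root>
)";
```

With BehaviorTree.CPP available, a string like this would typically be handed to BehaviorTreeFactory::createTreeFromText after the custom node types have been registered.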

Runtime Pipeline

The repository includes a ROS 2 launch file that starts both composable nodes inside component_container_mt with intra-process communication enabled. The provided tmux setup can start the local llama.cpp server, launch VISTA, and optionally play a rosbag configured in tmux/session.yaml.

The VLM request uses the configured system prompt to request structured JSON with scene state, hazard level, target visibility, target bounding box, confidence, and natural-language description. That structured message is the contract between the slow perception loop and the fast BT loop.

My Role

My role is focused on the Behavior Tree part of VISTA. In portfolio terms, this means presenting my work as BT architecture and implementation support, not as authorship of the full VISTA project or the VLM perception framework. The project repository identifies Andrea Ricciardi as maintainer and copyright holder, and I do not present myself as the project author.

Credits & Attribution

VISTA - Visual Intelligence State-Tree Architecture is ideated and maintained by Andrea Ricciardi. The repository package metadata lists Andrea Ricciardi as maintainer, and the license/notice files attribute copyright and software development to Andrea Ricciardi.

Third-party components referenced by the repository include Gemma, BehaviorTree.CPP, llama.cpp, cpp-httplib, and nlohmann/json, under their respective licenses.