Vision Fallback¶
⚠️ Partial 🔒 Internal 🤖 Android Only
Vision fallback uses Claude’s vision API to help locate UI elements when traditional element finding methods fail.
See the Status Glossary for chip definitions.
Current Implementation¶
Status¶
Vision fallback is an internal feature that is:
- Disabled by default to avoid unexpected API costs
- Only available when constructing TapOnElement with custom vision configuration
- Not exposed via MCP server or CLI by default
- Currently integrated into tapOn only (invoked after polling times out)
- Android screenshots only (iOS not yet implemented)
How It Works¶
When element finding fails after retries, TapOnElement follows this flow:
flowchart LR
A["Element finding retries
exhausted"] --> B["Screenshot capture
(~100-200ms)"];
B --> C["Claude vision analysis
(~2-5s)"];
C --> D{"Confidence high?"};
D -->|"yes"| E["Return alternative selectors
or navigation instructions"];
D -->|"no"| F["Return detailed error
with screen context"];
classDef decision fill:#CC2200,stroke-width:0px,color:white;
classDef logic fill:#525FE1,stroke-width:0px,color:white;
classDef result stroke-width:0px;
class A,E,F result;
class B,C logic;
class D decision;
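The decision step in the diagram can be sketched as follows. This is an illustrative sketch only: VisionAnalysis and formatFallbackResult are hypothetical names, not the actual implementation.

```typescript
// Hypothetical sketch of the "Confidence high?" branch in the flow above.
interface VisionAnalysis {
  confidence: 'high' | 'medium' | 'low';
  suggestions: string[];   // alternative selectors or navigation steps
  screenSummary: string;   // description of what is currently visible
}

function formatFallbackResult(search: string, analysis: VisionAnalysis): string {
  if (analysis.confidence === 'high' && analysis.suggestions.length > 0) {
    // High confidence: surface alternative selectors or navigation instructions
    return [
      'Element not found. AI suggests trying:',
      ...analysis.suggestions.map((s) => `- ${s}`),
    ].join('\n');
  }
  // Low confidence: return a detailed error with screen context instead
  return `Element not found. ${analysis.screenSummary} ` +
    `The requested '${search}' is not visible on this screen.`;
}
```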
Configuration¶
const tapTool = new TapOnElement(
device,
adb,
axe,
{
enabled: true, // Enable vision fallback
provider: 'claude', // Only Claude supported currently
confidenceThreshold: 'high', // Reserved for future gating
maxCostUsd: 1.0, // Warning threshold (does not block)
cacheResults: true, // Cache to avoid repeated calls
cacheTtlMinutes: 60 // Cache for 60 minutes
}
);
Note: The MCP server constructs TapOnElement with the default config (enabled: false), so vision fallback is not available through MCP unless you modify the server code.
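One way to picture the default-off behavior is a small defaults-merge helper. The field names mirror the configuration example above, but resolveVisionConfig and DEFAULT_VISION_CONFIG are hypothetical, shown here only to illustrate how overrides would layer on top of the disabled default.

```typescript
// Illustrative sketch: vision fallback is off unless explicitly opted in.
interface VisionConfig {
  enabled: boolean;
  provider: 'claude';
  confidenceThreshold: 'high' | 'medium' | 'low';
  maxCostUsd: number;
  cacheResults: boolean;
  cacheTtlMinutes: number;
}

const DEFAULT_VISION_CONFIG: VisionConfig = {
  enabled: false,              // disabled by default (MCP server uses this)
  provider: 'claude',
  confidenceThreshold: 'high',
  maxCostUsd: 1.0,
  cacheResults: true,
  cacheTtlMinutes: 60,
};

function resolveVisionConfig(overrides: Partial<VisionConfig> = {}): VisionConfig {
  return { ...DEFAULT_VISION_CONFIG, ...overrides };
}
```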
Example Scenarios¶
Element Text Changed:
Input: tapOn({ text: "Login" })
Traditional Error:
Element not found with provided text 'Login'
With Vision Fallback:
Element not found. AI suggests trying:
- text: "Sign In" (Text label changed from 'Login' to 'Sign In')
(Cost: $0.0234, Confidence: high)
Element Requires Navigation:
Input: tapOn({ text: "Advanced Settings" })
With Vision Fallback:
Element not found, but AI suggests these steps:
1. Scroll down in the settings menu to reveal more options
2. Look for "Advanced Settings" in the newly visible section
(Cost: $0.0312, Confidence: high)
Element Doesn’t Exist:
Input: tapOn({ text: "Nonexistent Button" })
With Vision Fallback:
Element not found. The current screen shows a login form with
'Username', 'Password', and 'Sign In' elements. The requested
'Nonexistent Button' is not visible on this screen.
(Cost: $0.0198, Confidence: high)
Cost and Performance¶
Typical costs per failed search:
- Input tokens: screenshot + view hierarchy + prompt (~5,000-10,000 tokens)
- Output tokens: analysis response (~500-1,000 tokens)
- Cost: $0.02-0.05 per vision fallback call
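As a sanity check on those numbers, here is the arithmetic under assumed Sonnet-class pricing (roughly $3 per million input tokens and $15 per million output tokens at the time of writing; substitute your model's actual rates):

```typescript
// Back-of-the-envelope cost for one fallback call. The per-token rates
// are assumptions, not values taken from this project's code.
const INPUT_USD_PER_MTOK = 3.0;
const OUTPUT_USD_PER_MTOK = 15.0;

function estimateCallCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * INPUT_USD_PER_MTOK +
         (outputTokens / 1_000_000) * OUTPUT_USD_PER_MTOK;
}

// ~7,500 input tokens (screenshot + hierarchy + prompt) and ~750 output
// tokens land inside the $0.02-0.05 range quoted above.
```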
Performance:
- Screenshot capture: ~100-200ms
- Claude API call: ~2-5 seconds
- Total: ~2-5 seconds (only when traditional methods fail)
Caching:
- Cache key: screenshot path + search criteria (text/resourceId)
- TTL: 60 minutes (configurable)
- Benefit: instant response for repeated failures
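The caching scheme above can be sketched as a hashed key plus a TTL map. cacheKey and TtlCache are illustrative names, a minimal sketch rather than the project's real cache:

```typescript
import { createHash } from 'node:crypto';

// Key on screenshot path + search criteria, as described above.
function cacheKey(
  screenshotPath: string,
  criteria: { text?: string; resourceId?: string },
): string {
  return createHash('sha256')
    .update(`${screenshotPath}|${criteria.text ?? ''}|${criteria.resourceId ?? ''}`)
    .digest('hex');
}

// Entries expire after a configurable TTL (60 minutes by default here).
class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  constructor(private ttlMinutes: number = 60) {}

  get(key: string, now: number = Date.now()): T | undefined {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= now) return undefined;
    return entry.value;
  }

  set(key: string, value: T, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMinutes * 60_000 });
  }
}
```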
API Key Setup¶
Vision fallback requires an Anthropic API key:
export ANTHROPIC_API_KEY=sk-ant-xxxxx
Get an API key at: https://console.anthropic.com/
Limitations¶
Current Limitations:
1. tapOn only: not integrated into other tools (swipeOn, scrollUntil, etc.)
2. Android only: iOS screenshot capture not implemented
3. No auto-retry: suggestions are informational; the user must manually retry with the suggested selectors
4. Not in MCP: requires custom TapOnElement construction; not available via the MCP server by default
When Vision Fallback Won't Help:
- The element truly doesn't exist on screen
- Screenshot quality is poor
- Custom/non-standard UI elements
- Dynamic content that changes rapidly
Proposed Future Architecture¶
🚧 Design Only
The following hybrid vision approach is not implemented; it is a design proposal for a future enhancement.
Design Principles¶
- Last Resort: only activates after all existing fallback mechanisms are exhausted
- Cost Conscious: prefer local models (~80% of cases); escalate to Claude only when needed
- High Confidence: only suggest navigation steps when confidence is high
- Transparent: clear error messages when fallback cannot help
- Fast & Offline: local models provide <500ms responses without internet access
Proposed Hybrid Approach¶
When implemented, add a Tier 1 local model layer before Claude:
- Tier 1: Fast, free local models (Florence-2, PaddleOCR) for common cases (~80%)
  - OCR + object detection + element descriptions
  - <500ms response time, $0 cost
  - Handles text extraction and simple element matching
- Tier 2: Claude vision API for complex cases (~15%)
  - Advanced navigation and spatial reasoning
  - 2-5s response time, $0.02-0.05 cost
  - Optional Set-of-Mark preprocessing
Expected Distribution:
- 80% resolved by Tier 1 (local models find alternative selectors)
- 15% resolved by Tier 2 (Claude provides navigation)
- 5% genuine failures (element truly doesn't exist)
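The proposed escalation order could be routed roughly as below. Since the hybrid design is not implemented, routeFallback and both confidence inputs are hypothetical:

```typescript
// Sketch of the proposed tier routing: try the local model first,
// escalate to Claude only when local confidence is low, and report a
// genuine failure when neither tier is confident.
type TierResult = { tier: 'tier1' | 'tier2' | 'failed' };

function routeFallback(
  tier1Confidence: number,                        // 0-1 from the local model
  tier2Confidence: 'high' | 'medium' | 'low',     // from the Claude analysis
  tier1Threshold: number = 0.7,
): TierResult {
  if (tier1Confidence >= tier1Threshold) return { tier: 'tier1' };  // ~80% of cases
  if (tier2Confidence === 'high') return { tier: 'tier2' };         // ~15% escalate
  return { tier: 'failed' };                                        // ~5% genuine failures
}
```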
Proposed Component Structure¶
export interface VisionFallbackConfig {
enabled: boolean;
// Tier 1: Local models
tier1: {
enabled: boolean;
models: Array<'florence2' | 'paddleocr'>;
confidenceThreshold: number; // 0-1
timeoutMs: number;
};
// Tier 2: Claude vision API
tier2: {
enabled: boolean;
useSoM: boolean; // Set-of-Mark preprocessing
confidenceThreshold: "high" | "medium" | "low";
maxCostUsd: number;
};
cacheResults: boolean;
cacheTtlMinutes: number;
}
Local Model Integration¶
Florence-2 for OCR + object detection:
- Extract all text with bounding boxes
- Detect UI elements (buttons, inputs, menus)
- Generate element descriptions
- ONNX runtime with CUDA/CPU execution
PaddleOCR as fallback:
- Deep text extraction for complex/multi-language cases
- Layout analysis (text, title, list, table, figure)
- Used when Florence-2 confidence < 0.7
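The Florence-2-to-PaddleOCR handoff could look roughly like this. The OcrModel interface is a placeholder for real ONNX/PaddleOCR bindings, which are not part of the current codebase:

```typescript
// Sketch of the proposed handoff: use Florence-2 output when it is
// confident enough, otherwise fall back to PaddleOCR for deeper extraction.
interface OcrResult { text: string; confidence: number; }
type OcrModel = (screenshotPath: string) => OcrResult;

function extractText(
  screenshotPath: string,
  florence2: OcrModel,
  paddleOcr: OcrModel,
  threshold: number = 0.7,   // fall back when Florence-2 confidence < 0.7
): OcrResult {
  const primary = florence2(screenshotPath);
  if (primary.confidence >= threshold) return primary;
  return paddleOcr(screenshotPath);   // slower, deeper extraction path
}
```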
Future Enhancements¶
Planned improvements:
- Auto-retry: Automatically retry with suggested selectors
- More tools: Integrate into swipeOn, scrollUntil, etc.
- Set-of-Mark: Enhanced spatial understanding with visual markers
- Learning: Track corrections to improve suggestions over time
- Multi-screenshot analysis: Compare before/after states
- Visual regression detection: Alert when UI changed significantly
See Also¶
- MCP tool reference - Tool implementation details
- Feature Flags - Runtime configuration