Basic Usage

Simple Example

The following code performs a deterministic action on the click-test-2 environment.

import time
import gymnasium
import miniwob
from miniwob.action import ActionTypes

gymnasium.register_envs(miniwob)

env = gymnasium.make('miniwob/click-test-2-v1', render_mode='human')

# Wrap the code in try-finally to ensure proper cleanup.
try:
  # Start a new episode.
  observation, info = env.reset()
  assert observation["utterance"] == "Click button ONE."
  assert observation["fields"] == (("target", "ONE"),)
  time.sleep(2)       # Only here to let you look at the environment.
  
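  # Find the HTML element with text "ONE".
  for element in observation["dom_elements"]:
    if element["text"] == "ONE":
      break
  else:
    # Without this guard, `element` would silently be the last DOM element.
    raise RuntimeError('No element with text "ONE" was found.')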

  # Click on the element.
  action = env.unwrapped.create_action(ActionTypes.CLICK_ELEMENT, ref=element["ref"])
  observation, reward, terminated, truncated, info = env.step(action)

  # Check if the action was correct.
  print(reward)      # Should be around 0.8 since 2 seconds have passed.
  assert terminated is True
  time.sleep(2)

finally:
  env.close()

The output should look something like this:

[Screenshot of the browser window]

After 2 seconds:

[Screenshot of the browser window]

Environment Initialization

An environment can be created using gymnasium.make:

env = gymnasium.make('miniwob/click-test-2-v1', render_mode='human')

Common arguments include:

render_mode: Render mode. Supported values are:
- None (default): Headless Chrome, which does not show the browser window.
- "human": Show the browser window.
action_space_config: Configuration for the action space. Supported values are:
- An ActionSpaceConfig object.
- A preset name (a string), which is used to instantiate the corresponding ActionSpaceConfig object.

Observation Space

observation, info = env.reset(seed=42)
observation, reward, terminated, truncated, info = env.step(action)

The reset and step methods return an observation, which is a dict with the following fields:

utterance: Task instruction string, such as "Click button ONE.".
fields: Environment-specific key-value pairs extracted from the utterance, such as (("target", "ONE"),).
screenshot: A numpy array of shape (height, width, 3) containing the RGB values.
dom_elements: A tuple of dicts, each listing properties like the geometry and HTML attributes of a visible DOM element.

For example, the observation returned by the reset call above is:

{
  'utterance': 'Click button ONE.',
  'fields': (('target', 'ONE'),),
  'screenshot': array([[[255, 255,   0], ...], ...], dtype=uint8),
  'dom_elements': (
    {'ref': 1, 'parent': 0, 'tag': 'body', ...},
    {'ref': 2, 'parent': 1, 'tag': 'div', ...},
    {'ref': 3, 'parent': 2, 'tag': 'div', ...},
    {'ref': 4, 'parent': 3, 'tag': 'button', 'text': 'ONE', ...},
    {'ref': 5, 'parent': 3, 'tag': 'button', 'text': 'TWO', ...},
  ),
}

See the Observation Space page for more details.
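
As a rough illustration of working with these fields, the sketch below converts fields into a dict and searches dom_elements by text. It runs on a hand-written sample that mirrors the observation above, with the screenshot and the elided element properties left out. The helper names fields_to_dict and find_elements_by_text are hypothetical and not part of MiniWoB++.

```python
def fields_to_dict(fields):
    """Convert the tuple of (key, value) pairs into a dict for easier lookup."""
    return dict(fields)


def find_elements_by_text(dom_elements, text):
    """Return every element whose text equals the given string."""
    # .get is used because this hand-written sample omits elided properties.
    return [element for element in dom_elements if element.get("text") == text]


# Hand-written sample mirroring the observation above (screenshot omitted).
sample_observation = {
    "utterance": "Click button ONE.",
    "fields": (("target", "ONE"),),
    "dom_elements": (
        {"ref": 1, "parent": 0, "tag": "body"},
        {"ref": 2, "parent": 1, "tag": "div"},
        {"ref": 3, "parent": 2, "tag": "div"},
        {"ref": 4, "parent": 3, "tag": "button", "text": "ONE"},
        {"ref": 5, "parent": 3, "tag": "button", "text": "TWO"},
    ),
}

target = fields_to_dict(sample_observation["fields"])["target"]
matches = find_elements_by_text(sample_observation["dom_elements"], target)
print(target, [element["ref"] for element in matches])  # ONE [4]
```

Returning a list rather than the first match makes the "no match" and "several matches" cases explicit, which the caller can then handle as it sees fit.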

Action Space

action = env.unwrapped.create_action(ActionTypes.CLICK_ELEMENT, ref=element["ref"])
observation, reward, terminated, truncated, info = env.step(action)

The step method takes an action object, which should be a dict with the following fields:

action_type: The action type index from env.unwrapped.action_space_config.action_types.
Other fields, such as ref, coords, and text, should be specified depending on the action type. The action space env.unwrapped.action_space specifies which fields should be included.

For example, the action returned by the create_action call above is:

{
  'action_type': 8,     # ActionTypes.CLICK_ELEMENT in the default action config.
  'ref': 4,             # The button with text 'ONE' from observation['dom_elements'].
  ...                   # Other fields are ignored for CLICK_ELEMENT.
}  

In actual code, the web agent should generate an action based on the observation.
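
One way to picture this is an observe-decide-act loop in which a policy function maps each observation to an action. The following is only a sketch under stated assumptions: click_target_policy, run_episode, and make_action are illustrative names rather than MiniWoB++ APIs, and StubEnv is a made-up stand-in (with made-up rewards) so that the code runs without a browser.

```python
def click_target_policy(observation):
    """Return the ref of the element whose text matches the 'target' field."""
    target = dict(observation["fields"])["target"]
    for element in observation["dom_elements"]:
        if element.get("text") == target:
            return element["ref"]
    raise LookupError(f"no element with text {target!r}")


def run_episode(env, policy, make_action, max_steps=10):
    """Run one episode: observe, let the policy decide, step, and repeat."""
    observation, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = make_action(policy(observation))
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward


class StubEnv:
    """Hypothetical stand-in that only mimics the reset/step return shapes."""

    def reset(self):
        observation = {
            "fields": (("target", "ONE"),),
            "dom_elements": (
                {"ref": 4, "tag": "button", "text": "ONE"},
                {"ref": 5, "tag": "button", "text": "TWO"},
            ),
        }
        return observation, {}

    def step(self, action):
        # Made-up reward: 1.0 for clicking ref 4, -1.0 otherwise.
        reward = 1.0 if action["ref"] == 4 else -1.0
        return {"fields": (), "dom_elements": ()}, reward, True, False, {}


total = run_episode(StubEnv(), click_target_policy, lambda ref: {"ref": ref})
print(total)  # 1.0
```

With a real environment, make_action would instead wrap the create_action call shown above, for example lambda ref: env.unwrapped.create_action(ActionTypes.CLICK_ELEMENT, ref=ref).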

See the Action Space page for more details.