Basic Usage#

Simple Example#

The following code performs a deterministic action on the click-test-2 environment.

import time
import gymnasium
from miniwob.action import ActionTypes

env = gymnasium.make('miniwob/click-test-2-v1', render_mode='human')

# Wrap the code in try-finally to ensure proper cleanup.
try:
  # Start a new episode.
  observation, info = env.reset()
  assert observation["utterance"] == "Click button ONE."
  assert observation["fields"] == (("target", "ONE"),)
  time.sleep(2)       # Only here to let you look at the environment.
  
  # Find the HTML element with text "ONE".
  for element in observation["dom_elements"]:
    if element["text"] == "ONE":
      break

  # Click on the element.
  action = env.unwrapped.create_action(ActionTypes.CLICK_ELEMENT, ref=element["ref"])
  observation, reward, terminated, truncated, info = env.step(action)

  # Check if the action was correct. 
  print(reward)      # Should be around 0.8 since 2 seconds has passed.
  assert terminated is True
  time.sleep(2)

finally:
  env.close()

The output should look something like this:

../../_images/example-usage-1.png

After 2 seconds:

../../_images/example-usage-2.png

Environment Initialization#

An environment can be created using gymnasium.make:

env = gymnasium.make('miniwob/click-test-2-v1', render_mode='human')

Common arguments include:

  • render_mode: Render mode. Supported values are:

    • None (default): Headless Chrome, which does not show the browser window.

    • "human": Show the browser window.

  • action_space_config: Configuration for the action space. Supported values are:

Observation Space#

observation, info = env.reset(seed=42)
observation, reward, terminated, truncated, info = env.step(action)

The reset and step methods return an observation, which is a dict with the following fields:

  • utterance: Task instruction string, such as "Click button ONE.".

  • fields: Environment-specific key-value pairs extracted from the utterance, such as (("target", "ONE"),).

  • screenshot: A numpy array of shape (height, width, 3) containing the RGB values.

  • dom_elements: A tuple of dicts, each listing properties like the geometry and HTML attributes of a visible DOM element.

For example, the observation from the reset command above is

{
  'utterance': 'Click button ONE.',
  'fields': (('target', 'ONE'),),
  'screenshot': array([[[255, 255,   0], ...], ...], dtype=uint8),
  'dom_elements': (
    {'ref': 1, 'parent': 0, 'tag': 'body', ...},
    {'ref': 2, 'parent': 1, 'tag': 'div', ...},
    {'ref': 3, 'parent': 2, 'tag': 'div', ...},
    {'ref': 4, 'parent': 3, 'tag': 'button', 'text': 'ONE', ...},
    {'ref': 5, 'parent': 3, 'tag': 'button', 'text': 'TWO', ...},
  ),
}

See the Observation Space page for more details.

Action Space#

action = env.unwrapped.create_action(ActionTypes.CLICK_ELEMENT, ref=element["ref"])
observation, reward, terminated, truncated, info = env.step(action)

The step method takes an action object, which should be a dict with the following fields:

  • action_type: The action type index from env.unwrapped.action_space_config.action_types.

  • Other fields such as ref, coords, text, etc. should be specified based on the action type. The action space env.unwrapped.action_space specifies which fields should be included.

For example, the action from the create_action command above is

{
  'action_type': 8,     # ActionTypes.CLICK_ELEMENT in the default action config.
  'ref': 4,             # The button with text 'ONE' from observation['dom_elements'].
  ...                   # Other fields are ignored for CLICK_ELEMENT.
}  

In actual code, the web agent should generate an action based on the observation.

See the Action Space page for more details.