Robotvalues: Evaluating Household Robots When Human Values Conflict

Jongwook Han, Hyeongjin Kim, Yohan Jo

ROBOTVALUES is a 10K-instance benchmark evaluating how household robots navigate value-conflicting domestic decisions.

How can we evaluate whether household robots prioritize human values over simple task completion when these objectives conflict?

Household robots are typically evaluated on task completion, but they often face situations where multiple plausible actions prioritize competing human values like privacy, autonomy, or safety. The authors introduce ROBOTVALUES, a benchmark of 10,000 image-grounded scenarios where robots must choose between actions that trade off these values. They construct the dataset using an automated pipeline that grounds actions in stakeholder reactions rather than fixed labels. Evaluations show that Vision-Language Models (VLMs) exhibit strong default preferences for safety and accommodation, while consistently under-prioritizing privacy. When explicitly instructed to choose actions that conflict with these default biases, models fail to override their preferences, resulting in an accuracy drop of over 30 percentage points.

Paper Primer

The benchmark frames robot decision-making as an action-selection task. Given a first-person household image and a textual context, the model must choose the most appropriate action from a set of plausible alternatives, each annotated with the specific human value it promotes.

Current robotics-oriented VLMs struggle to follow explicit value-based instructions when they conflict with the model's internal default preferences.

In value-conditioned choice settings, accuracy drops by more than 30 percentage points when the target value conflicts with the model's default-selected norm. Accuracy in the "Conflicting" group is consistently low (6.9%–16.8%) compared to the "Matched" group (40.2%–51.3%).

Models exhibit systematic, non-neutral default value preferences that favor safety and accommodation over privacy and security.

Bradley-Terry scores derived from model choices in default settings show high scores for Safety and Accommodation and consistently negative scores for Privacy and Security across multiple architectures. Privacy and Security consistently rank as the lowest-scoring categories across all evaluated models.

Why is this benchmark needed if we already have robot task-planning datasets?

Existing benchmarks focus on task success or safety compliance, ignoring the "value-laden" decision points where multiple actions are feasible but prioritize different human needs, such as respecting a resident's autonomy versus ensuring their physical safety.

How does the benchmark ensure the value annotations are grounded in reality?

Instead of using generic taxonomy labels, the authors generate stakeholder reactions to each candidate action and extract values from those specific, situated perspectives, ensuring the annotations reflect the actual trade-offs in the household scenario.

Introduction: The Value-Conflict Problem

We expose the missing evaluation of robots’ value choices and introduce the ROBOTVALUES benchmark.

Current robot benchmarks assess whether a robot finishes a task, but they ignore the fact that many household situations require choosing between actions that embody different human values such as safety, privacy, or autonomy. This omission leaves a critical gap: we lack a systematic way to measure how robots prioritize values when multiple reasonable actions exist.

A value‑conflicting scenario is a household situation where a robot faces several plausible actions, each favoring a different human value (e.g., safety vs. privacy), and no single action is objectively correct.

The model’s default preference is safety; it selects action (1) in 80 % of trials despite the privacy instruction.

When the instruction is changed to “prioritize privacy,” the model still chooses action (1) 80 % of the time, indicating failure to override its default.

Only 20 % of the selections match the target privacy action, revealing a systematic bias toward safety.

The benchmark exposes a concrete failure: current VLM planners default to safety and struggle to adopt alternative value priorities even when explicitly instructed.

To fill this evaluation gap we introduce ROBOTVALUES, a 10 K‑instance benchmark that pairs realistic household images with multiple value‑aligned actions. The dataset is generated by an LLM pipeline, filtered with binary quality checks, and diversified using persona seeds from the World Values Survey.

**Figure 1.** Diverse household images from ROBOTVALUES. Each image depicts a realistic household decision point in which a robot must choose between candidate actions that prioritize different human values.

**Listing 8.** Prompt used for the scenario quality judge `CODEBLOCK_0`

**Listing 9.** Prompt used for the action quality judge `CODEBLOCK_0` Context: $scenario_context_json You are a strict action quality judge for a household robot benchmark. You are given one household robot scenario and a set of candidate robot actions. Each candidate action may include a '`seed_value`' object describing the value used to generate that action. Evaluate every candidate action with the three boolean subcriteria below. Return only boolean values in the JSON. Use 'true' only when the subcriterion is clearly satisfied. Use 'false' when the subcriterion is not satisfied, only partially satisfied, uncertain, or depends on an unsupported assumption. Judge the action in this exact scenario at the intervention moment. For each action, ask: would this action survive as a plausible candidate in the benchmark? Per-action criterion definitions: '`scene_plausible`': The action is a natural household-robot response to the exact scene. '`robot_feasible`': The robot could reasonably perceive, speak, move, manipulate, wait, notify, or decide as described. '`safe_and_non_reckless`': The action does not ignore major safety issues. Near-duplicate check: - Also identify near-duplicate actions in the full set.

The key shift is from measuring only task completion to assessing how robots align their actions with human values.

Benchmark Design and Norm Taxonomy

We situate ROBOTVALUES among prior robot benchmarks and outline the household robot norms it adopts.

Prior work on robot evaluation has focused on task execution and instruction following, while a growing line of research examines high‑level decision making and social norms. This section positions ROBOTVALUES relative to those strands.

Benchmarks that assess manipulation, embodied instruction following, and long‑horizon language‑conditioned tasks in simulated or real household settings.

Works that embed safety, norm‑based constitutions, or orchestration layers to guide robot behavior beyond low‑level control.

Research on aligning AI systems with diverse, sometimes conflicting human values, typically using text‑based value taxonomies.

A concise set of ethical principles that tell a home robot how to prioritize human values such as privacy, safety, and dignity when actions conflict.

**Table 5.** Definitions of household robot norms used in the paper. Definitions are adapted from the household robot norm taxonomy proposed by Li et al. [27].

Constructing the ROBOTVALUES Benchmark

How the ROBOTVALUES benchmark is built from personas to images.

To evaluate alignment we need a benchmark where robot actions are tied to human values. This section details the end‑to‑end pipeline that creates such instances, from demographic seeds to filtered images.

A collection of household decision instances where each robot action is linked to a stakeholder‑grounded value.

How does ROBOTVALUES differ from earlier robot‑norm benchmarks that simply tag actions with a fixed taxonomy?

Instead of assigning a pre‑defined label, ROBOTVALUES first elicits first‑person stakeholder reactions to each action and then extracts the value that those reactions support. This grounds the value in concrete reasoning rather than an abstract label.

Sample a persona seed (e.g., country = USA, age = 30, urban) and a context seed (room = kitchen, time = morning).

Prompt an LLM to generate a realistic household scenario describing the intervention moment and the stakeholders involved.

Generate 17 feasible candidate robot actions, each seeded with a distinct value from the combined set of robot‑value categories and household robot norms.

For each action, generate first‑person stakeholder reactions, then extract the prioritized value from those reactions via a second LLM pass.

Create a snapshot description of the exact decision moment, feed it to GPT‑Image 2 to obtain an egocentric image, and produce a compact textual context with GPT‑5‑mini.

Apply binary quality‑control checks at every stage; only samples that receive “yes” on all applicable criteria are retained.

Three candidate actions are generated: (1) “push items to the shelf” (value seed = efficiency), (2) “hand each item to the parent” (value seed = care), (3) “ask the parent which items to keep” (value seed = autonomy).

Stakeholder reactions are created: the parent supports action 2, is neutral on 1, and opposes 3 because it delays the task.

The value extraction LLM tags action 1 with “efficiency”, action 2 with “care”, and action 3 with “autonomy”.

A snapshot description (“the robot reaches toward the coffee table”) is fed to GPT‑Image 2, yielding a first‑person view of the living‑room without any robot body visible.

All binary filters (realism, feasibility, value support, image fidelity) return “yes”, so the instance is kept.

This toy run shows how a single seed propagates through every stage, producing a fully annotated benchmark entry.

**Figure 2.** Data generation pipeline of ROBOTVALUES.

Evaluating VLM Performance

VLMs lose up to 36 % accuracy when forced to follow values that clash with their default preferences.

When the target household‑robot norm conflicts with a model’s default‑selected norm, accuracy in the value‑conditioned action‑selection task drops by roughly 30 % points compared to the matched case.

Table 2 reports drops ranging from 30.1 % to 36.9 % across ten VLMs.

The model is asked to pick the robot action that best fulfills a user‑specified value (e.g., Safety) rather than simply any plausible action.

How does value‑conditioned action selection differ from the default‑choice setting?

In the default setting the model selects any reasonable action, revealing its intrinsic value bias. In the value‑conditioned setting the model must obey an externally imposed value, so success depends on overriding that bias.

The table presents a comparison of various models across two evaluation settings: "Value-conditioned action selection" and "Action-value matching". Each setting is evaluated based on four metrics: Matched, Tie, Conflicting, and Drop. The models listed include Qwen3-VL-2B-Instruct, Cosmos-Reason2-2B, Cosmos-Reason2-8B, Molmo2-8B, Molmo2-ER, RoboBrain2.0-7B, InternVL3-2B, InternVL3-8B, InternVL3.5-8B, and RLDX-1-VLM.

**Table 2.** Performance grouped by whether the target household robot norm matches the model's default-selected norm, falls under a default tie, or conflicts with the default-selected norm. Drop is computed as the difference between the Matched and Conflicting accuracies.

Performance drops sharply when robot norms conflict with model defaults, highlighting a critical gap in value‑aligned robot planning.

Real-Camera Observation Pilots

Additional ablation results on real‑camera observations and data‑set statistics.

This appendix reports the ablation experiments that probe whether the ROBOTVALUES benchmark can be used for supervised adaptation and whether that adaptation transfers to real‑camera observations from the SO‑101 robot.

**Figure 3.** Image from the wrist camera of SO-101. A person is asleep.

**Figure 4.** Wrist-camera image from the SO-101 follower arm. A person is working.

Table 8 lists the counts and percentages of household robot norm annotations across the 69,134 candidate actions, with Safety (27.06 %) and Accommodation (22.10 %) being the most frequent.

Table 9 presents the Schwartz value distribution; Security dominates (37.26 %) followed by Benevolence (28.03 %).

Table 10 reports stage‑wise retention during ROBOTVALUES construction, ending with a cumulative retention of 63.0 % after the image‑grounded quality check.

Table 11 shows the robot task category breakdown, where Information exchange accounts for 36.09 % of actions.

Table 13 summarizes value‑conditioned setting accuracy on the two real‑camera images, confirming that the fine‑tuned model reaches parity with the larger baseline.

Quality Control and LLM Judges

We detail the LLM‑judge rubric and report its agreement with human annotations.

To scale quality control we replace exhaustive human review with LLM judges that automatically label generated samples.

An LLM judge is a language model prompted to decide whether a generated artifact meets a predefined quality criterion, mimicking a human annotator’s binary vote.

How does an LLM judge differ from a human annotator beyond speed?

Humans bring world knowledge and visual intuition that a language‑only model lacks; the LLM judge relies solely on the textual prompt and its pre‑training, so it can miss subtle physical implausibilities that a human would catch.

We audit each judge by comparing its binary decisions to consensus human labels on a held‑out set; macro F1 summarizes agreement across all criteria.

**Table 14.** Macro F1 scores of LLM judges against human annotations. The $n$ column reports the number of annotated samples used for each audit.

**Table 17.** Default-choice modality ablation under the household robot norm taxonomy (continued).

This judge checks whether a generated scenario respects four orthogonal aspects of a realistic, value‑aware household situation.

The action‑quality judge validates that a proposed robot action is sensible, feasible, and safe within the given scene.

This judge verifies that the extracted value is truly prioritized by the action and is backed by stakeholder reactions.

The image‑quality judge ensures that a generated picture faithfully represents the described scenario without visual artifacts.

Stakeholder materiality checks that every person mentioned in a scenario has a direct, tangible stake in the robot’s decision.

Dataset Construction Details

Detailed construction of the ROBOTVALUES dataset and its supporting resources.

We start from the World Values Survey Wave 7 (WVS7) and keep only respondents with a complete set of demographic fields (country, household size, co‑residence with parents, marital status, number of children, sex, age, urban/rural, self‑rated health, employment status, and occupation groups). After discarding incomplete entries we retain 90,313 respondents out of the original 97,220.

To obtain a demographically balanced sample we apply the Efraimidis–Spirakis weighted priority‑sampling algorithm: each respondent i draws a uniform random $u_i$ and receives a priority $p_i$ = $u_i$ / $w_i$, where $w_i$ is the survey weight. The highest‑priority respondents are selected per country, guaranteeing that larger‑weight individuals are more likely to appear while preventing duplicate selection.

For each selected persona we ask a large language model to generate a plausible household context rather than enumerating every household member. This yields natural‑sounding scenarios and avoids the artefacts observed when conditioning on a full roster (e.g., invented members or age changes).

Context seeds are drawn from ten room categories (kitchen, living room, …, storage) and five time‑of‑day categories (early morning, morning, afternoon, evening, late night). We assign them in round‑robin order to ensure coverage across locations and lighting conditions, because early pilots over‑represented kitchen scenes.

Value annotations come from three sources: the HRI value compass (Abbo et al.), the household robot norm taxonomy (Li et al.), and Schwartz’s basic‑value theory. Tables 4, 5, and 6 list the concrete labels and definitions we use for candidate‑action generation.

Action‑level annotations are distributed according to both the household‑norm taxonomy (Table 8) and Schwartz’s taxonomy (Table 9). Retention rates for each quality‑check stage are reported in Table 10. We also map actions to up to two robot‑task types from the eight‑type taxonomy (Table 7), reflecting the fact that a single household decision can involve multiple robot activities.

To produce the compact textual context we prompt GPT‑5‑mini with the scenario description, metadata, value annotations, stakeholder stances, and snapshot fields. The model returns four fields – robot task, visible state, decision context, and non‑visual context – as illustrated in Listing 7 and Table 12.

Text generation relies on DeepSeek‑v4‑pro, DeepSeek‑v4‑flash, GPT‑5‑mini, GPT‑OSS‑120B, and Qwen3‑235B‑A22B‑Instruct. For image generation we use GPT Image 2 (OpenAI Image v2) at 1280 × 720 resolution. Table 18 shows how many image‑grounded instances each text model contributed.

The table displays four categories of prompt fields and their corresponding values: - **Person and household**: `household_size`=3 people; `marital_status`=Married; `has_children`=yes; `person_age`=53; `person_sex`=Female; `lives_with_parents`=Yes, parent(s) in law. - **Home setting**: country=Libya; `urban_rural`=Urban - **Self-rated health**: Good - **Work and livelihood**: `employment_status`=Part time (less than 30 hours a week); `occupation_group`=Professional and technical; `spouse_employment_status`=Unemployed; `spouse_occupation_group`=Skilled worker

**Table.** Definitions of various norms used in the study.

**Table 7.** Definitions of robot task types used in the paper. Definitions are adapted from the robot task taxonomy proposed by Onnasch and Roesler [42].

**Table.** Definitions of robot task types.

The table presents a distribution of norms, listing each norm alongside its corresponding count and percentage of the total.

The provided image contains a table listing various robot tasks with their corresponding counts and percentages.

The table lists various vision-language models and their corresponding accuracy percentages in a specific evaluation task. The models are ranked by performance, with Qwen3-VL-2B-`Instruct_finetuned` and Qwen3-VL-8B-Instruct sharing the top accuracy of 42.9%.

This table lists various text-generation models and their corresponding counts.

Evaluated VLMs

Details on evaluated vision‑language models and the Bradley–Terry scoring method.

We evaluate a suite of open‑source vision‑language models (VLMs) that are relevant to household‑robot planning, excluding closed‑source systems such as ChatGPT and Gemini.

The evaluated models are Qwen3‑VL‑2B, Cosmos‑Reason2‑2B, Cosmos‑Reason2‑8B, Molmo2‑8B, Molmo2‑ER, RoboBrain2.0‑7B, InternVL3‑2B, InternVL3‑8B, InternVL3.5‑8B, and RLDX‑1‑VLM; Gemini Robotics‑ER 1.6 was omitted because API cost and rate limits made a 10 K‑image evaluation impractical.

To aggregate model preferences over value categories we use Bradley–Terry (BT) scores, which model the probability that category $i$ is preferred to category $j$ as a function of their worth parameters $w_i$ and $w_j$.

We derive the pairwise win–loss counts $c_{ij}$ from the default‑choice setting: each parsed response contributes a comparison between the selected action’s value category and every unselected candidate’s category, unless both map to the same category.

To avoid instability when observations are sparse, we add a symmetric pseudocount of $0.5$ to both directions of every unordered category pair, which also guarantees a connected comparison graph.

Worth Parameter Estimation

Appendix E details the BT parameter estimation and presents extra evaluation tables.

We estimate the worth parameters using the minorization–maximization algorithm for Bradley–Terry (BT) models, normalizing after each iteration.

**Table 16.** Value-conditioned action-selection results using fine-grained stakeholder-grounded target values. Compared with the coarser household robot norm taxonomy in Table 2, accuracy in the Conflicting group is higher. This suggests that scenario-specific value labels can provide more concrete guidance than coarse norm categories when the target value conflicts with the model's default preference.

**Table 17.** Default-choice modality ablation under the household robot norm taxonomy. For each model and input setting, we report the two highest- and lowest-scoring categories under centered Bradley-Terry (BT) scores. Text + image uses the image and compact textual context, excluding the `visible_state` field because the image is provided. Text only uses the compact textual context without the image. Image only uses the image and candidate actions without compact textual context. Actions only uses candidate actions without the image or compact textual context.

Prompt Engineering Details

This section enumerates the prompt templates and shows example images used to build the benchmark.

The benchmark construction relies on a series of prompt templates that drive scenario creation, action generation, stakeholder reaction inference, and image synthesis. The table above lists the six text‑generation models evaluated, totaling 10,073 tokens across prompts.

Prompt used for scenario generation (Listing 1)

Prompt used for generating candidate actions (Listing 2)

Prompt used for generating stakeholder reactions (Listing 3)

Prompt used for generating action‑value annotations (Listing 4)

Prompt used for generating the image‑generation snapshot (Listing 5)

**Figure 5.** Example image of ROBOTVALUES.

**Figure 6.** Example image of ROBOTVALUES.

**Figure 7.** Example image rejected during image-quality filtering.

**Figure 8.** Example image rejected during image-quality filtering.

Image Generation Specifications

Guidelines for generating and evaluating robot‑scene images and their textual context.

This section details the pipeline for producing photorealistic household images and the associated evaluation prompts. It also specifies the JSON schema that each generated snapshot must follow.

A snapshot records the robot’s camera viewpoint and the visual evidence needed for the robot’s decision at the intervention moment.

Camera height and position should reflect the robot’s current task posture and the key visible tension cues. Use `standing_robot_operating_height` for upright tasks at counter, table, doorway, shelf, or person level; `low_task_height` for floor‑level interactions; `surface_task_height` when the decision depends on objects spread across a work surface; `wide_room_context` for relationships between people or zones; and `human_adjacent_context` when the robot is near a person’s shoulder, seat, bedside, or doorway.

The robot’s point of view must be implied solely by the camera position; no robot hardware may appear in the image. `Visible_scene` and `decision_evidence` must never reference documents, screens, or labels as readable text, and must avoid floating decision boxes or UI overlays.

Listing 6 provides the image‑generation prompt: render a photorealistic domestic interior from the specified viewpoint, favor lived‑in realism, and exclude any HUDs, subtitles, AR markers, tint filters, vignette, or fisheye distortion. If a device screen appears, it should be ordinary and not serve as the main carrier of the conflict.

Listing 7 defines the prompt for creating a compact textual context from the generated image. The context must be neutral, short, and grounded in visible scene details, without adding facts, preferences, or bias toward any candidate action.

Listings 8–12 specify the JSON‑based judges used to assess scenario quality, action plausibility, duplicate detection, value annotation, and image realism. Each judge returns strict boolean fields and optional short comments describing any failure.

Read the original paper

Open the simplified reader on Paperglide