SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-Cognitive Variations
Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song
SoCRATES is a benchmark for evaluating proactive LLM mediators across diverse conflict domains and socio-cognitive conditions.
How can we reliably evaluate LLM mediators across diverse, complex real-world conflicts and varying socio-cognitive disputant behaviors?
Evaluating LLM mediators is difficult because mediation is a dynamic, multi-turn process where success depends on shifting emotions and context, yet existing benchmarks rely on narrow, expert-authored domains that fail to capture real-world complexity. SoCRATES automates this by using an agentic pipeline to curate diverse conflict scenarios, probing mediator performance across five independent socio-cognitive axes, and scoring trajectories using a topic-localized evaluator that ignores irrelevant content. Even the strongest frontier models close only about a third of the unmediated consensus gap, with performance varying sharply depending on the specific social or cognitive demands of the conflict.
Paper Primer
The framework operates in three stages: agentic scenario curation, socio-cognitive probing, and topic-localized evaluation. By isolating variables like cultural identity or emotional reactivity into independent axes, the system identifies exactly which competencies a mediator lacks rather than providing a single, opaque score.
The topic-localized evaluator significantly improves alignment with human expert judgment compared to per-turn baselines.
Pearson correlation with human experts on trajectory-level scoring reaches 0.82, more than double the 0.40 of the per-turn baseline.
General LLM capability does not guarantee mediation success; performance is highly sensitive to conflict domain and socio-cognitive context.
Consensus gain drops significantly when mediators face adversarial (Competing) or emotionally volatile (Reactive) conditions. Consensus gain ranges from 41.3 in transactional disputes to 16.6 in intra-organizational conflicts.
Why is "topic-localized" evaluation necessary for mediation?
Standard per-turn evaluators score every topic at every turn, allowing off-topic noise to distort the consensus trajectory. SoCRATES only scores topics when they are actively in play, preventing irrelevant content from compounding errors.
How does SoCRATES ensure scenarios are actually difficult?
The framework uses a rejection-sampling pipeline where candidate scenarios are simulated without a mediator; only those that consistently end in impasse are retained for the benchmark.
Introduction and Motivation
We expose the gap between static mediation benchmarks and real‑world conflict complexity and introduce SoCRATES to close it.
LLM mediation refers to using large language models as third‑party agents that intervene in a dispute, steering parties toward agreement while handling shifting emotions, intentions, and context in real time.
Current mediation testbeds suffer three concrete shortcomings: they cover only a handful of expert‑authored domains, they vary solely along strategic posture, and they score every turn against every topic, which lets off‑topic chatter distort the signal.
SoCRATES is an automated benchmark that builds realistic conflict scenarios, probes mediators along independent socio‑cognitive axes, and evaluates their trajectories with a topic‑localized scorer.
SoCRATES tackles the three challenges identified earlier: (1) it scales scenario coverage by treating scenario creation as an autonomous agentic pipeline; (2) it isolates socio‑cognitive variation by probing each axis separately; and (3) it provides trajectory‑aware, noise‑resilient scoring through topic‑localized evaluation.
The core gap is that static benchmarks cannot capture the multi‑dimensional, real‑world complexity of mediation, and SoCRATES provides a unified solution.
The SoCRATES Framework
We detail the three‑stage SoCRATES pipeline and its experimental configuration.
The SoCRATES benchmark proceeds in three sequential stages: (1) Agentic Scenario Curation builds a pool of realistic conflict scenarios; (2) Socio‑Cognitive Probing expands each scenario along five independent axes; and (3) Topic‑Localized Evaluation scores mediator behavior with three per‑turn metrics.
LLM agents automatically retrieve real‑world disputes, rewrite them into a structured format, and filter out any that resolve without mediation.
How does this pipeline differ from prior testbeds that relied on human‑crafted scenarios?
Earlier benchmarks required experts to hand‑pick and manually rewrite cases, limiting domain coverage. Our agentic pipeline scales automatically across eight domains and produces 40 diverse scenarios without manual labor.
Searcher returns the seed case (1 article, 2 parties).
Writer produces the structured tuple s as described.
Three simulations are executed; all end at turn 100 with no consensus.
Scenario is kept for probing because it satisfies the impasse criterion.
This concrete trial shows how a single domain yields a full scenario that survives the filtering stage, illustrating the end‑to‑end flow without any human intervention.
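The impasse filter in this trial can be sketched as a small rejection-sampling check. This is a minimal illustration, not the authors' implementation: `simulate` is a hypothetical callable standing in for one unmediated party simulation, and the two toy simulators and scenario names are invented for the example.

```python
from typing import Callable, Dict

def keep_scenario(simulate: Callable[[int], bool], n_runs: int = 3) -> bool:
    """Return True only if every unmediated run ends in impasse.

    `simulate(seed)` is a hypothetical stand-in for one unmediated dialogue;
    it returns True when the parties reach consensus on their own.
    """
    return not any(simulate(seed) for seed in range(n_runs))

# Toy simulators (assumptions for illustration only).
def always_impasse(seed: int) -> bool:
    return False  # the parties never agree without a mediator

def sometimes_resolves(seed: int) -> bool:
    return seed == 1  # one of the three runs reaches consensus

candidates: Dict[str, Callable[[int], bool]] = {
    "hospital_wind_down": always_impasse,
    "easy_refund": sometimes_resolves,
}

# Only scenarios that consistently end in impasse are retained.
retained = [name for name, sim in candidates.items() if keep_scenario(sim)]
print(retained)  # ['hospital_wind_down']
```

A scenario that resolves in even one unmediated run is rejected, which matches the "consistently end in impasse" criterion described above.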
Each curated scenario is duplicated and altered along a single socio‑cognitive dimension to isolate its effect on mediator performance.
Why are the axes applied one at a time instead of combined?
Combining axes would entangle their influences, making it impossible to trace a performance shift back to a specific socio‑cognitive factor. Independent application preserves a clean causal link.
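To make the one-axis-at-a-time design concrete, the sketch below enumerates conditions from a small axis table. The axis names and values here are hypothetical placeholders chosen for illustration; they are not the benchmark's actual condition list.

```python
from typing import Dict, List, Tuple

def expand_conditions(axes: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Build the general baseline plus one condition per (axis, value).

    Each condition changes exactly one axis, so a performance shift
    can be attributed to that axis alone.
    """
    conditions = [("general", "baseline")]
    for axis, values in axes.items():
        conditions.extend((axis, value) for value in values)
    return conditions

# Hypothetical two-axis table (illustrative values only).
toy_axes = {
    "strategic_posture": ["competing", "accommodating"],
    "emotional_reactivity": ["calm", "reactive"],
}

conditions = expand_conditions(toy_axes)
for axis, value in conditions:
    print(axis, value)
print(len(conditions))  # 5: one baseline plus four single-axis conditions
```

The condition count grows with the sum of the axis sizes rather than their product, which keeps the grid small while preserving attribution.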
Instead of scoring the whole dialogue, the evaluator scores each topic only at the turns where that topic is actively discussed, then aggregates these scores into per‑turn consensus metrics.
Why not score every turn for every topic, as some prior benchmarks do?
Scoring inactive topics adds irrelevant variance; focusing on active turns yields a cleaner signal and improves agreement with human experts, as shown by the higher Pearson correlation.
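The following sketch illustrates the idea of topic-localized scoring under two stated assumptions: a topic keeps its most recent score while it is not being discussed, and the per-turn consensus is the unweighted mean over topics. Both choices, the function names, and the toy scores are assumptions made for illustration; the source does not specify the aggregation rule.

```python
from typing import Callable, Dict, List, Set, Tuple

def localized_trajectory(
    topics: List[str],
    active_per_turn: List[Set[str]],
    score_topic: Callable[[int, str], float],
    initial: float = 0.0,
) -> List[float]:
    """Score a topic only on turns where it is active; otherwise carry forward."""
    latest = {topic: initial for topic in topics}
    trajectory = []
    for turn, active in enumerate(active_per_turn):
        for topic in active:
            # The judge is only called for topics currently in play.
            latest[topic] = score_topic(turn, topic)
        # Assumed aggregation: unweighted mean over all topics.
        trajectory.append(sum(latest.values()) / len(latest))
    return trajectory

# Toy judge scores keyed by (turn, topic); a real evaluator would call an LLM.
judge_scores: Dict[Tuple[int, str], float] = {
    (0, "price"): 0.5,
    (2, "timeline"): 0.5,
    (3, "price"): 1.0,
}
active_per_turn = [{"price"}, set(), {"timeline"}, {"price"}]

trajectory = localized_trajectory(
    ["price", "timeline"],
    active_per_turn,
    lambda turn, topic: judge_scores[(turn, topic)],
)
print(trajectory)  # [0.25, 0.25, 0.5, 0.75]
```

Turn 1 contains no active topic, so the trajectory stays flat there instead of absorbing a noisy re-score of every topic.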
Agentic Scenario Curation assembles a diverse pool of conflict scenarios from real‑world sources.
Socio‑Cognitive Probing expands each scenario along five independent axes, creating matched mediated and unmediated runs.
Topic‑Localized Evaluation scores each run with per‑turn consensus metrics and aggregates the three final performance measures.
**Figure 1.** Overview of SoCRATES: agentic scenario curation grounds scenarios in a real conflict, socio-cognitive probing expands scenarios along five axes to expose where mediators fail, and topic-localized evaluation scores each trajectory with three metrics to quantify the mediator's contribution.
Validation of the Benchmark
We confirm SoCRATES’ simulators and evaluator reliably reflect persona intensity and expert judgments.
The topic‑localized evaluator aligns with expert judgments far better than prior LLM judges.
Table 2 shows Pearson $r$ = 0.823 on full trajectories, surpassing baselines.
All other experimental settings (annotation pool size, Krippendorff’s $\alpha$ of 0.75 for simulation and 0.86 for the evaluator, and the four scalar levels $\{0, 0.33, 0.66, 1\}$) were held constant across simulators and evaluators.
**Table 1.** Persona fidelity of the simulators, reported as accuracy (%) from A/B comparison-based evaluation.
**Table 2.** Evaluator alignment with experts (Pearson r). The values in parentheses represent p-values.
Benchmarking LLM Mediators
We evaluate eight LLM mediators on diverse conflict domains and socio‑cognitive stresses.
Mediator performance clusters into a top tier (30.7–34.4 consensus gain) and a bottom tier (15.7–21.0).
Table 3 shows the same split across all eight conflict domains, with no mediator exceeding half the unmediated gap.
Consensus gain measures how much a mediator lifts the agreement rate above the baseline where no mediator intervenes.
Why isn’t consensus gain simply the final agreement rate?
Agreement rates differ across domains; the gain subtracts the domain‑specific baseline so we compare only the mediator’s added value.
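As a small worked example of the baseline subtraction, the sketch below computes per-domain gain as the mediated score minus the unmediated score. The numbers are invented for illustration and are not taken from Table 3; the function name is also an assumption.

```python
from typing import Dict

def consensus_gain(mediated: Dict[str, float], unmediated: Dict[str, float]) -> Dict[str, float]:
    """Per-domain gain: mediated consensus minus the unmediated baseline."""
    return {domain: mediated[domain] - unmediated[domain] for domain in mediated}

# Made-up consensus scores on a 0-100 scale (not values from the paper).
unmediated = {"Trans": 20, "Intra": 10}
mediated = {"Trans": 60, "Intra": 25}

gains = consensus_gain(mediated, unmediated)
print(gains)  # {'Trans': 40, 'Intra': 15}
```

Comparing raw mediated scores would mix the domain's intrinsic difficulty with the mediator's contribution; subtracting the baseline removes the first component.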
Effectiveness quantifies how much a mediator’s intervention improves the outcome when it actually speaks.
How does effectiveness differ from timeliness?
Timeliness counts *when* an intervention occurs; effectiveness measures *how much* that intervention moves the parties toward agreement.
Timeliness records the proportion of dialogue turns at which a mediator chooses to intervene.
Why would a mediator that intervenes on every turn be undesirable?
Constant interruptions can overwhelm participants and prevent natural resolution pathways, leading to lower consensus gain despite high timeliness.
**Table 3.** Conflict resolution performance of the eight mediators across eight domains: Trans (Transactional), Heal (Healthcare), Env (Environmental), B2B (Business-to-Business), Pol (Public-Policy), Intl (International), Legal (Legal), and Intra (Intra-organizational). Cell color intensity increases within each column to indicate higher scores.
**Figure 2.** Mediator adaptation across general condition and five socio-cognitive axes, measured by consensus gain.
**Figure 3.** Consensus gain shift from the general (unperturbed) condition along three axes: (a) strategic posture, (b) emotional reactivity, and (c) cultural identity. Negative values indicate degradation, positive values improvement.
**Figure 4.** Intervention Effectiveness over conversation progress, where turns are mapped to a 0–100% scale to align varying turn counts, across the general condition and each hard condition from five socio-cognitive axes.
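One simple way to align dialogues of different lengths on a 0–100% progress scale is a linear rescaling of the turn index. The sketch below assumes that mapping; the source does not state the exact formula, so treat the function as illustrative.

```python
def progress_percent(turn_index: int, n_turns: int) -> float:
    """Map a 0-based turn index to a 0-100% progress value (assumed linear)."""
    if n_turns < 2:
        return 0.0  # a single-turn dialogue has no meaningful progress axis
    return 100.0 * turn_index / (n_turns - 1)

# Dialogues with 5 and 100 turns both span the same 0-100% range.
print(progress_percent(0, 5), progress_percent(2, 5), progress_percent(4, 5))
print(progress_percent(99, 100))
```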
Robustness and Sensitivity
We examine robustness of mediator rankings to evaluator and simulator changes, and to repeated runs.
We first test whether the choice of evaluator backbone influences mediator rankings. Swapping DeepSeek‑V3.2 with Qwen3‑235B‑A22B‑Instruct yields only small metric shifts.
Next we examine robustness to the disputant simulator. Replacing DeepSeek‑V3.2 party agents with Qwen3‑235B‑A22B‑Instruct changes absolute scores but preserves the relative adaptation patterns.
Conclusion and Discussion
We detail the LLM backbones, scenario construction pipeline, and artifact tables supporting SoCRATES.
Our experiments rely on publicly available LLMs for parties, mediators, and evaluators, accessed through OpenAI and Google APIs for proprietary models and Hugging Face checkpoints for open‑weight models under their respective licenses.
Table 4 enumerates the exact backbones used at each pipeline stage—searcher, scenario writer, simulator, mediator pool, and evaluator—along with their source (API or checkpoint) and citation.
**Table 4.** Backbone LLM configurations for SoCRATES.
Scenario construction proceeds in three stages, each documented with a dedicated prompt table.
Seed Search uses the o4‑mini‑deep‑research agent (Table 10) to retrieve a concise conflict report covering timeline, stakeholders, core issues, institutional tensions, and current status.
The image is a structured table summarizing the conflict surrounding the Mount Sinai Beth Israel hospital downsizing and closure. It contains six rows categorized by: Conflict, Timeline of key events, Key stakeholders, Core issues of disagreement, Institutional tensions, and Current status.
Scenario Recast (Table 11) transforms each seed into a structured negotiation script, enforcing fictional names, a maximum of four topics, and at least one emotionally provocative issue.
**Table.** Downtown General Wind-Down: Regulator–Provider Bargaining Over Access, Capacity, and Accountability
Preference Weighting (Table 14) prompts the GPT‑5.4 writer to assign integer weights summing to 100 for each party, explicitly forbidding uniform distributions.
**Table 14.** Prompt for per-party preference weighting.
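The constraints on preference weights (integers, summing to 100, not uniform) are easy to check mechanically. The validator below is a hypothetical post-processing step, not part of the published prompt, and it reads "uniform" as "all weights equal".

```python
from typing import Dict

def valid_weights(weights: Dict[str, int]) -> bool:
    """Check that weights are integers, sum to 100, and are not all equal."""
    values = list(weights.values())
    all_int = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
    sums_to_100 = sum(values) == 100
    non_uniform = len(set(values)) > 1  # assumed reading of "no uniform distribution"
    return all_int and sums_to_100 and non_uniform

# Topic names below are placeholders.
print(valid_weights({"access": 50, "capacity": 30, "accountability": 20}))  # True
print(valid_weights({"a": 25, "b": 25, "c": 25, "d": 25}))                  # False
print(valid_weights({"a": 60, "b": 30}))                                    # False
```

Under this reading a single-topic scenario can never pass, since one weight of 100 is trivially uniform.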
The Axis table enumerates the five socio‑cognitive axes (e.g., Cultural Identity, Emotional Reactivity) and the number of condition values per axis, defining the 15 total experimental configurations (the general baseline plus 14 single‑axis conditions).
Condition Pairings combine the axis values into concrete test cases, forming the full set of scenarios evaluated by SoCRATES.
Probing Conditions
How SoCRATES expands scenarios along five socio‑cognitive axes and evaluates them with targeted prompts.
Each base scenario is expanded along the five axes using the party‑axis expansion prompt (Table 16), and the resulting runs are scored by the Topic‑Localized Evaluator using its own prompt (Table 20).
**Table 5.** The 15 conditions per scenario, listed by axis. The general condition is the unexpanded baseline retained from agentic scenario curation (§3.2), while the remaining 14 are produced by applying one socio-cognitive axis to a fresh copy of the scenario.
**Table 16.** Prompt for party-axis expansion.
**Table 20.** Prompt for topic-localized evaluation.
Validation Methodology
This section details the annotation protocols, quality checks, and evaluator validation supporting the benchmark.
Task design pairs two dialogues generated from the same scenario, party role, opponent, and topic structure, altering only the target party’s reactiveness level. For each simulator backbone we sample 160 A/B pairs from the grid $\{0, 0.33, 0.66, 1\}$, yielding 1,120 comparisons across seven backbones, each labeled by three annotators.
Quality control aggregates the three labels per pair by majority vote, defines fidelity as the fraction of pairs where the selected dialogue matches the higher reactiveness level, and reports inter‑annotator agreement of Krippendorff’s $\alpha = 0.75$, sufficient to differentiate simulator backbones (see Table 1).
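The majority-vote fidelity computation can be sketched as follows. The data layout (a list of three labels plus the identity of the higher-reactiveness dialogue) and the toy labels are assumptions for illustration.

```python
from collections import Counter
from typing import List, Tuple

def fidelity(pairs: List[Tuple[List[str], str]]) -> float:
    """Fraction of A/B pairs whose majority label picks the higher-reactiveness dialogue.

    Each pair is (annotator_labels, higher), where labels are "A" or "B"
    and `higher` names the dialogue generated at the higher level.
    """
    hits = 0
    for labels, higher in pairs:
        # With three binary labels the majority is always unique.
        majority, _ = Counter(labels).most_common(1)[0]
        hits += majority == higher
    return hits / len(pairs)

# Toy annotations (invented): three labels per pair.
toy_pairs = [
    (["A", "A", "B"], "A"),
    (["B", "B", "B"], "B"),
    (["A", "B", "B"], "A"),  # the majority picks the wrong dialogue
    (["A", "A", "A"], "A"),
]
print(fidelity(toy_pairs))  # 0.75

# Sample-size arithmetic from the task design: 160 pairs for each of 7 backbones.
print(160 * 7)  # 1120
```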
Two graduate‑student annotators (strong English proficiency) are supervised by a political‑science researcher who reviews the rubric and resolves procedural questions. In parallel we collect three annotations per snippet from MTurk workers meeting a 90 % approval rate, ≥500 approved HITs, and a 90‑point English comprehension test; the supervised set comprises 1,844 snippets from 144 conversations.
The graduate annotators achieve Krippendorff’s $\alpha = 0.86$, establishing a high‑quality reference for evaluator validation.
Consensus is annotated at the snippet level: each snippet contains a back‑and‑forth exchange, background, topics, options, and the preceding snippet. Annotators record both parties’ option‑level positions and assign a 1–5 agreement score, carrying forward the previous score when an issue is omitted.
For validation we average the two supervised scores per topic‑snippet pair and compute Pearson correlation against SoCRATES, a non‑expert annotator, and a per‑turn LLM‑judge baseline. Correlations are reported at both trajectory and outcome levels.
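A minimal sketch of this validation step, assuming the two supervised scores and the evaluator scores are already aligned per topic-snippet pair. The scores below are invented, and the Pearson function is written out by hand so the example has no dependencies.

```python
from math import sqrt
from typing import List, Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Invented 1-5 agreement scores from the two supervised annotators.
expert_a = [1, 2, 4, 5]
expert_b = [1, 4, 4, 5]
# The reference is the average of the two supervised scores per item.
reference: List[float] = [(a + b) / 2 for a, b in zip(expert_a, expert_b)]

# Invented evaluator outputs for the same topic-snippet pairs.
evaluator = [1.5, 3.0, 3.5, 5.0]

r = pearson(reference, evaluator)
print(round(r, 3))
```

The same function would be applied once over all scored snippets (trajectory level) and once over final snippets only (outcome level), under the assumption that those are the two groupings the source refers to.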
**Figure 5.** Trend comparison of consensus score trajectories for ProMediate and SoCRATES. Bold lines show the average trajectory across dialogues, while faint lines in the background depict individual mediation trajectories, illustrating the variability across conversations.
**Table 6.** Evaluator alignment with expert judgments (Pearson r) using Qwen3-235B-A22B-Instruct as the backbone. Values in parentheses denote p-values.
Mediator prompting runs at temperature 0.6. At each party turn the mediator first executes the when‑to‑intervene decision (Table 18); if true, the how‑to‑intervene step (Table 19) generates a single utterance inserted before the next party speaks.
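The two-step mediator loop can be sketched as below. The callables `should_intervene` and `generate_intervention` are hypothetical stand-ins for the when-to-intervene and how-to-intervene prompts. Party turns are pre-scripted here for simplicity, whereas in the benchmark they would be generated in response to the running dialogue.

```python
from typing import Callable, List

def run_dialogue(
    party_turns: List[str],
    should_intervene: Callable[[List[str]], bool],
    generate_intervention: Callable[[List[str]], str],
) -> List[str]:
    """After each party turn, decide whether to intervene; if so, insert
    one mediator utterance before the next party speaks."""
    transcript: List[str] = []
    for utterance in party_turns:
        transcript.append(utterance)
        if should_intervene(transcript):  # step 1: when to intervene
            # Step 2: how to intervene (a single utterance).
            transcript.append("Mediator: " + generate_intervention(transcript))
    return transcript

# Toy decision rules (assumptions); a real mediator would call an LLM
# at temperature 0.6 for both steps.
def toy_should_intervene(transcript: List[str]) -> bool:
    return "never" in transcript[-1]

def toy_generate(transcript: List[str]) -> str:
    return "Let's pause and restate each side's priorities."

transcript = run_dialogue(
    ["A: I will never accept that.", "B: Then we are done.", "A: Fine."],
    toy_should_intervene,
    toy_generate,
)
for line in transcript:
    print(line)
```

Separating the decision from the generation step lets intervention frequency and first-intervention timing be measured independently of what the mediator actually says.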
**Table 7.** Intervention behaviors of eight mediators. IF: Intervention Frequency, FI: First Intervention.
Benchmark stability is examined along three fixed axes: evaluator backbone, party‑agent simulator backbone, and run‑to‑run stochasticity. The reported $\Delta$ DS values (79.2, 21.3, 25.9) quantify performance variation across these dimensions.
**Table 18.** Prompt for mediator intervention decision.
Scenario Curation Prompts
The seed‑search prompt (Table 10) takes a user‑specified conflict query and asks the deep‑research agent for a detailed conflict analysis and a negotiation‑simulation JSON.
Scenario Recasting Prompts
Provides a concrete recast of a real‑world hospital‑closure dispute for LLM mediation testing.
The SoCRATES (Socio‑Cognitive Reliable Automated Testing and Evaluation System) benchmark framework uses this case to probe how language‑model mediators handle complex, multi‑party disputes. Mount Sinai Beth Israel (MSBI) is a roughly 700‑bed acute‑care hospital in Manhattan’s East Village, operated by the private nonprofit Mount Sinai Health System, serving the Lower East Side, East Village, and Chinatown neighborhoods.
2013: Mount Sinai Health System formed through a merger with Continuum Health Partners, absorbing Beth Israel Medical Center.
2016‑05: The system announced a \$500 million “Downtown Transformation” plan to replace the large inpatient facility with a 70‑bed acute‑care site and an expanded outpatient network.
2016–2022: Delays followed regulatory and community push‑back over loss of inpatient psychiatry, addiction services, and 24/7 emergency care.
2023‑09: Mount Sinai filed an updated closure plan with the New York State Department of Health (NY DOH); community groups responded with Article 78 litigation and protests.
2024: NY DOH held public hearings, preliminarily requiring the system to maintain certain emergency and behavioral‑health services during the transition.
Stakeholders include Mount Sinai Health System (objective: wind‑down to stop ≈ \$150 M annual losses; BATNA: unilateral filing subject to NY DOH approval), NY DOH (regulatory authority, concerned with continuity of essential services), and the Coalition to Save Beth Israel with Community Board 3 (advocacy coalition demanding preservation of inpatient psychiatry, addiction treatment, and 24‑hour emergency access).
The core disagreement centers on whether the planned 70‑bed facility can adequately replace the existing 700‑bed hospital’s inpatient psychiatry, addiction services, and emergency care. Institutional tension arises from the regulator’s Certificate of Need authority clashing with local community demands for service continuity.
As of 2024, the closure application remains pending; NY DOH’s conditions require temporary maintenance of critical services while the final decision on the full shutdown is awaited.
Supplementary Prompt Library
Appendix provides the remaining prompts and annotation templates for scenario generation and evaluation.
Table 12 supplies a deep‑research seed for the Healthcare domain; Table 13 shows the recast scenario built from that seed; Tables 14–20 define the series of prompts used to generate preferences, simulate party behavior, expand parties and histories, decide mediator interventions, and evaluate topic‑localized agreement.
**Figure 6.** Mediator adaptation of three mediators under two disputant simulators (DeepSeek-V3.2, solid line; Qwen3-235B-A22B-Instruct, dashed line).
**Figure 7.** Example of annotation template for pairwise simulation fidelity evaluation.