Human Psychometric Questionnaires Mischaracterize LLM Behavior

Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo

Human psychometric questionnaires fail to predict LLM behavior because they rely on transparent lexical cues rather than stable dispositions.

Do human psychometric questionnaires actually measure LLM personality/values, or are they just measuring the model's ability to mimic survey-taking behavior?

Researchers often use human personality and value questionnaires to profile Large Language Models (LLMs), assuming these scores predict how models behave in real-world interactions. The authors show that these questionnaire profiles diverge sharply from actual generation behavior, as models recognize explicit lexical cues in survey items and respond with socially desirable, alignment-consistent answers. When tested on realistic user queries that lack these cues, the internal construct structure of questionnaire-based profiles disappears, revealing that models do not possess the stable psychological dispositions these tests claim to measure.

Paper Primer

The study compares two profiling methods across eight open-source LLMs: Likert-based self-reports (e.g., BFI-44, PVQ-40) and generation probability scores derived from the Value Portrait dataset. The core move is to measure the log-probabilities of a model generating construct-validated responses to realistic user queries, bypassing the "reflective" self-report format that invites alignment-driven bias.

Questionnaire-based profiles and generation-based profiles show low construct-level agreement.

Spearman’s $\rho$ for cross-method agreement drops to 0.11–0.36, compared to 0.74–0.77 for within-method consistency. Cross-method correlation is roughly 40% of the within-method reference, with several models yielding negative correlations.

Persona-induced shifts in questionnaire responses do not translate to generation behavior.

Persona prompts (e.g., "elderly," "right-wing") shift Likert scores in stereotype-consistent ways, but produce directionally incoherent generation-probability shifts. Generation-based persona shifts show a mean cosine similarity of -0.03 with human demographic patterns, indistinguishable from chance.

Why do LLMs show consistent construct structure on questionnaires if they lack stable dispositions?

The items in established questionnaires contain explicit lexical cues that allow models to identify the target construct and adjust their responses to match alignment-consistent, socially desirable norms.

What is the primary limitation of the generation-probability method used here?

The method relies on fixed, construct-validated candidate responses, which provides a controlled behavioral proxy but does not capture how psychological tendencies manifest in unconstrained, free-form generation.

Researchers should treat questionnaire-based LLM profiles as measures of alignment-driven response patterns rather than stable psychological traits, and prioritize generation-based evaluation for predicting real-world model behavior.

The Psychometric Mimicry Problem

We expose why questionnaire scores misrepresent LLM behavior in real interactions.

Human psychometric questionnaires have become a popular shortcut for profiling large language models, but this practice assumes that models’ questionnaire answers reflect their actual generative behavior. In reality, the scores derived from Likert‑style self‑reports diverge sharply from profiles built on generation probabilities over realistic user queries.

The models learn to imitate the surface form of questionnaire items without adopting the underlying behavioral dispositions those items aim to measure.

We formalize four research questions to probe this phenomenon. RQ1 asks whether construct‑level profiles from questionnaires and from generation probabilities agree; RQ2 examines whether intra‑construct item consistency—often taken as evidence of stable dispositions—holds for generation‑based scores. RQ3 investigates item textual transparency as the causal mechanism behind any divergence, and RQ4 tests whether persona prompts that shift questionnaire scores also shift generation‑based profiles.

The essential gap is that survey‑taking formats capture surface compliance, not the deeper behavioral reality of LLMs.

Prior Psychometric Assessment of LLMs

Prior studies on LLM psychometrics and the rise of generation‑probability probing.

Research on large language models (LLMs) has long borrowed psychometric questionnaires—most notably the Big Five Inventory (BFI) and the Portrait Values Questionnaire (PVQ)—to infer personality traits and values from model outputs.

BFI and PVQ are standardized surveys that map responses onto a fixed set of personality factors or value dimensions, providing a common language for comparing individuals.

**Table 1.** Comparison with prior work on the gap between questionnaire responses and behavior in LLMs. ◑ indicates partial or implicit support.

Beyond questionnaire scores, a growing literature probes LLM behavior with scenario‑action items, situational‑judgment tests, and downstream tasks, exposing a systematic gap between self‑reported traits and observable actions.

Generation‑probability measurements read the model’s log‑probabilities for candidate responses, offering a prompt‑robust proxy for behavior that avoids reliance on the model’s meta‑judgments about its own dispositions.

Comparing Survey Scores and Generation Probabilities

We describe two lenses for extracting psychological profiles from LLMs and how they are operationalized.

We compare two profiling lenses: traditional questionnaire scoring and a generation‑probability approach that treats log‑probabilities of construct‑aligned responses as behavioral evidence.

Instead of asking the model to self‑report, we let it “show” a trait by measuring how likely it is to emit a response that humans have already linked to that trait.

How does VP scoring differ from simply taking the highest‑probability response for each construct?

VP scoring aggregates *all* tagged responses within a scenario, not just the top one. This reduces variance caused by a single outlier response and captures the model’s overall inclination toward the construct across multiple plausible utterances.

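As a worked example, suppose construct C has two tagged responses in each of two scenarios, with the log-probabilities shown below.

Scenario A average = (-2.0 + -3.0) / 2 = -2.5.

Scenario B average = (-1.5 + -4.5) / 2 = -3.0.

Macro-average score(C) = (-2.5 + -3.0) / 2 = -2.75.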

The score reflects the model’s overall preference for construct C across diverse contexts, not just a single “best” utterance.
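The arithmetic above can be reproduced with a minimal sketch. The function name `vp_macro_score` and the input layout (a mapping from scenario name to the log-probabilities of its construct-tagged responses) are illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean

def vp_macro_score(scenario_logprobs):
    """Average within each scenario first, then average across scenarios."""
    per_scenario = [fmean(lps) for lps in scenario_logprobs.values() if lps]
    return fmean(per_scenario)

# Log-probabilities of responses tagged with construct C (numbers from the example above).
construct_c = {
    "scenario_A": [-2.0, -3.0],
    "scenario_B": [-1.5, -4.5],
}
score_c = vp_macro_score(construct_c)
print(score_c)
```

Scenarios with no tagged response for the construct are skipped here; how the original pipeline treats them is not stated in the text.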

We treat each psychological construct (e.g., Openness, Benevolence) as a dimension and assign it a numeric profile derived from the VP scores of all scenarios that mention it.

These three numbers form the partial value profile vector (‑2.1, ‑3.4, ‑1.8).

Missing constructs are filled with their own VP scores computed analogously, yielding a full 10‑dimensional vector.

The profile captures the relative strengths and weaknesses of the model across all measured values, enabling direct comparison with human norm vectors.

Compute VP scores for all constructs.

We evaluate eight open‑source models (Gemma‑3, Qwen 2.5, Qwen 3, GPT‑OSS) in both small and large variants, applying the two profiling lenses uniformly.

Results: Do Questionnaires Predict Behavior?

Questionnaire scores diverge from generation‑based profiles and lack consistent construct structure.

The central premise is that questionnaire scores reflect prompt-format artifacts rather than genuine model dispositions; we now test it through four concrete research questions.

Persona prompting prepends a short description of a demographic or social identity to the model’s context, steering its subsequent outputs toward that imagined persona.

How does persona prompting differ from simply adding a demographic keyword to the query?

In persona prompting the description is a full sentence that the model can attend to as context, not a token‑level tag; this gives the model a richer narrative to condition on, affecting both lexical choice and higher‑level value expression.

Cross‑method rank correlation drops by roughly 0.45 on average compared with within‑method agreement.

Table 2 shows Spearman $\rho$ values around 0.30 for generation‑vs‑questionnaire versus ≈0.75 for questionnaire‑vs‑questionnaire.
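For readers who want to see what the agreement statistic measures, here is a small standard-library sketch of Spearman's $\rho$ between two construct-level profiles. The profile values are hypothetical, and the helper names (`ranks`, `pearson`, `spearman`) are illustrative; a library routine such as SciPy's `spearmanr` would normally be used instead.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Hypothetical five-construct profiles for one model (not values from the paper).
questionnaire_profile = [4.2, 3.1, 3.8, 2.5, 4.6]       # Likert means
generation_profile = [-31.0, -28.5, -35.2, -30.1, -29.4]  # mean log-probabilities
rho_cross = spearman(questionnaire_profile, generation_profile)
print(round(rho_cross, 3))
```

Because only ranks enter the statistic, the different scales of Likert means and log-probabilities do not matter; what counts is whether the two methods order the constructs the same way.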

The rank‑trajectory plots in Figure 15 reveal that while questionnaire‑based rankings stay parallel across models, generation‑based rankings cross frequently, confirming model‑specific directionality of the divergence.

**Figure 15.** Schwartz value rank trajectories across the three measurement approaches (PVQ-40, PVQ-21, VP generation probability) for each model. The two leftmost columns (PVQ-40 and PVQ-21) are within-method references; the rightmost column (VP) is the generation probability profile. Each line represents one of the 10 Schwartz values; crossing lines indicate rank disagreement.

For established questionnaires, construct membership explains roughly half of item-score variance ($\eta$² ≈ 0.49–0.53), whereas for generation probability scores it explains essentially none ($\eta$² ≈ 0.00).

Table 3 reports $\eta$² values of 0.526 for PVQ‑40 and 0.492 for BFI‑44, contrasted with near‑zero $\eta$² for the VP profiles.
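As a reminder of what $\eta$² captures, the sketch below uses the standard one-way definition: between-construct sum of squares divided by total sum of squares. It assumes item scores are grouped by construct; the function name and the toy data are illustrative, and the paper's exact estimation procedure may differ.

```python
def eta_squared(groups):
    """One-way eta squared: SS_between / SS_total over item scores grouped by construct."""
    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return ss_between / ss_total if ss_total > 0 else 0.0

# Toy data: items of the same construct score alike (high eta squared) ...
clustered = {"construct_A": [5.0, 5.0], "construct_B": [1.0, 1.0]}
# ... versus items whose scores are unrelated to their construct label (eta squared of 0).
unstructured = {"construct_A": [1.0, 3.0], "construct_B": [1.0, 3.0]}

eta_clustered = eta_squared(clustered)
eta_unstructured = eta_squared(unstructured)
print(eta_clustered, eta_unstructured)
```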

Item‑construct recognition F1 for VP items hovers around 0.08, essentially chance, while established items achieve F1≈0.75.

Table 4 lists VP F1 scores between 0.05 and 0.11 versus 0.49–0.99 for questionnaire items.

**Figure 1.** Cosine similarity heatmaps for PVQ-40 and VP value items. (a, b) Item–definition similarity. (c, d) Within-construct item similarity. Established items (a, c) show diagonal structure; VP items (b, d) do not.

Persona-induced VP shifts are essentially uncorrelated with human demographic patterns (average cosine ≈ 0.00), whereas questionnaire-based shifts show clear positive alignment (average cosine ≈ 0.47–0.60).

Table 5 reports mean VP cosine −0.03 versus +0.60/+0.47 for the two questionnaire variants.

Questionnaire scores and behavioral probabilities are largely uncorrelated.

Generation Probability Scoring Details

Ablation analyses of aggregation choices, metrics, and statistical significance.

Each scenario contributes one average log‑probability per construct, preventing scenarios with many tagged responses from dominating the construct score.

How does this macro‑average differ from the flat micro‑average?

The micro‑average computes a single mean over *all* tagged responses, so a scenario that happens to produce many $c$‑tagged outputs pulls the overall score toward its context. The macro‑average first normalizes within each scenario, then treats every scenario equally, eliminating that bias.
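The contrast is easiest to see with unbalanced scenarios. In this sketch the data are hypothetical and the function names are illustrative: one scenario has four tagged responses and the other has one.

```python
from statistics import fmean

def micro_average(scenarios):
    """Single mean over all tagged responses, regardless of scenario."""
    return fmean(lp for lps in scenarios.values() for lp in lps)

def macro_average(scenarios):
    """Mean of per-scenario means, so every scenario counts once."""
    return fmean(fmean(lps) for lps in scenarios.values())

# Hypothetical log-probabilities for one construct.
scenarios = {
    "s1": [-1.0, -1.0, -1.0, -1.0],  # many tagged responses
    "s2": [-5.0],                    # a single tagged response
}
micro = micro_average(scenarios)
macro = macro_average(scenarios)
print(micro, macro)
```

The micro-average sits close to scenario `s1` because it contributes four of the five responses, while the macro-average lands midway between the two scenario means.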

For each model $i$, compute the paired difference $d_i = \rho_{\text{within},i} - \overline{\rho}_{\text{cross},i}$.

Enumerate all $2^{8}=256$ possible assignments of a $+$ or $-$ sign to the eight $d_i$ values.

For each sign pattern, calculate the mean of the signed $d_i$.

The one‑sided $p$‑value is the fraction of sign patterns whose mean is at least as large as the observed mean.

Repeat the enumeration for the pooled set of $16$ differences (8 models $\times$ 2 construct groups) to obtain the overall $p$‑value (exactly $2^{16}=65{,}536$ patterns).

Draw $10{,}000$ bootstrap resamples of the $d_i$ values and report the $95\%$ percentile interval for the mean difference as a descriptive complement.
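The steps above can be sketched as an exact sign-flip test plus a percentile bootstrap. The eight differences below are hypothetical placeholders, not the values in Table 10, and the function names and the seed are assumptions made for illustration.

```python
import random
from itertools import product
from statistics import fmean

def sign_flip_p_value(diffs):
    """Exact one-sided p-value: share of sign patterns with mean >= observed mean."""
    observed = fmean(diffs)
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        # Small tolerance so the identity pattern is not lost to rounding.
        if fmean(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            hits += 1
    return hits / total

def bootstrap_ci(diffs, n_resamples=10_000, seed=0):
    """Approximate 95% percentile interval for the mean difference."""
    rng = random.Random(seed)
    means = sorted(
        fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples) - 1]

# Hypothetical per-model differences d_i (within-method rho minus cross-method rho).
diffs = [0.41, 0.52, 0.38, 0.61, 0.47, 0.55, 0.33, 0.49]
p_value = sign_flip_p_value(diffs)   # enumerates 2**8 = 256 patterns
ci_low, ci_high = bootstrap_ci(diffs)
print(p_value, ci_low, ci_high)
```

With eight models the smallest attainable one-sided p-value is 1/256, which is reached only when every difference is positive, as in this toy input.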

**Table.** Model performance metrics across five personality traits: Agreeableness, Conscientiousness, Extraversion, Neuroticism, and Openness.

**Table 10.** Per-model paired differences ($d_i = \text{within}_i - \text{cross}_i$) for Spearman $\rho$. Positive values indicate that the within-method baseline exceeds the cross-method agreement between generation-probability and established-questionnaire profiles.

**Table 11.** PVQ-40 established questionnaire scores (construct-level means, prompt-averaged) for each model. These are Likert-scale averages on the 10 Schwartz values.

**Table 12.** Generation probability profile scores for the 10 Schwartz values. Values are mean total log-probabilities of construct-tagged VP responses ($r \geq 0.3$).

**Table.** Per-model scores for the ten Schwartz values (Ach, Ben, Con, Hed, Pow, Sec, SD, Sti, Tra, Uni), with $\eta^2$ and WMV.

**Table 14.** Generation probability profile scores for the Big Five traits. Values are mean total log-probabilities of construct-tagged VP responses ($r \geq 0.3$).

Permutation Test Results

Appendix G reports permutation tests and within‑construct variance revealing divergent patterns between questionnaires and generation‑probability profiles.

The appendix first presents permutation‑test statistics for both traditional questionnaires and the VP scoring method, then examines how tightly items cluster within each construct.

**Figure 16.** Big Five trait rank trajectories across the three measurement approaches (BFI-44, BFI-10, VP generation probability) for each model. The two leftmost columns (BFI-44 and BFI-10) are within-method references; the rightmost column (VP) is the generation probability profile. Each line represents one of the 5 Big Five traits.

Together the permutation‑test $\eta^2$ values and WMV metrics demonstrate that questionnaire‑based scores are internally consistent, whereas VP profiles are markedly less coherent, especially for Neuroticism.

Supplementary RQ3 Materials

Appendix H dissects per‑construct recognition and embedding analyses, exposing systematic gaps between established questionnaires and VP items.

This appendix expands the two analyses from §4.3: (1) per‑construct $F_1$ scores that pinpoint which traits are recognisable, and (2) sentence‑embedding diagnostics that test whether construct identity is signalled by surface text.

Large LLMs achieve near‑perfect BFI trait recognition while the smallest model falls to just over half the score on the hardest traits.

Table 21 shows GPT‑OSS‑120B and Qwen3‑235B reaching $F_1$ ≥ 0.98 on all five BFI traits; Gemma3‑4B’s worst scores are $F_1$ = 0.53 on Agreeableness and Neuroticism.

PVQ items consistently expose a high‑recognition / low‑recognition split: Achievement and Hedonism are easy, whereas Security, Power and Tradition are hard.

Table 22 reports $F_1$ ≥ 0.85 for Achievement and $F_1$ ≥ 0.75 for Hedonism across models, while Security, Power and Tradition often dip below $F_1$ = 0.60.

VP items remain poorly recognised; the best construct (Agreeableness) never exceeds $F_1$ ≈ 0.27.

Table 23 shows the highest VP $F_1$ values at 0.27 for Agreeableness (Gemma3‑4B) and 0.25 for Stimulation (Gemma3‑4B); all other constructs stay below 0.20.

Established questionnaires yield 77–81 % Top‑1 accuracy and discrimination of 0.13–0.22, whereas VP items hover at chance level.

Table 24 reports Top‑1 accuracy 0.77–0.81 and discrimination 0.13–0.22 for PVQ‑40 / BFI‑44; VP items achieve near‑zero discrimination and ≈0 % accuracy.

Item‑within‑construct clustering gaps are positive (0.07–0.15) for established questionnaires but essentially zero (−0.001 to +0.004) for VP items.

Table 25 shows gaps of 0.07–0.15 for PVQ‑40 and BFI‑44, while VP rows report gaps between −0.001 and +0.004.

Across five sentence encoders, the superiority of established questionnaires over VP items persists, confirming the pattern is encoder‑invariant.

Table 27 lists consistently higher discrimination (≈0.13) and clustering gaps (≈0.12) for PVQ‑40 / BFI‑44 than for VP items, regardless of encoder.

**Figure 17.** Item-definition discrimination across five sentence encoders. Established questionnaires (PVQ-40, BFI-44) consistently show higher discrimination than VP outputs, regardless of encoder choice.

**Figure 18.** Within-construct clustering gap across five sentence encoders. Established questionnaires show positive gaps (within-construct items are textually more similar), while VP outputs show near-zero gaps.

**Figure 19.** Cosine similarity heatmaps for BFI-44 and VP BFI-trait items, parallel to Figure 1. (a, b) Item–definition similarity. (c, d) Within-construct item similarity. Established items (a, c) show clear diagonal structure; VP items (b, d) do not.

Conclusion and Limitations

We recap the misalignment between questionnaire scores and generation probabilities, then outline key study limitations.

Questionnaire‑based psychometric scores capture how language models answer survey prompts, yet they diverge from the models’ own generation‑probability scores. This gap shows that questionnaire results reflect prompt format rather than genuine underlying dispositions.

Across eight open‑source LLMs we find (1) questionnaire profiles do not align with generation‑probability profiles; (2) the apparent construct structure of questionnaire‑based profiles is largely driven by the transparency of item wording; and (3) persona prompting shifts questionnaire responses in stereotype‑consistent ways but produces directionally incoherent generation‑probability shifts that do not mirror human demographic patterns.

These findings imply that questionnaire scores alone are insufficient evidence of model‑level psychological characteristics. We therefore recommend that future work supplement or replace established questionnaires with generation‑probability‑based evaluation when the goal is to characterize model behavior rather than responses to self‑report instruments.

Our framework depends on token‑level log‑probabilities, which confines the current analysis to open‑source models. Extending the approach to closed‑source models would require either API‑level log‑probability access or alternative probability‑estimation techniques.

We also rely on the Value Portrait dataset, which covers Schwartz values and Big Five traits within a single benchmark; expanding the scenario pool to include additional psychological constructs and more diverse contexts would improve generalizability. The persona‑prompting comparison uses the European Social Survey, providing Schwartz value data but lacking Big Five measures, so incorporating large‑scale personality surveys would enable parallel analysis for the trait domain.

Finally, we do not provide a matched human baseline for RQ1‑RQ3, which is needed to determine whether the observed gaps are specific to LLMs or reflect the well‑known self‑report/behavior gap in human psychology. Our generation‑probability score is a controlled behavioral measurement that sits between Likert questionnaires and unconstrained generation—it is invariant to sampling temperature and paraphrase variation but does not capture how psychological tendencies manifest in freely generated text. Complementing this design with analysis of open‑ended outputs is a natural next step.

**Table 6.** Schwartz’s 10 basic human values and their definitions, adopted from Schwartz (2012).

Inference Configuration

Inference setup and generation parameters for the experiments.

All open‑weight models were served locally with vLLM v0.16.0 on a single node equipped with four NVIDIA A100 80 GB PCIe GPUs. Models were loaded in bfloat16 precision, except Qwen3‑235B‑A22B, which uses an FP8‑quantized checkpoint while the compute dtype remains bfloat16. No additional post‑training quantization such as AWQ or GPTQ was applied.

For the established surveys (Likert scoring) we called the /v1/chat/completions endpoint with the model's native chat template, setting temperature = 0.0 and `max_tokens` = 1024 to obtain deterministic responses.

For Value Portrait log-probability scoring we used the /v1/completions endpoint with echo = True, `max_tokens` = 1, and temperature = 1.0. The single generated trailing token is discarded, and the log-probabilities of the echoed tokens give the model's probability of the candidate response. The full user–assistant turn was built with the model's chat template (`add_generation_prompt` = False), and response-token log-probs were then extracted by slicing at the prompt boundary.
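A minimal sketch of the echo-based scoring step follows. It assumes the OpenAI-style completions response in which `token_logprobs` is a list covering the echoed prompt tokens followed by the generated token; the helper names, the `logprobs` value, and the example numbers are assumptions, and no server is contacted here.

```python
def build_echo_request(model, full_text):
    """Payload for a /v1/completions call that scores existing text."""
    return {
        "model": model,
        "prompt": full_text,   # chat-templated user turn plus candidate response
        "max_tokens": 1,       # one trailing token is generated and later dropped
        "temperature": 1.0,
        "echo": True,          # return log-probs for the prompt tokens themselves
        "logprobs": 1,         # assumption: any setting that enables per-token log-probs
    }

def response_logprob(token_logprobs, n_prompt_tokens, n_generated=1):
    """Total log-probability of the response tokens only."""
    echoed = token_logprobs[: len(token_logprobs) - n_generated]  # drop trailing token
    return sum(echoed[n_prompt_tokens:])                          # slice at prompt boundary

# Hypothetical response: 3 prompt tokens, 2 response tokens, 1 generated token.
# The first echoed token has no conditional log-prob, hence None.
example_logprobs = [None, -0.5, -1.2, -0.3, -0.7, -2.0]
total = response_logprob(example_logprobs, n_prompt_tokens=3)
print(total)
```

The prompt-boundary index would in practice come from tokenizing the templated user turn without the candidate response; that step is omitted here.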

Prompt Templates

Appendix D details the prompt templates for questionnaires, generation, and construct recognition.

For each established questionnaire (PVQ‑40, PVQ‑21, BFI‑44, BFI‑10) we define two prompt variants that keep the original item wording but reverse the order of response options. Variant 1 presents options from high to low, while Variant 2 presents them from low to high. This design isolates the effect of response ordering on model answers.

In the generation‑probability condition we use two prompt templates that differ only in the source context of each VP scenario. No Likert‑scale options are shown; instead the model’s log‑probability for each of five hidden candidate responses is computed as described in §3.2. The candidates are evaluation targets only and are never displayed to the model during generation.

The VP Generation Prompt comes in two flavors: a Human‑Human advisory context where the model is asked to react naturally to a described scenario, and a Human‑LLM chat where the model replies to a message as if in a conversation. In both cases the model’s response is recorded verbatim for later probability scoring.

For item‑construct recognition we pair each questionnaire item with a candidate construct and its definition, then ask the model to answer “yes” or “no” to whether the item primarily measures that construct. This binary format is applied to both PVQ items and BFI statements, including reverse‑scored BFI items.
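One way the yes/no judgments could be turned into a per-construct $F_1$ is sketched below. The record layout (true construct, candidate construct, model answer) and the function name are assumptions for illustration; the construct labels in the toy data are only examples.

```python
def f1_for_construct(judgments, construct):
    """F1 for one construct from (true_construct, candidate_construct, said_yes) records."""
    tp = fp = fn = 0
    for true_c, candidate_c, said_yes in judgments:
        if candidate_c != construct:
            continue  # only pairings that ask about this construct
        if said_yes and true_c == construct:
            tp += 1   # correctly recognised
        elif said_yes:
            fp += 1   # claimed for an item of another construct
        elif true_c == construct:
            fn += 1   # missed an item that does measure it
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy judgments for the candidate construct "Power".
judgments = [
    ("Power", "Power", True),
    ("Power", "Power", False),
    ("Security", "Power", True),
    ("Security", "Power", False),
]
f1_power = f1_for_construct(judgments, "Power")
print(f1_power)
```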

To validate that VP candidate log‑probabilities lie within a reasonable region of each model’s generation distribution, we sample ten free‑form responses per model at temperature = 1 for every VP scenario. We then compute the log P of each sampled response using the same echo‑based method as the VP evaluation and compare these values to the VP candidate scores.

Persona Prompting Details

This appendix outlines the persona prompts, centering, delta computation, and statistical tests used to compare LLM and human value profiles.

The study first defines eight demographic personas and injects them as system prompts before any Likert or generation‑probability query.

**Table 28.** Persona prompt wording for each demographic condition (RQ4). All prompts are injected as the system message; the user-turn instruction and survey items remain identical to RQ1.

Human reference profiles come from the European Social Survey (ESS) Round 11, yielding 37,398 complete respondents after filtering.

This table presents bootstrap and permutation results for three measurement approaches (PVQ-40 Likert, PVQ-21 Likert, and VP Gen-Prob).

Statistical significance of direction‑match is assessed with a one‑sided binomial test (null $p=0.5$) and aggregated across conditions.
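The one-sided binomial test is simple enough to compute directly. The sketch below assumes the test counts how many value dimensions shift in the same direction as the human reference; the function name and the 8-of-10 example are illustrative.

```python
from math import comb

def binomial_p_one_sided(successes, trials, p0=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical condition: 8 of 10 value dimensions match the human direction.
p_match = binomial_p_one_sided(8, 10)
print(p_match)
```

With ten dimensions per condition, at least nine matches are needed to fall below 0.05 in a single condition, which is why aggregating across conditions matters for power.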

**Table 30.** Direction agreement by condition with binomial test p-values ($H_0$: agreement = 50%, one-sided). Bold indicates $p < 0.05$.

Per‑condition cosine similarities and direction‑match counts are summarized in Table 31, while Table 32 aggregates these metrics across the seven LLMs.

The table presents mean and median effect sizes (ES) for four sources: PVQ-40 Likert, PVQ-21 Likert, VP Gen-Prob, and Human (ESS).

**Table 32.** Per-model cosine similarity between LLM deltas and human deltas. For each model, the eight per-condition 10-dimensional deltas are concatenated into a single 80-dimensional vector, and one cosine is computed against the analogously concatenated human delta; means are then taken across the seven models. This concatenated form is sensitive to large-magnitude dimensions and therefore differs from the per-condition cosine averages reported in Table 5 of the main text (PVQ-40 mean +0.60 vs. +0.43 here). PVQ–VP shows the direct agreement between Likert and generation-probability deltas under the same concatenated procedure.

**Table 33.** Cross-model consensus on delta direction. “Strong” = $\ge$ 6/7 models agree; “Weak” = < 5/7 agree. Percentages are over all 80 (condition $\times$ value) pairs. Per-model directions are taken from sign($\Delta$) with zero (no-shift) values counted as part of the negative-direction group; rows with exactly 5/7 agreement fall in the intermediate (neither-strong-nor-weak) category and therefore do not appear in either count.
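The consensus rule in the Table 33 caption can be written as a small classifier. This is a sketch of the stated thresholds for seven models; the function name and the sample deltas are hypothetical.

```python
def consensus_label(model_deltas, strong_at=6, weak_below=5):
    """Classify cross-model agreement on the sign of one (condition, value) delta."""
    # Zero (no-shift) deltas are grouped with the negative direction, as in the caption.
    positive = sum(1 for d in model_deltas if d > 0)
    agree = max(positive, len(model_deltas) - positive)
    if agree >= strong_at:
        return "strong"
    if agree < weak_below:
        return "weak"
    return "intermediate"  # exactly 5 of 7 agree

# Hypothetical deltas from seven models for three (condition, value) pairs.
label_a = consensus_label([0.2, 0.1, 0.3, 0.4, 0.1, 0.2, -0.1])    # 6 positive
label_b = consensus_label([0.2, 0.1, 0.3, 0.4, -0.1, -0.2, 0.0])   # 4 positive
label_c = consensus_label([0.2, 0.1, 0.3, 0.4, 0.1, -0.2, -0.1])   # 5 positive
print(label_a, label_b, label_c)
```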

**Table 34.** Bootstrap 95% confidence intervals and permutation test results for aggregate cosine similarity (Human vs. LLM average delta). Resampling is over the 8 demographic conditions (10,000 iterations each).

Value-level delta profiles (Table 36) show the direction of each dimension's shift for humans versus LLMs, highlighting where the models diverge from human patterns.

Finally, cross-model variance (Table 39) indicates that political conditions produce the greatest disagreement among models, whereas gender and education conditions are more stable.

Model-Level Delta Tables

This section details the delta tables and statistical analyses comparing human and LLM persona effects.

Persona prompts (Table 28) are injected as system messages before the model generates either Likert ratings (PVQ-40, PVQ-21) or generation probabilities (VP). The ESS sample (Table 29) provides 37,398 human respondents, enabling ipsative centering to remove individual scale bias before computing deltas. Deltas are then compared with cosine similarity, direction-match counts, and standardized effect sizes.
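The centering and comparison steps can be illustrated as follows. This sketch assumes a delta is the persona (or subgroup) profile minus a baseline profile after ipsative centering; that definition, the function names, and the three-dimensional toy vectors are assumptions, since the text does not spell them out.

```python
from math import sqrt

def ipsatize(scores):
    """Subtract the respondent's (or model's) own mean across all values."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

def delta(persona_profile, baseline_profile):
    """Per-value shift of a persona profile relative to the baseline."""
    return [p - b for p, b in zip(persona_profile, baseline_profile)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def direction_matches(u, v):
    """Count dimensions whose shifts share a sign (zero counts as non-positive)."""
    return sum(1 for a, b in zip(u, v) if (a > 0) == (b > 0))

# Toy three-value profiles (real profiles have ten Schwartz values).
llm_delta = delta(ipsatize([4.0, 2.0, 3.0]), ipsatize([3.0, 3.0, 3.0]))
human_delta = [0.5, -0.5, 0.0]
similarity = cosine(llm_delta, human_delta)
matches = direction_matches(llm_delta, human_delta)
print(llm_delta, similarity, matches)
```

Centering first means a persona that simply rates everything higher produces no delta; only changes in the relative priority of values register.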
