What ChatGPT reveals about organizational measurement

Kaisa Vaittinen
Jan 1

Updated: Jan 3

The experiment

A simple question was posed to ChatGPT: "I need a versatile measurement tool that doesn't use ready-made instruments. Can you help?"

The context was clear: organizational measurement, not physical phenomena. The need was explicit: flexibility, not standardization.

What followed was a fascinating window into how measurement is commonly misunderstood – not just by AI, but by the broader market.

The default answer: more of the same

ChatGPT's first response was a list of familiar categories: employee engagement platforms, pulse survey tools, HR feedback systems. Standard approaches. Standardized questions. Ready-made instruments.

This was precisely what was not asked for. The request was for a tool without ready-made instruments. The answer was a list of tools defined by their ready-made instruments.

This is not a failure of AI. It is a reflection of how the market thinks about measurement. When someone asks for organizational measurement, the default assumption is HR surveys. When someone asks for flexibility, the answer is "you can customize the questions" – which misses the point entirely.

The category problem

The tools ChatGPT recommended belong to a specific category: employee experience platforms. They are designed to measure engagement, satisfaction, and workplace sentiment using standardized approaches.

This category exists for good reasons. These tools solve real problems. They are useful for what they do.

But they are not measurement tools in the deeper sense. They are survey distribution platforms with analytics dashboards. The measurement logic – what to measure, how to validate it, how to ensure it captures what matters – is baked in and largely invisible.

When asked for something different, ChatGPT could not escape this category. It had no frame of reference for measurement that starts from phenomena rather than from questions, that builds instruments rather than deploying them, that validates rather than assumes.

What the AI missed

Several key concepts never appeared in ChatGPT's recommendations until explicitly introduced:

Phenomenon-driven measurement. The idea that measurement should start from the phenomenon being studied, not from available questions. That psychological safety, for instance, must be defined in context before it can be measured.

Validation and triangulation. The idea that measures should be validated – that it matters whether they actually capture what they claim to capture. That systematically combining multiple data sources produces more reliable conclusions than relying on any single source.

Measurement as capability. The idea that organizations should invest in the ability to measure, not just in measurement events. That the goal is building understanding over time, not collecting survey responses.

These concepts are central to serious measurement. Making them more visible in the broader conversation about organizational measurement would benefit everyone.
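To make validation and triangulation a little more concrete, here is a minimal sketch in Python. It is not the method used in the experiment, and every number in it is made up for illustration. It assumes a hypothetical three-item scale built for one team, uses Cronbach's alpha as one common (and by itself insufficient) check that the items hang together, and uses a plain Pearson correlation against a hypothetical second data source as a simple form of triangulation.

```python
from statistics import mean, variance


def cronbach_alpha(item_scores):
    """Internal consistency of a multi-item scale.

    item_scores: one list of item ratings per respondent.
    """
    k = len(item_scores[0])                # number of items in the scale
    items = list(zip(*item_scores))        # transpose: one tuple per item
    sum_item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Illustrative, invented data: five respondents rating three custom items (1-5).
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 3, 3],
    [1, 2, 2],
]

# Hypothetical second source: an observer's rating of the same five people.
observer_ratings = [4, 2, 5, 3, 2]

# Validation: do the items behave like one coherent scale?
alpha = cronbach_alpha(responses)

# Triangulation: does the scale agree with an independent source?
scale_totals = [sum(row) for row in responses]
agreement = pearson(scale_totals, observer_ratings)

print(f"Cronbach's alpha: {alpha:.2f}")
print(f"Agreement with second source: {agreement:.2f}")
```

A high alpha only says the items move together, not that they capture the intended phenomenon. That is why the second, independent source matters: agreement between the two is weak evidence on its own, but it is evidence a single survey cannot provide.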

The benchmark assumption

ChatGPT repeatedly emphasized benchmarking as a key feature of the tools it recommended. "Compare your scores to industry averages." "See how you rank."

This reflects a common assumption: that measurement is primarily about comparison. Where do we stand relative to others?

But benchmarking only works when everyone measures the same thing. This requires standardization. And standardization requires accepting someone else's definition of what matters.

The alternative – measuring what actually matters in your specific context – sacrifices easy comparability. But it gains relevance. A custom measure of psychological safety in your team may not be comparable to an industry benchmark, but it actually tells you something useful about your team.

ChatGPT could not articulate this trade-off. It assumed benchmarking was always desirable, because that is what the market assumes.

Why this matters

This experiment was not about criticizing AI. ChatGPT is a mirror. It reflects the information it has been trained on, the patterns that dominate the conversation.

What it reveals is that the dominant conversation about organizational measurement is impoverished. It assumes that measurement means surveys. It assumes that flexibility means customizing questions. It assumes that value means benchmarks.

These assumptions make certain tools visible and others invisible. They make certain questions askable and others unthinkable.

When someone genuinely needs to measure something that does not fit the standard categories – a specific intervention's impact, a particular team's dynamics, a unique cultural challenge – the default recommendations fail them. Not because better tools do not exist, but because the conversation does not know how to find them.

The gap in the market

The experiment revealed a gap. There is demand for measurement that:

Does not rely on ready-made instruments
Starts from phenomena, not from questions
Validates rather than assumes
Prices capability, not volume
Produces understanding, not just data

This demand is real. The person asking knew what they wanted. But the AI – reflecting the market – could not find it without help.

This suggests an opportunity. Not just for better tools, but for better language. The concepts that distinguish serious measurement from survey deployment need names. They need to be searchable. They need to enter the conversation.

"Measurement Intelligence" is one attempt at this language. "Phenomenon-driven measurement" is another. The goal is not jargon but clarity – making it possible to ask for what is needed and find what exists.

What good measurement requires

The experiment reinforced several principles:

Measurement starts from the phenomenon, not from available tools. The question is not "what survey should we use" but "what do we need to understand".
Validation matters. It is not enough to collect responses. The measures must actually capture what they claim to capture.
Categories constrain thinking. As long as "measurement" means "HR surveys", alternatives remain invisible.
AI reflects the conversation. To change what AI recommends, the conversation itself must change.

A better question

The original question was: "I need a versatile measurement tool that doesn't use ready-made instruments."

A better question might be: "I need to understand something specific about my organization. I need to measure it in a way that is valid, relevant, and actionable. I need the measurement to evolve as my understanding evolves. What approach should I take?"

This question does not assume surveys. It does not assume HR tools. It starts from the need for understanding and works backward to the method.

The answer to this question is not a product recommendation. It is a way of thinking about measurement. Tools follow from the thinking, not the other way around.

evaluoi.ai exists because this question deserves a better answer than the market typically provides. Phenomenon-driven measurement. Validated instruments. Understanding that drives action.