
Clearing the Consensus Bottleneck: How AI Could Help Query the Right Humans
By Phil Howard
May 1, 2026 - My colleagues and I have spent the last year on a question that current research cannot settle: does social media harm adolescents? Clinical case reports, qualitative interviews, and behavioral audits often describe specific damage in specific people. Researchers at the Center for Countering Digital Hate set up fresh TikTok accounts at the platform's minimum age of 13, paused briefly on body-image content, and watched the algorithm serve suicide content within three minutes and eating-disorder content within eight. In contrast, giant population-scale analyses find effects so small that Amy Orben and Andy Przybylski, in a widely cited 2019 paper, compared the negligible effects of technology use on teens to the harm of eating potatoes.
Both are careful pieces of research. But we have a consensus bottleneck. National funders fund disciplinary work. Universities promote within disciplines. Conferences cluster by method. AI is accelerating discovery, but the consensus apparatus has barely evolved.
The bottleneck is organizational, not technological
The consensus problem is not principally about method. We have good methods. The problem is that the people running them rarely encounter each other, and that the methods themselves (systematic reviews, expert surveys, meta-analyses, randomized trials) were built for a quieter age, before linguistic diversity, publication volume, and disciplinary spread overwhelmed them.
Existing organizations cannot fix this. National science agencies are domestic by mandate. Large funders privilege the hard sciences and are wary of organizational work that does not produce a dataset. Disciplinary societies optimize for membership retention, not boundary-crossing. Lancet commissions, National Academies reports, and WHO panels are slow, expensive, narrow in panel composition, and biased toward expertise legible to the convening institution. By the time a major commission produces its report on adolescent mental health, the platforms it studied have changed and a generation of users has aged through adolescence.
In our Oxford Martin School work, we encountered this gap repeatedly: not just in adolescent mental health, but also in online radicalization, election integrity, vaccine hesitancy, and behavioral addiction. The literature does not converge. The relevant experts cannot be assembled. The policy clock does not wait. Platforms design and release technologies without much independent safety review.
The response has been a wave of new, scientist-led, network-based organizations: pop-up journals, Focused Research Organizations, living meta-analyses, problem-specific scientific panels, all built because the existing structures cannot scope the work we need now. We need an intervention that does for scientific consensus what arXiv did for preprints: infrastructure that lowers a coordination cost so sharply that the community routes around the existing system.
The hypothesis
If we can use AI to locate where relevant human expertise actually sits across a global research network, and to prompt those experts with structured questions whose responses can be aggregated, then we can assemble agreement and mapped disagreement on contested empirical questions faster, more inclusively, and at lower cost than current convening institutions allow.
This is a hypothesis with clear failure conditions, not an institutional proposal. It is also not a claim that AI will read the papers for us or run the meta-analysis. Those moves embed AI in existing scientific labor and reinforce existing biases, because both are constrained to what is already published in indexed venues, in dominant languages, by credentialed authors.
Testing this hypothesis means inverting the usual arrangement. Rather than only having researchers query AI, the more valuable role for AI in science right now may be to have AI prompt humans.
Getting to where the knowledge lives
Scientific consensus depends on the completeness of the evidence it draws from, and running this experiment would recover three bodies of evidence that conventional reviews miss.
- Para-academic knowledge. PhDs working in journalism, government, industry, civil society, and multilateral agencies produce rigorous evaluations and field analyses invisible to a PRISMA-based review. Their findings rarely reach PubMed, but may appear in OpenAlex.
- The non-Anglophone scholarly record. Vibrant literatures on technology, behavior, and public health are indexed through CNKI (Chinese), Redalyc (Spanish and Portuguese), and Episciences (French), and they carry empirical findings from populations the English-language literature barely studies; a sketch of one discovery query appears below. (We still need close evaluation of all knowledge production systems.)
- What's in the file drawer. The "file drawer problem," coined by Robert Rosenthal in 1979, names the bias that arises when null results stay unpublished. Asking researchers directly what did not work recovers what publication selects out.
AI-mediated elicitation may be one of the few practical ways to find this work, evaluate it, and bring its authors into structured conversation.
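To make the discovery step concrete: OpenAlex exposes a public REST API that can filter works by language, which is one practical way to surface the non-Anglophone and para-academic record. Here is a minimal sketch; the search string, language codes, and date cutoff are illustrative assumptions, not a fixed protocol.

```python
import requests

# OpenAlex's public works endpoint (no API key required).
OPENALEX_WORKS = "https://api.openalex.org/works"

def find_non_anglophone_works(query: str, languages: str = "es|pt|zh|fr") -> list:
    """Fetch recent works in the given languages that match a topic query.

    The query, language codes, and date cutoff are illustrative; a real
    elicitation pipeline would iterate over many queries and vocabularies.
    """
    params = {
        "search": query,
        # In OpenAlex filters, commas mean AND and pipes mean OR.
        "filter": f"language:{languages},from_publication_date:2019-01-01",
        "per-page": 25,
    }
    resp = requests.get(OPENALEX_WORKS, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["results"]

if __name__ == "__main__":
    for work in find_non_anglophone_works("adolescentes redes sociales salud mental"):
        authors = [a["author"]["display_name"] for a in work.get("authorships", [])]
        print(work.get("display_name"), "|", ", ".join(authors[:3]))
```

Usefully, each OpenAlex authorship record also carries institutional affiliations and country codes, which is exactly the metadata the inclusiveness evaluation described below needs.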
Denario is an exciting experiment that integrates literature search, hypothesis generation, code execution, and manuscript drafting through a transparent multi-agent architecture. In tests across 12 disciplines, roughly one in ten of its outputs surfaced a question worth pursuing, and the team is candid that it sometimes fabricates data or citations. The Santa Fe Institute's Project ARCH offers an intriguing direction: an orchestration layer that lets academic communities deploy and customize AI tools across institutions, with shared data infrastructure and open-source release. We are already using AI to query data; the next big step is to use AI to query humans.
The experiment: Using AI to query researchers
The pilot is concrete and capable of failing. Pick a single contested empirical question, for instance the magnitude of the effect of recommendation-driven content exposure on disordered eating in adolescent girls. Run two parallel processes to produce a structured findings document.
The control is conventional. Assemble a panel through a National Academies–style process. Weight by citation count and institutional prestige. Run a year-long deliberation. Publish an assessment.
The intervention is AI-mediated network elicitation. Use multilingual models with large-scale retrieval to discover, screen, translate, and integrate research across global publication, clinical practice, gray literature, and platform integrity work. Identify candidate experts, including those not central in citation networks but whose work bears directly on the question. Generate structured, translatable prompts and route them through deliberation systems like Pol.is that use clustering algorithms, argument mapping, and structured anonymity to organize expert reasoning. Iterate: surface disagreements, route follow-up questions to those best positioned to resolve them, learn which experts are most informative as the elicitation runs. Produce a structured document of agreement, mapped disagreement, and identified gaps, with explicit dissents preserved.
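As a sketch of the loop's shape, not an implementation: the stubs below stand in for the AI discovery, translation, and Pol.is-style clustering layers, none of which has a single canonical API, and every function, field, and value here is hypothetical.

```python
from __future__ import annotations
import random
from dataclasses import dataclass, field

@dataclass
class Expert:
    name: str
    language: str
    informativeness: float = 0.5  # would be reweighted as the elicitation runs (omitted here)

@dataclass
class Findings:
    agreements: list = field(default_factory=list)
    disagreements: list = field(default_factory=list)
    gaps: list = field(default_factory=list)

# Stubs: in a real pilot these are the model-driven discovery,
# translation, and Pol.is-style aggregation components.
def identify_experts(question: str) -> list[Expert]:
    return [Expert("A. Researcher", "es"), Expert("B. Clinician", "zh"),
            Expert("C. Analyst", "en")]

def elicit(question: str, expert: Expert) -> dict:
    """Stand-in for a structured, translated prompt and its response."""
    return {"expert": expert.name, "position": random.choice(["effect", "no effect"])}

def cluster(responses: list[dict]) -> tuple[str, set]:
    """Stand-in for Pol.is-style clustering of expert positions."""
    positions = {r["position"] for r in responses}
    return ("consensus" if len(positions) == 1 else "contested", positions)

def run_elicitation(question: str, rounds: int = 3) -> Findings:
    findings = Findings()
    experts = identify_experts(question)
    open_questions = [question]
    for _ in range(rounds):
        still_open = []
        for q in open_questions:
            verdict, detail = cluster([elicit(q, e) for e in experts])
            if verdict == "consensus":
                findings.agreements.append((q, detail))
            else:
                findings.disagreements.append((q, detail))
                # Route a sharper follow-up to those best placed to resolve it.
                still_open.append(f"follow-up on: {q}")
        open_questions = still_open
        if not open_questions:
            break
    findings.gaps = open_questions  # whatever never converged
    return findings

print(run_elicitation("Does recommender-driven exposure worsen disordered eating?"))
```

The point of the sketch is the routing: disagreement is not averaged away but sent back out as new, sharper questions.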
Evaluate both processes on three dimensions. The first is speed: how many months to convergence. The second is inclusiveness: what proportion of contributing voices come from non-Anglophone institutions and from the Global South. The third is validity: how well the structured findings hold up against subsequent empirical work.
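All three dimensions are measurable. Here is a minimal scoring sketch, in which the inclusiveness definition (share of contributors from non-Anglophone institutions) and the validity proxy (share of findings later corroborated) are assumptions open to argument, and the example numbers are invented for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class ProcessRecord:
    months_to_convergence: float
    contributors: list[tuple[str, str, bool]]  # (name, country_code, anglophone?)
    findings_corroborated: int  # findings later supported by new studies
    findings_total: int

def score(rec: ProcessRecord) -> dict:
    """Score one convening process on speed, inclusiveness, and validity."""
    non_anglo = sum(1 for _, _, anglo in rec.contributors if not anglo)
    return {
        "speed_months": rec.months_to_convergence,
        "inclusiveness": non_anglo / max(len(rec.contributors), 1),
        "validity": rec.findings_corroborated / max(rec.findings_total, 1),
    }

# Invented illustration: intervention arm vs. National Academies-style control.
print(score(ProcessRecord(4.0, [("A", "BR", False), ("B", "IN", False),
                                ("C", "GB", True)], 8, 10)))
print(score(ProcessRecord(14.0, [("D", "US", True), ("E", "GB", True)], 9, 10)))
```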
Perhaps the best evidence of success will be epistemological: which knowledge gaps get closed, and how quickly?
The experiment is informative, whichever way it lands. If the AI-mediated process matches or beats conventional convening on speed and inclusiveness while preserving validity, we have evidence for a new organizational primitive worth scaling. If it underperforms, if AI-surfaced experts prove systematically less reliable, or if the model amplifies the same Anglophone biases that dominate the existing literature, we have learned something specific about where AI does and does not belong in the seams between research communities.
This is an instance of an older idea. Stephen Barley's work on radiology departments adopting CT scanners showed that the consequential effects of new technology are rarely in what it does to the existing task. They are in how it reroutes who talks to whom, restructuring authority, expertise, and the lines along which knowledge moves through an organization. AI's most consequential role in science is not in the lab. It is at the seams between labs.
Why now?
Three enabling conditions have arrived together. Large language models are now usable across dozens of languages with retrieval over heterogeneous corpora. Structured human-elicitation tools developed over the last decade — Pol.is, Delphi-style platforms, prediction markets, networked deliberation systems — can be plugged in. And the policy demand is acute. Governments in Brussels, Westminster, and Washington are writing platform regulation right now, on the basis of literature that visibly does not converge. From our field-monitoring work at Oxford, we see new experimental and quasi-experimental studies on platform interventions appearing regularly, and static reviews are out of date soon after publication. Waiting another decade for a Lancet commission per question is not viable.
If AI-mediated elicitation can produce credible structured findings on monthly rather than yearly timescales, the bottleneck shifts to dissemination, and the traditional journal apparatus is not built for that pace. Adjacent experiments include Rapid Reviews, the Harvard Data Science Review, the Pop-Up Journal Initiative, and the broader publish-review-curate movement. Each treats scholarly publication as a continuous pipeline of knowledge rather than a fixed event or static artifact. The proposed experiment feeds those efforts, and they feed it back.
Who builds it?
This is not work a single university can do, or that a national funder will scope. It needs a dedicated, professionalized team to build and maintain the infrastructure, design elicitation protocols, run the pilots, and publish the methodology openly. The Denario and Project ARCH precedents are clear: code alone is not sufficient; a sustained global team is necessary.
The natural sites for field-testing this hypothesis are the new, scientist-led, network-based organizations mentioned above: places where gaps in knowledge appear and we need help identifying who can close them.
The deliverables are twofold. First, structured findings on contested empirical questions produced fast enough to inform live policy debates and inclusive enough to draw on expertise the current system loses. Second, a training pathway for a new kind of metascience practitioner, fluent in network-level elicitation, cross-linguistic aggregation, and the design of AI-mediated convening protocols.
Until we can ask the world's experts together, we’ll keep encountering seemingly irreconcilable observations: TikTok serving suicide content to individual users within three minutes, and globally aggregated effects no larger than the harm of eating potatoes. Both findings may be real; the work ahead is to reconcile them. Let’s run the experiment.