Forecasting: competitions vs consultants
Let’s say you have an important question about something that will happen in the future.
When will broadly superhuman AI be created, if it ever is?
If there is a direct war between nuclear powers, will nuclear weapons be used directly against civilian targets?
In the next 10 years, will a future variant of SARS2 cause as many deaths as have already been caused by SARS2?
How large will the market for petrol-powered cars be in 25 years?
There are at least two ways of getting answers to questions like this. First, you could pose the question as a forecasting competition - crowd forecasting platforms like Metaculus, Manifold Markets and Good Judgement Open allow people to predict events in exchange for points, and sometimes to win money if they achieve a high score, while prediction market platforms like Polymarket, PredictIt, Kalshi and Omen allow people to bet on the outcomes of events.
Second, you could contract someone to research the question for you. I think the usual case of contracted research doesn't produce forecasts as such; instead, it produces a collection of assumptions and results that you might be able to turn into a forecast if you add your own views about how likely each set of assumptions is to hold (see this analysis of cultured meat for an example: it reviews several analyses, each of which considers a different collection of assumptions and estimates the cost of cultured meat under those assumptions). Whether this kind of analysis is more or less useful for decision making than a direct forecast is a topic for another day. However, it is also possible to contract a team to produce a forecast for you - for example, Good Judgement Inc. offers this service.
My rough impression is that a modest majority of people excited about developing "forecasting technology" are focused on the competitive crowd forecasting side of things. Robin Hanson, one of the most prominent advocates of forecasting technology, is very enthusiastic about using prediction markets to make decisions. We can also observe that there seem to be many more platforms for competitive crowd forecasting than consultancies staffed by forecasters with exceptional track records (although: please correct me if I have this wrong).
However, I think consultant forecasts have some advantages over the competitive crowd forecast model. In particular, they might be more cost-efficient for some kinds of questions. Competitions have the advantage of directly incentivising accuracy, but when questions are complex, the ability to allocate research effort flexibly could end up mattering more than the better incentive alignment. If this is the case, two models that could do a reasonable job of balancing these considerations are:
Consultants efficiently structure and synthesise crowd forecasts to produce answers to key questions
Competitions are for reputation, consultants are for answers
Forecasting complex questions involves breaking them down into smaller questions
In this recent EA forum post, an outstanding forecasting team considers the question:
What is the risk of death in the next month due to a nuclear explosion in London?
This is an important question with clear implications for the decisions of people living in London. To answer it, the team breaks it down into four components:
Will Russia/NATO nuclear warfare kill at least one person in the next month?
If a Russia/NATO nuclear war kills at least one person, will London be hit by a nuclear weapon?
If a nuclear weapon hits London, will well-informed people be unable to escape in time?
If a nuclear weapon hits central London, what percentage of people will die?
The team notes that there are other ways to break the question down, but they settled on this decomposition by consensus.
So, in order to answer the top-line question with high action relevance, it is necessary to break it down into sub-questions and answer the sub-questions individually.
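To make the combination step concrete, here is a minimal sketch of how answers to sub-questions like these chain together into a top-line answer. The probabilities below are placeholders invented for illustration, not the team's actual forecasts:

```python
# Illustrative (made-up) answers to the four sub-questions above
p_war_kills_someone = 0.002         # Russia/NATO nuclear warfare kills at least one person
p_london_hit_given_war = 0.3        # London is hit, given such a war
p_cannot_escape_given_hit = 0.4     # well-informed people cannot escape in time, given a hit
p_death_given_hit_no_escape = 0.5   # proportion who die, given a hit and no escape

# The top-line answer is (roughly) the product along the chain of conditionals
p_death = (p_war_kills_someone * p_london_hit_given_war
           * p_cannot_escape_given_hit * p_death_given_hit_no_escape)
print(f"{p_death:.3%}")  # 0.012%
```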
A joint forecast is limited by the worst sub-forecast
Roughly speaking, the answer to the overall question is limited by the worst answer to one of the sub-questions. In order to make the example clear, we will have to consider two forecasters with improbably specialised knowledge:
Bert can, as well as anyone on the planet, forecast the likelihood of a Russia/NATO nuclear war next month, and he offers a credence of 0.05%
On the other hand, he absolutely cannot tell whether London would be hit by a nuclear weapon in the next month, whether or not a Russia/NATO nuclear war materialises, so for this he offers a credence of 50% if the war happens and 10% if it does not (as I said, his knowledge is improbably specialised)
Phyllis can forecast the course of a Russia/NATO nuclear war as well as anyone on the planet, and she offers a 30% chance of a nuclear weapon hitting London, conditional on the war taking place, and a 0.0001% chance if it does not
On the other hand, she absolutely cannot tell whether such a war will take place, and offers a credence of 50% for this question
Overall, Bert offers a credence of (basically) 10% that a nuclear weapon will hit London in the next month [1]. Phyllis, meanwhile, offers a credence of (basically) 15% [2].
Aggregating their forecasts gives a credence of 12.5% in a nuclear weapon hitting London in the next month. Most would agree that this is a terrible forecast! On the other hand, if they could talk to each other and make use of each other's specialised expertise, they could keep each person's "good" forecast and discard each person's "bad" forecast. Then they would offer a credence of about 0.015% [3]. This is enormously different to 12.5%.
An important thing to consider about these different results: the 12.5% aggregate does not reflect Phyllis or Bert being confident that nuclear weapons often hit London. Rather, it reflects the fact that each of them is highly uncertain about a crucial step in the question, and the aggregate ends up being dominated by this uncertainty. The level of uncertainty attributed to Bert is unreasonable, but even under more reasonable assumptions it remains true that if forecasters have uneven knowledge about a question, aggregating their forecasts does not make use of all of that knowledge.
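For anyone who wants to check the arithmetic, here is a minimal sketch reproducing the numbers above; the credences are the made-up ones from the example, not real forecasts:

```python
# Bert: expert on whether the war happens, clueless about whether London is hit
p_war = 0.0005                  # Bert's credence in a Russia/NATO nuclear war next month
bert_hit_given_war = 0.50
bert_hit_given_no_war = 0.10
bert = p_war * bert_hit_given_war + (1 - p_war) * bert_hit_given_no_war   # ~10%

# Phyllis: expert on how the war plays out, clueless about whether it happens
phyllis_p_war = 0.50
phyllis_hit_given_war = 0.30
phyllis_hit_given_no_war = 0.000001
phyllis = (phyllis_p_war * phyllis_hit_given_war
           + (1 - phyllis_p_war) * phyllis_hit_given_no_war)               # ~15%

naive_aggregate = (bert + phyllis) / 2                                     # 12.5%
pooled_expertise = (p_war * phyllis_hit_given_war
                    + (1 - p_war) * phyllis_hit_given_no_war)              # ~0.015%

print(f"Bert: {bert:.1%}  Phyllis: {phyllis:.1%}")
print(f"Naive aggregate: {naive_aggregate:.1%}  Pooled expertise: {pooled_expertise:.3%}")
```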
Competitive forecasters generally keep information private
If I am aiming to beat other forecasters and I have knowledge of some, but not all, aspects of a particular question, then by sharing this information I am helping my competitors perform better while not helping myself at all. While there are many forecasters who are interested in helping forecasts to be as accurate as possible, and who like to share information, the incentives of competitions don’t encourage them to do this.
Some solutions
Everyone knows everything
If both Bert and Phyllis are experts on both questions, then they can both offer good forecasts and the aggregate of their forecasts will also be very good. However, this means that Bert and Phyllis have to independently do the same research. Given that forecasting competitions benefit from aggregating at least 10 forecasts, many people might end up independently researching the same things.
A substantial amount of research into sub-questions will involve finding out key facts about the issue. There is benefit from having 10 people offer forecasts, but much less benefit from having 10 people independently go and discover the same set of facts. A competition incentivises everyone to research everything themselves.
Furthermore, this rules out effectively making use of specialised knowledge people might already have.
In contrast, a forecasting team can flexibly allocate research efforts so that multiple people do not end up doing research that only needs to be done once. Depending on the nature of the question, this could end up making much more efficient use of resources than a competitive forecast.
Ask all the sub-questions on the forecasting platform
Instead of asking only the question of interest (“What is the risk of death in the next month due to a nuclear explosion in London?”), someone running a competition could decide on a way to break the question down into sub-questions and ask all of the sub-questions. In this manner, competitors automatically share their specialised knowledge about different sub-questions.
To get an idea of how this works: Bert, knowing that he is an expert in war likelihood but not in how wars play out, is likely to contribute a forecast for the likelihood of a Russia/NATO war fairly close to his own best guess. On the other hand, he is likely to defer to the crowd on the question of a nuclear bomb hitting London, conditional on war.
This approach has an advantage over a team forecast: people with specialised knowledge can contribute to individual questions without having to be identified by team recruiters beforehand.
However, coming up with a good set of sub-questions and writing them out clearly takes a lot of work by itself. The costs of this can be substantially higher than the total prizes currently associated with tournaments, and may be comparable to the costs of subsidising trading on a single Polymarket question. Thus this approach is a kind of “hybrid crowd/consultant forecast”.
Competitions in models plus forecasts
A third possibility is to modify the competition (or market) structure into a "forecast + model" competition, in which participants are incentivised to make public models as well as public forecasts. One could imagine the result being something like a hybrid between Metaculus and Kaggle: people compete to submit accurate forecasts on individual questions, as on Metaculus, and to submit models that automatically turn those forecasts into accurate higher-level forecasts, a bit like Kaggle. The crowd of "modellers" could incentivise raw forecasts on particular questions that they think will be especially helpful for their models, which automatically accomplishes the "put all the sub-questions onto the platform" aim.
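As a rough sketch of what a submitted "model" might look like under this speculative structure - the function name, question identifiers and numbers below are all invented for illustration, not an existing platform API:

```python
from typing import Dict

def london_nuclear_death_model(crowd: Dict[str, float]) -> float:
    """Hypothetical "model" entry: turns the platform's aggregate forecasts on
    sub-questions into a top-level forecast. The modeller would be scored on
    this output, while raw forecasters are scored on the inputs."""
    return (crowd["p_russia_nato_nuclear_war"]
            * crowd["p_london_hit_given_war"]
            * crowd["p_cannot_escape_given_hit"]
            * crowd["p_death_given_hit_and_no_escape"])

# Made-up aggregate forecasts the platform might feed into the model
example_crowd = {
    "p_russia_nato_nuclear_war": 0.001,
    "p_london_hit_given_war": 0.2,
    "p_cannot_escape_given_hit": 0.5,
    "p_death_given_hit_and_no_escape": 0.6,
}
print(f"{london_nuclear_death_model(example_crowd):.4%}")  # 0.0060%
```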
This option is quite speculative, whereas other options could basically be done today.
Competitions for reputation, consultants for answers
Another hybrid model puts more emphasis on consultant forecasts and less on the crowd than the “ask all the questions” model. Here, the primary role of competitions is not to find answers to key questions, but for forecasting teams to prove their capability; highly capable teams are then contracted to produce forecasts for crucial questions.
This has a number of advantages and disadvantages compared to the model of asking all the questions; several of them are addressed in this EA forum article.
Consultants have other advantages over crowd forecasts - for example, information can be kept private.
Consulting is also enormously more lucrative than forecasting competitions. Granted, most consulting isn't about making forecasts, but the industry is so much bigger that I would guess that even the subset of consulting services for which forecasts are a major component is >10x the size of the competitive forecasting industry (65% credence). This might be because, for reasons that are not clear to me right now, the consulting model is much better at serving organisations' needs than forecasting competitions are. Because contract forecasting is closer to the consulting model, it might also realise some of these benefits.
Conclusion
The value of a forecast depends on whether it addresses an important question, and whether it is a good forecast compared to others available. It may be the case that a hybrid contract + crowd model can produce better forecasts on crucial questions than competitive crowd forecasting alone. There are two reasonable hybrid models to consider:
Consultants efficiently structure and synthesise crowd forecasts to produce answers to key questions
Competitions are for reputation, consultants are for answers
Note that the business model for Good Judgement Inc. seems to be a mixture of both of these.
[1] Bert's calculation is: 50% credence that London is hit if there is a Russia/NATO war × 0.05% chance of war, plus 10% if there is no war × 99.95% chance of no war, rounded down to 10%
[2] Phyllis' calculation is: 30% if the Russia/NATO war takes place × 50% credence in this, plus 0.0001% if the war does not take place × 50% credence in this, rounded down to 15%
[3] Calculation: 30% if the Russia/NATO war takes place × 0.05% credence, plus 0.0001% if the war does not take place × 99.95% credence, rounded down to 0.015%