Measure is unceasing

Forecasting Newsletter: August 2020.

Highlights

538 releases model of the US elections; Trump predicted to win ~30% of the time.

Study offers instructive comparison of New York covid models, finds that for the IHME model, reported death counts fell inside the 95% prediction intervals only 53% of the time.

Biggest decentralized trial to date, with 511 jurors asked to adjudicate a case coming from the Omen prediction market: “Will there be a day with at least 1000 reported corona deaths in the US in the first 14 days of July?”

Index

Sign up here or browse past newsletters here.

Prediction Markets & Forecasting Platforms

On PredictIt, presidential election prices are close to even odds, with Biden at 55 cents and Trump at 48.
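PredictIt contracts pay $1 if they resolve yes, so a price of 55 cents reads roughly as a 55% probability. But the two prices add up to 103 cents, so they can't both be taken at face value. A minimal sketch of one common fix, assuming we simply divide each price by the total (this ignores PredictIt's fees and the spread between bid and ask prices; the function name is hypothetical):

```python
def implied_probabilities(prices_in_cents):
    """Normalize contract prices so that the implied probabilities sum to 1."""
    total = sum(prices_in_cents.values())
    return {name: price / total for name, price in prices_in_cents.items()}


# Prices from the paragraph above: Biden at 55 cents, Trump at 48 cents.
probs = implied_probabilities({"Biden": 55, "Trump": 48})
for name, p in probs.items():
    print(f"{name}: {p:.1%}")
```

This puts Biden at about 53.4% and Trump at about 46.6%, which is why the market looks close to even odds.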

Good Judgment Inc. continues to provide its dashboard, and the gap between the probability superforecasters assign to a Biden win (~75%) and the one implied by Betfair (~55%) was large enough that it was worth it for me to place a small bet. At some point, Good Judgment Inc. and Cultivate Labs started a new platform on the domain covidimpacts.com, but the forecasts there seem weaker than those on Good Judgment Open.
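The size of that edge can be made concrete. A minimal sketch, assuming the superforecasters' ~75% is the true probability, that Betfair's odds correspond to paying 0.55 per $1 of payout, and ignoring Betfair's commission; the function names are hypothetical:

```python
def expected_profit_per_unit(p_true, market_price):
    """Expected profit per $1 staked on a binary outcome bought at market_price.

    A winning $1 stake returns 1 / market_price; a losing one returns nothing.
    """
    return p_true / market_price - 1


def kelly_fraction(p_true, market_price):
    """Kelly-optimal share of bankroll for a binary bet at market_price.

    Equivalent to (b*p - q) / b with net odds b = 1/market_price - 1.
    Returns 0 when there is no edge.
    """
    return max(0.0, (p_true - market_price) / (1 - market_price))


p_superforecasters = 0.75  # assumed "true" probability of a Biden win
p_betfair = 0.55           # probability implied by Betfair's odds

ev = expected_profit_per_unit(p_superforecasters, p_betfair)
kelly = kelly_fraction(p_superforecasters, p_betfair)
print(f"Expected profit per $1 staked: {ev:.2f}")
print(f"Kelly fraction of bankroll: {kelly:.2f}")
```

Under these assumptions the expected profit is about 36 cents per dollar staked, and full Kelly would suggest staking about 44% of one's bankroll, far more than a small bet; people often bet only a fraction of Kelly to allow for errors in their own probability estimate.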

Replication Markets started their COVID-19 round, and created a page with COVID-19 resources for forecasters.

Nothing much to say about Metaculus this month, but I appreciated their previously existing list of prediction resources.

Foretell has a blog, and hosted a forecasting forum which discussed

Meanwhile, Ethereum-based prediction markets such as Omen or Augur are struggling because of the rise of decentralized finance (DeFi) and the speculation and excitement around it. That activity has pushed up the gas price (transaction fees) so much that, for now, making a casual prediction is too costly.
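On Ethereum, the fee for a transaction is the gas it uses multiplied by the gas price, so when DeFi activity bids up gas prices, every interaction with a prediction market gets more expensive. A rough sketch of the arithmetic; all inputs below are hypothetical placeholders, not measured values for Omen or Augur:

```python
GWEI_PER_ETH = 10**9  # 1 ETH = 10^9 gwei


def transaction_cost_usd(gas_used, gas_price_gwei, eth_price_usd):
    """Fee = gas used * gas price, converted from gwei to ETH and then to USD."""
    fee_eth = gas_used * gas_price_gwei / GWEI_PER_ETH
    return fee_eth * eth_price_usd


# Hypothetical inputs for illustration only: a contract interaction using
# 200,000 gas, a gas price of 100 gwei, and ETH at $400.
cost = transaction_cost_usd(gas_used=200_000, gas_price_gwei=100, eth_price_usd=400)
print(f"${cost:.2f}")
```

Under these assumed numbers a single transaction costs about $8, and a trade may need more than one transaction (for example, a token approval followed by the purchase), which would eat most of the expected profit on a $10 or $20 casual bet.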

In The News

Forecasting the future of philanthropy. The American Lebanese Syrian Associated Charities is the largest healthcare-related charity in the United States, and its mission is to fund the St. Jude Children’s Research Hospital. To do this, it employs aggressive fundraising tactics, which it has modified during the current pandemic.

Case 302: the Largest Decentralized Trial of All Time. Kleros is a decentralized dispute resolution platform. “In July, Kleros had its largest trial ever where 511 jurors were drawn in the General Court to adjudicate a case coming from the Omen prediction market: Will there be a day with at least 1000 reported Corona death in the US in the first 14 days of July?” Link to the case

ExxonMobil Slashing Permian Rig Count, Forecasting Global Oil Glut Extending ‘Well into 2021’. My own interpretation is that the gargantuan multinational’s decision is an honest signal of an expected extended economic downturn.

Supply is expected to exceed demand for months, “and we anticipate it will be well into 2021 before the overhang is cleared and we returned to pre-pandemic levels,” Senior Vice President Neil Chapman said Friday during a conference call.

“Simply put, the demand destruction in the second quarter was unprecedented in the history of modern oil markets. To put it in context, absolute demand fell to levels we haven’t seen in nearly 20 years. We’ve never seen a decline with this magnitude and pace before, even relative to the historic periods of demand volatility following the global financial crisis and as far back as the 1970s oil and energy crisis.”

Even so, ExxonMobil’s Permian rig count is to be sharply lower than it was a year ago. The company had more than 50 rigs running across its Texas-New Mexico stronghold as of last fall. At the end of June it was down to 30, “and we expect to cut that number by at least half again by the end of this year,” Chapman said.

Google Cloud AI and Harvard Global Health Institute Collaborate on new COVID-19 forecasting model.

Betting markets put the probability of a UK-EU trade deal in 2020 at 66% (now 44%).

Experimental flood forecasting system didn’t help in Mumbai. The system was supposed to provide three days of advance warning, but it did not.

FiveThirtyEight covers various facets of the US elections (Biden Is Polling Better Than Clinton At Her Peak) and releases its model, along with some comments about it.

In other news, this newsletter reached 200 subscribers last week.

Hard to Categorize

Groundhog Day is a tradition in which American crowds pretend to believe that a small rat has oracular powers.

Tips for forecasting on PredictIt. These include betting against Trump voters who arrive at PredictIt from Breitbart.

Linch Zhang asks: What are some low-information priors that you find practically useful for thinking about the world?

AstraZeneca looking for a Forecasting Director (US-based).

Genetic Engineering Attribution Challenge.

NSF-funded tournament comparing human forecasters with a random forest ML model from Johns Hopkins at forecasting the probability that cancer drug trials succeed. More info here, and one can sign up here. I’ve heard rewards are generous, but they don’t seem to be specified on the webpage. Kudos to Joshua Monrad.

Results of an expert forecasting session on covid, presented by expert forecaster Juan Cambeiro.

A playlist of podcasts related to forecasting. Kudos to Michał Dubrawski.

Long Content

A case study in model failure? COVID-19 daily deaths and ICU bed utilization predictions in New York state and commentary: Individual model forecasts can be misleading, but together they are useful.

In this issue, Chin et al. compare the accuracy of four high profile models that, early during the outbreak in the US, aimed to make quantitative predictions about deaths and Intensive Care Unit (ICU) bed utilization in New York. They find that all four models, though different in approach, failed not only to accurately predict the number of deaths and ICU utilization but also to describe uncertainty appropriately, particularly during the critical early phase of the epidemic. While overcoming these methodological challenges is key, Chin et al. also call for systemic advances including improving data quality, evaluating forecasts in real-time before policy use, and developing multi-model approaches.

But what the model comparison by Chin et al. highlights is an important principle that many in the research community have understood for some time: that no single model should be used by policy makers to respond to a rapidly changing, highly uncertain epidemic, regardless of the institution or modeling group from which it comes. Due to the multiple uncertainties described above, even models using the same underlying data often have results that diverge because they have made different but reasonable assumptions about highly uncertain epidemiological parameters, and/or they use different methods.

... the rapid deployment of this approach requires pre-existing infrastructure and evaluation systems now and for improved response to future epidemics. Many models that are built to forecast on a scale useful for local decision making are complex, and can take considerable time to build and calibrate.

a group with a history of successful influenza forecasting in the US (Los Alamos National Lab (4)) was able to produce early COVID-19 forecasts and had the best coverage of uncertainty in the Chin et al. analysis (80-100% of observations fell within the 95% prediction interval for most forecasts). In contrast, the new Institute for Health Metrics and Evaluation statistical approach had low reliability; after the latest analyzed revision only 53% of reported death counts fell within the 95% prediction intervals.

The original IHME model underestimates uncertainty and 45.7% of the predictions (over 1- to 14-step-ahead predictions) made over the period March 24 to March 31 are outside the 95% PIs. In the revised model, for forecasts from April 3 to May 3 the uncertainty bounds are enlarged, and most predictions (74.0%) are within the 95% PIs, which is not surprising given the PIs are in the order of 300 to 2000 daily deaths. Yet, even with this major revision, the claimed nominal coverage of 95% well exceeds the actual coverage. On May 4, the IHME model undergoes another major revision, and the uncertainty is again dramatically reduced with the result that 47.4% of the actual daily deaths fall outside the 95% PIs, well beyond the claimed 5% nominal value.
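The coverage figures in these excerpts are the share of reported values that fell inside a model's 95% prediction intervals; a well-calibrated 95% interval should contain roughly 95% of outcomes. A minimal sketch of that check, using made-up numbers rather than actual IHME or LANL forecasts (the function name is hypothetical):

```python
def empirical_coverage(observed, lower, upper):
    """Fraction of observations that fall inside their [lower, upper] interval."""
    if not (len(observed) == len(lower) == len(upper)):
        raise ValueError("inputs must have the same length")
    inside = sum(lo <= y <= hi for y, lo, hi in zip(observed, lower, upper))
    return inside / len(observed)


# Toy daily death counts and 95% prediction intervals (not real model output).
observed = [520, 610, 700, 580, 450]
lower = [400, 500, 720, 500, 300]
upper = [600, 650, 900, 620, 420]

coverage = empirical_coverage(observed, lower, upper)
print(f"coverage = {coverage:.0%} (nominal 95%)")
```

Only 3 of the 5 toy observations land inside their intervals, so the empirical coverage (60%) falls far short of the nominal 95%, the same kind of gap the study reports for the IHME model.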

the LANL model was the only model that was found to approach the 95% nominal coverage, but unfortunately this model was unavailable at the time Governor Cuomo needed to make major policy decisions in late March 2020.

Models that are consistently poorly performing should carry less weight in shaping policy considerations. Models may be revised in the process, trying to improve performance. However, improvement of performance against retrospective data offers no guarantee for continued improvement in future predictions. Failed and recast models should not be given much weight in decision making until they have achieved a prospective track record that can instill some trust for their accuracy. Even then, real time evaluation should continue, since a model that performed well for a given period of time may fail to keep up under new circumstances.
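One simple way to act on this advice is to weight models in an ensemble by their prospective accuracy, and to give a just-revised model no weight until it builds a new track record. The sketch below uses inverse mean absolute error weights; this particular scheme is an assumption for illustration, not a method proposed by Chin et al. or the commentary, and the model names and errors are hypothetical:

```python
def inverse_error_weights(errors_by_model, eps=1e-9):
    """Weight each model by 1 / (mean absolute error of its prospective forecasts).

    errors_by_model maps a model name to absolute errors from forecasts made
    before the outcome was known. Models with no such record get weight 0.
    """
    raw = {
        name: (1.0 / (sum(errs) / len(errs) + eps)) if errs else 0.0
        for name, errs in errors_by_model.items()
    }
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no model has a prospective track record yet")
    return {name: w / total for name, w in raw.items()}


def ensemble_forecast(forecasts, weights):
    """Weighted average of the models' point forecasts."""
    return sum(weights[name] * forecasts[name] for name in forecasts)


# Hypothetical absolute errors (daily deaths) from past prospective forecasts.
past_errors = {
    "model_a": [50, 70, 60],     # mean absolute error 60
    "model_b": [150, 200, 250],  # mean absolute error 200
    "recast_model": [],          # just revised: no prospective record yet
}
weights = inverse_error_weights(past_errors)
combined = ensemble_forecast({"model_a": 500, "model_b": 800, "recast_model": 300}, weights)
print({name: round(w, 3) for name, w in weights.items()})
print(round(combined))
```

With these numbers model_a gets about 77% of the weight, model_b about 23%, and the recast model none, so the ensemble forecast lands much closer to model_a's. Real-time evaluation would mean recomputing the weights as new outcomes arrive.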

Do Prediction Markets Produce Well‐Calibrated Probability Forecasts?

Abstract: This article presents new theoretical and empirical evidence on the forecasting ability of prediction markets. We develop a model that predicts that the time until expiration of a prediction market should negatively affect the accuracy of prices as a forecasting tool in the direction of a ‘favourite/longshot bias’. That is, high‐likelihood events are underpriced, and low‐likelihood events are over‐priced. We confirm this result using a large data set of prediction market transaction prices. Prediction markets are reasonably well calibrated when time to expiration is relatively short, but prices are significantly biased for events farther in the future. When time value of money is considered, the miscalibration can be exploited to earn excess returns only when the trader has a relatively low discount rate.
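The time-value-of-money point can be made concrete. A contract that pays $1 at expiration and trades below its true probability has positive expected value, but the money is locked up until expiration, so what matters is the annualized return compared with the trader's discount rate. A minimal sketch with hypothetical numbers, ignoring fees and risk; the function names are illustrative, not from the paper:

```python
def annualized_expected_return(true_prob, price, years_to_expiry):
    """Annualized expected return from buying a $1 binary contract at `price`."""
    expected_gross = true_prob / price  # expected payoff per dollar invested
    return expected_gross ** (1 / years_to_expiry) - 1


def worth_buying(true_prob, price, years_to_expiry, discount_rate):
    """True if the annualized expected return beats the trader's discount rate."""
    return annualized_expected_return(true_prob, price, years_to_expiry) > discount_rate


# Hypothetical underpriced favourite: true probability 90%, price 85 cents, 2 years out.
r = annualized_expected_return(0.90, 0.85, 2)
print(f"annualized expected return: {r:.1%}")
print(worth_buying(0.90, 0.85, 2, discount_rate=0.02))
print(worth_buying(0.90, 0.85, 2, discount_rate=0.05))
```

This favourite returns only about 2.9% a year in expectation: worth it for a trader with a 2% discount rate, but not for one with 5%. That is the abstract's point that the miscalibration is only exploitable by traders with a relatively low discount rate.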

We confirm this prediction using a data set of actual prediction market prices from 1,787 markets, representing a total of more than 500,000 transactions.
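Calibration of this kind is usually checked by grouping prices into bins and comparing, within each bin, the average price with the fraction of events that actually happened. A minimal sketch on made-up data; it is not the paper's data set, and not necessarily the paper's exact method, which also models time to expiration:

```python
from collections import defaultdict


def calibration_table(prices, outcomes, n_bins=10):
    """Compare the mean price with the observed frequency inside each probability bin.

    prices: market prices read as probabilities in [0, 1].
    outcomes: 1 if the event happened, else 0.
    Returns rows of (bin_lower_edge, mean_price, observed_frequency, count).
    """
    bins = defaultdict(list)
    for p, y in zip(prices, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # a price of exactly 1.0 goes in the top bin
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_price = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        rows.append((idx / n_bins, mean_price, observed, len(pairs)))
    return rows


# Toy data: 20 longshots priced at 15 cents (1 happens), 20 favourites at 85 cents (19 happen).
prices = [0.15] * 20 + [0.85] * 20
outcomes = [1] + [0] * 19 + [1] * 19 + [0]

rows = calibration_table(prices, outcomes)
for lower_edge, mean_price, observed, n in rows:
    print(f"bin from {lower_edge:.1f}: mean price {mean_price:.2f}, observed {observed:.2f}, n={n}")
```

In the toy data, events priced at 15 cents happen 5% of the time and events priced at 85 cents happen 95% of the time: longshots are overpriced and favourites underpriced, the favourite/longshot pattern the abstract describes.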

Paul Christiano on learning the Prior and on better priors as a safety problem.

A presentation of radical probabilism, a theory of probability which relaxes some assumptions in classical Bayesian reasoning.
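A central tool in radical probabilism is Jeffrey conditioning (after Richard Jeffrey, who coined the term), which lets an observation shift your probabilities over a set of possibilities without making any of them certain. Ordinary Bayesian conditioning is the special case where one possibility gets probability 1. A minimal sketch with hypothetical numbers; the linked presentation goes well beyond this single rule:

```python
def jeffrey_update(likelihoods, new_partition_probs):
    """Jeffrey conditioning: P_new(A) = sum_i P(A | E_i) * P_new(E_i).

    likelihoods: P(A | E_i) for each cell E_i of a partition (held fixed).
    new_partition_probs: the revised probabilities P_new(E_i), which must sum to 1.
    """
    total = sum(new_partition_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("new partition probabilities must sum to 1")
    return sum(likelihoods[e] * new_partition_probs[e] for e in new_partition_probs)


# Hypothetical numbers: probability of rain given the state of the sky.
p_rain_given = {"cloudy": 0.6, "clear": 0.1}
# With the original probabilities this is just the law of total probability.
prior = jeffrey_update(p_rain_given, {"cloudy": 0.3, "clear": 0.7})
# A dim glance at the sky raises P(cloudy) to 0.7 without making it certain.
posterior = jeffrey_update(p_rain_given, {"cloudy": 0.7, "clear": 0.3})
print(f"P(rain) before: {prior:.2f}, after: {posterior:.2f}")
```

The uncertain glance moves P(rain) from 0.25 to 0.45 even though no proposition was learned with certainty.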

Forecasting Thread: AI timelines, which asks for (quantitative) forecasts until human-machine parity. Some of the answers seem insane or suspicious, in that they have very narrow tails, sharp spikes, and don’t really update on the fact that other people disagree with them.


Note to the future: All links are added automatically to the Internet Archive. In case of link rot, go there and input the dead link.


We hope that people will pressure each other into operationalizing their [big picture outlooks]. If we have no way of proving you wrong, we have no way of proving you right. We need falsifiable forecasts.

Source: Foretell Forecasting Forum. Inexact quote.