Measure is unceasing

Relative Impact of the First 10 EA Forum Prize Winners

Summary

Introduction

The EA Forum, along with local groups, has seen a decent number of projects, but few are evaluated for impact. This makes it difficult to choose between projects beforehand, beyond using personal intuition (however good it might be), a connection to a broader research agenda, or other rough heuristics. Ideally, we would have something more objective and more scalable.

As part of QURI’s efforts to evaluate and estimate the impact of things in general, and projects QURI itself might carry out in particular, I tried to evaluate the impact of 10 projects I expected to be fairly valuable.

Methodology

I chose to evaluate the first 10 posts which won the EA Forum Prize, back in 2017 and 2018. The estimate for each post has a structure like the one below. Note that not all estimates will have each element:

Title of the post

If a writeup refers to a project distinct from the writeup, I generally try to estimate the impact of both the project and the writeup.

Where possible, I estimated their impact in an ad-hoc scale, Quality Adjusted Research Papers (QARPs for short), whose levels correspond to the following:

| Value | Description | Example |
| --- | --- | --- |
| ~0.1 mQARPs | A thoughtful comment | A thoughtful comment about the details of setting up a charity |
| ~1 mQARPs | A good blog post, a particularly good comment | What considerations influence whether I have more influence over short or long timelines? |
| ~10 mQARPs | An excellent blog post | Humans Who Are Not Concentrating Are Not General Intelligences |
| ~100 mQARPs | A fairly valuable paper | Categorizing Variants of Goodhart’s Law |
| ~1 QARPs | A particularly valuable paper | The Vulnerable World Hypothesis |
| ~10-100 QARPs | A research agenda | The Global Priorities Institute’s Research Agenda |
| ~100-1000+ QARPs | A foundational popular book on a valuable topic | Superintelligence; Thinking, Fast and Slow |
| ~1000+ QARPs | A foundational research work | Shannon’s “A Mathematical Theory of Communication” |

Ideally, this would both have relative meaning (i.e., I claim that an average thoughtful comment is worth less than an average good post), and absolute meaning (i.e., after thinking about it, a factor of 10x between an average thoughtful comment and an average good post seems roughly right). In practice, the second part is a work in progress. In an ideal world, this estimate would be cause-independent, but cause comparability is not a solved problem, and in practice the scale is more aimed towards long-term focused projects.

To elaborate on cause independence: upon reflection, we may find that a fairly valuable paper on AI Alignment is 20 times as valuable as a fairly valuable paper on Food Security, and give both of their impacts in a common unit. But we are uncertain about their actual relative impacts, which depend not only on empirical uncertainty, but also on moral preferences and values (e.g., weight given to animals, weight given to people who currently don’t exist, etc.). To get around this, I just estimated how valuable a project is within its field, leaving the work of categorizing and comparing fields as a separate endeavor: I don’t adjust impact for different causes, as long as it’s an established Effective Altruist cause.

Some projects don’t easily lend themselves to being rated in QARPs; in those cases I’ve used “dollars moved” instead. Impact is adjusted for Shapley values, which avoids double- or triple-counting impact. In every example here, this is equivalent to calculating the counterfactual value and dividing it by the number of necessary stakeholders. This requires a judgment call about who counts as a “necessary stakeholder”. Intervals are meant to be 80% confidence intervals, but in general all estimates are highly speculative and shouldn’t be taken too seriously.
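To make the Shapley adjustment concrete, here is a minimal sketch, not the procedure actually used for the estimates in this post. It computes exact Shapley values by averaging marginal contributions over every ordering of the stakeholders, for a hypothetical project that is worth 90 mQARPs only if all three stakeholders take part. The stakeholder names, the 90 mQARPs figure and the function names are illustrative assumptions.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical stakeholders, all of whom are necessary for the project to happen.
players = ["author", "funder", "host"]


def project_value(coalition):
    """Assumed value function: 90 mQARPs if everyone participates, nothing otherwise."""
    return 90.0 if coalition == frozenset(players) else 0.0


shares = shapley_values(players, project_value)

# Counterfactual value of one stakeholder: value with everyone minus value without them.
everyone = frozenset(players)
counterfactual = project_value(everyone) - project_value(everyone - {"author"})

print(shares)
print(counterfactual / len(players))
```

When every stakeholder is necessary, each one's counterfactual value equals the whole project, so adding up counterfactuals would count the project three times. The Shapley shares sum to the project's value instead, and each share matches the counterfactual value divided by the number of necessary stakeholders.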

Estimates

2017 Donor Lottery Report

Total project impact:

Counterfactual impact of Adam Gleave winning the donor lottery (as opposed to other winners):

Impact of the writeup alone:

Takeaways from EAF’s Hiring Round.

Impact of the hiring round itself:

When reviewing this section, some commenters pointed out that, for them, calculating the opportunity cost didn’t make as much sense. I disagree with that. Further, I’m also not attempting to calculate the expected value ex ante; in this case this feels inelegant because the expected value will depend a whole bunch on the information, accuracy and calibration of the person doing the expected value calculation, and I don’t want to estimate how accurate or calibrated the piece’s author was at the time (though he is pretty good now).

Impact of the writeup (as opposed to impact of the hiring process):

Why we have over-rated Cool Earth

Impact of the post and the research:

Lessons Learned from a Prospective Alternative Meat Startup Team

Expected impact of the project:

Impact of the project:

Impact of the writeup:

2018 AI Alignment Literature Review and Charity Comparison

Cause profile: mental health

EA Giving Tuesday Donation Matching Initiative 2018 Retrospective

EA Survey 2018 Series: Cause Selection

Impact of the post alone:

EAGx Boston 2018 Postmortem

Impact of the EAGx:

Impact of the writeup:

Will companies meet their animal welfare commitments?

Table

| Project | Ballpark | Estimate |
| --- | --- | --- |
| 2017 Donor Lottery Grant | Between 6 fairly valuable papers and 3 really good ones | 500 mQARPs to 4 QARPs |
| Adam Gleave winning the 2017 donor lottery (as opposed to other participants) | Roughly as valuable as a fairly valuable paper | -50 mQARPs to 500 mQARPs |
| 2017 Donor Lottery Report (Writeup) | A little less valuable than a fairly valuable paper | 50 mQARPs to 250 mQARPs |
| EAF’s Hiring Round | Loss of between one fairly valuable paper to an excellent EA forum blog post. | -70 to 5 mQARPs |
| Takeaways from EAF’s Hiring Round (Writeup) | Between two good EA forum posts to a fairly valuable paper | 0 to 30 mQARPs |
| Why we have over-rated Cool Earth | -½ excellent EA forum post to +1.5 excellent EA forum posts | -5 to 20 mQARPs |
| Alternative Meat Startup Team (Project) | 0 to 1 excellent EA Forum posts. | 1 to 50 mQARPs |
| Lessons Learned from a Prospective Alternative Meat Startup Team (Writeup) | 0 to 5 good EA forum posts | 0 to 20 mQARPs |
| 2018 AI Alignment Literature Review and Charity Comparison | Between two excellent EA forum posts and 6 fairly valuable papers | 40 to 800 mQARPs |
| Cause profile: mental health | Very uncertain | 0 to 100 mQARPs |
| EA Giving Tuesday Donation Matching Initiative 2018 | | $130K to $230K in Shapley-adjusted funding towards EA charities |
| EA Survey 2018 Series: Cause Selection | 0 to an excellent EA forum post. | 0 to 20 mQARPs |
| EAGx Boston 2018 (Event) | | $100 to $350K in Shapley-adjusted funding towards EA charities |
| EAGx Boston 2018 Postmortem (Writeup) | | $0 to $500 in Shapley-adjusted donations towards EA charities |
| Will companies meet their animal welfare commitments? | 0 to a fairly valuable paper | 0 to 100 mQARPs |
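The intervals in the table are described earlier as 80% confidence intervals. One way to work with such an interval, sketched below, is to fit a distribution to it. The choice of a normal distribution is purely an assumption made for illustration (it allows negative values, which some rows need), not something the estimates themselves specify, and the helper name `normal_from_80ci` is hypothetical.

```python
from statistics import NormalDist


def normal_from_80ci(low, high):
    """Return the normal distribution whose 10th and 90th percentiles are low and high."""
    z90 = NormalDist().inv_cdf(0.9)  # z-score of the 90th percentile, roughly 1.28
    mu = (low + high) / 2
    sigma = (high - low) / (2 * z90)
    return NormalDist(mu, sigma)


# The "Why we have over-rated Cool Earth" row: -5 to 20 mQARPs.
dist = normal_from_80ci(-5, 20)

# Probability of net-negative impact under the assumed normal shape.
p_negative = dist.cdf(0)
print(dist.mean, dist.stdev, p_negative)
```

Under this assumed shape, the fitted distribution puts 10% of its mass below the lower bound and 10% above the upper bound. A different shape, for example a skewed one, would give different tail probabilities for the same interval.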

Comments and thoughts

Calibration

An initial challenge in this domain relates to how to attain calibration. The way I would normally calibrate intuitions on a domain is by making a number of predictions at various levels of gut feeling, and then seeing empirically how frequently predictions made at different levels of gut feeling come out right. For example, I’ve previously found that my gut feeling of “I would be very surprised if this was false” generally corresponds to 95% (so 1 in 20 times, I am in fact wrong). But in this case, when considering or creating a new domain, I can’t actually check my predictions directly against reality, but instead have to check them against other people’s intuitions.
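The usual calibration check described above can be sketched as follows: group past predictions by the probability stated at the time, then compare each stated level with the observed frequency of being right. The track record below is made up for illustration, and `calibration_table` is a hypothetical helper name.

```python
from collections import defaultdict


def calibration_table(predictions):
    """Map each stated probability to the observed fraction of predictions that came true."""
    buckets = defaultdict(list)
    for stated, came_true in predictions:
        buckets[stated].append(came_true)
    return {
        stated: sum(outcomes) / len(outcomes)
        for stated, outcomes in sorted(buckets.items())
    }


# Made-up track record: 20 predictions at 95% (19 right), 10 at 70% (6 right).
predictions = (
    [(0.95, True)] * 19 + [(0.95, False)]
    + [(0.7, True)] * 6 + [(0.7, False)] * 4
)

table = calibration_table(predictions)
for stated, observed in table.items():
    print(f"stated {stated:.0%}, observed {observed:.0%}")
```

In this made-up record the 95% bucket is well calibrated, while the 70% bucket is somewhat overconfident. The difficulty raised in this section is that impact estimates offer no such resolved outcomes to fill the `came_true` column.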

Comparison is still possible

Despite my wide levels of uncertainty, comparison is still possible. Even though I’m uncertain about the impact of both “Will companies meet their animal welfare commitments?” and “Lessons Learned from a Prospective Alternative Meat Startup Team”, I’d prefer to have the first over the second.

Similarly, while EAGx Boston 2018 and the EA Giving Tuesday Donation Matching Initiative might have taken similar amounts of time to organize, by comparably capable people, I prefer the second. This is in large part because EAGx events are scalable, whereas Giving Tuesdays are not.

I was also surprised by the high cost of producing papers when estimating the value of Larks’ review (though perhaps I shouldn’t have been). It could be the case that this was a problem with my estimates, or that papers truly are terribly inefficient. 

Future ideas

Ozzie Gooen has in the past suggested that one could build a consensus around these kinds of estimates, and scale them further. In addition, one could also use these kinds of estimates to choose one’s own projects, or to recommend projects to others, and see how that fares. Note that, in principle, these kinds of estimates don’t have to be perfect or perfectly calibrated; they just have to be better than the implicit estimates which would otherwise have been made.

In any case, there are also details to figure out or justify. For example, I’ve been using Shapley values, which I think are a more complicated but often more appropriate alternative to counterfactual values. Normally, this just means that I divide the total estimated impact by the estimated number of stakeholders. But sometimes, as in the case of a hiring round, I have the intuition that one might want to penalize the hiring organization for the opportunity cost borne by applicants, even though that’s not what Shapley values recommend. Further, it’s also sometimes not clear how many necessary stakeholders there are, or how important each stakeholder is, which makes the Shapley value ambiguous, or subject to a judgment call.

I’ve also been using a cause-impartial value function. That is, I judge a post in the animal welfare space using the same units as for a post in the long-termist space. But maybe it’s a better idea to have a different scale for each cause area, and then have a conversion factor which depends on the reader’s specific values. If I continue working on this idea, I will probably go in that direction. 
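A per-cause scale with reader-specific conversion factors could look roughly like the sketch below. Every number, cause label and name in it is hypothetical; it only shows the mechanics of converting within-cause estimates into a common unit for a given reader.

```python
def in_reader_units(estimates, conversion):
    """Scale each within-cause estimate by the reader's conversion factor for that cause."""
    return {
        project: value * conversion[cause]
        for project, (cause, value) in estimates.items()
    }


# Hypothetical within-cause point estimates, in mQARPs on each cause's own scale.
estimates = {
    "animal welfare post": ("animal_welfare", 50.0),
    "longtermist post": ("longtermism", 200.0),
}

# Hypothetical conversion factors for two readers with different values.
reader_a = {"animal_welfare": 1.0, "longtermism": 1.0}
reader_b = {"animal_welfare": 10.0, "longtermism": 1.0}

for name, conversion in [("reader A", reader_a), ("reader B", reader_b)]:
    converted = in_reader_units(estimates, conversion)
    best = max(converted, key=converted.get)
    print(name, converted, "prefers:", best)
```

The within-cause estimates stay fixed while the ranking across causes changes with the reader's factors, which keeps the empirical estimate separate from the moral weights.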

Lastly, besides total impact, we also care about efficiency. For small and medium projects, I think that the most important kind of efficiency might be time efficiency. For example, when choosing between a project worth 100 mQARPs and one which is worth 10 mQARPs, one would also have to look at how long each takes, because maybe one can do 50 projects each worth 10 mQARPs in the time it takes to do a very elaborate 100 mQARPs project. 
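The comparison above can be written out as value per hour. The hour figures below are assumptions chosen so that 50 small projects fit in the time of one elaborate project, as in the example; they are not measurements.

```python
def value_per_hour(value_mqarps, hours):
    """Time efficiency of a project, in mQARPs per hour of work."""
    return value_mqarps / hours


# Assumed durations: the elaborate project takes as long as 50 small ones.
elaborate = {"value_mqarps": 100.0, "hours": 500.0}
small = {"value_mqarps": 10.0, "hours": 10.0}

elaborate_rate = value_per_hour(**elaborate)
small_rate = value_per_hour(**small)

# Total value from filling the elaborate project's time with small projects instead.
n_small = elaborate["hours"] / small["hours"]
total_from_small = n_small * small["value_mqarps"]

print(elaborate_rate, small_rate, total_from_small)
```

Under these assumed durations, the small projects deliver five times the value per hour, so the same time spent on them yields 500 mQARPs rather than 100.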

Thanks to David Manheim, Ozzie Gooen and Peter Hurford for thoughts, comments and suggestions.