Friday, 24 October 2014

Should we use Bayes' Theorem to do History?

It's simpler than it looks!  Image from Richard Carrier's website.

I think so.

I'm going to try my hand at writing this discussion up as a dialogue...

What is Bayes' Theorem?

Bayes' Theorem is a formula for calculating logically how well a theory is supported by the evidence.  It works by multiplying and dividing probabilities.  It is explained here and here.  You can see how I used it in all seriousness to analyse the verdict in the case of the Lockerbie Bombing or for fun in the case of Twelve Angry Men.  The Bayesian method is also used by Richard Carrier in his proof (I consider it proven!) of the non-existence of Jesus: I got the method from him and my discussion is indebted to his writing.

The basic idea is simple:
  • Look at the historical evidence that has (and has not) been discovered.
  • Consider: how likely was it that all this evidence we see would be the outcome of historical events if your theory about what happened is true?
  • Now also consider:  how likely was it that all this evidence we see would be the outcome of historical events if your theory about what happened is not true?
  • Now put those two probabilities into a ratio like 2:1 and you have the odds in favour of your theory about what happened being true, as judged from this evidence alone.  These odds express the 'conditional probability' of your theory being true.
So fundamentally, Bayes' Theorem is useful for working out how well the evidence supports a theory in comparison with other theories.  Those theories are competing explanations of what caused the evidence to exist.  The explicit comparison of theories helps you to avoid a common mistake in historical reasoning, i.e. seeking evidence that seems to confirm your pet theory while not giving alternative theories adequate consideration.
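As a minimal sketch of this simple version, the two estimates can be turned into a ratio and then into a probability.  The function names and the example figures are illustrative assumptions, not numbers from any real case, and the conversion assumes that neither theory starts out more probable than the other.

```python
def likelihood_ratio(p_evidence_if_true, p_evidence_if_false):
    """How many times more expected the evidence is if the theory is true."""
    return p_evidence_if_true / p_evidence_if_false


def odds_to_probability(odds):
    """Convert odds of x:1 into a probability between 0 and 1."""
    return odds / (1 + odds)


# Hypothetical estimates: the evidence is 80% expected if the theory is true
# and 40% expected if the theory is not true.
ratio = likelihood_ratio(0.8, 0.4)
probability = odds_to_probability(ratio)
print(f"{ratio:.0f}:1 in favour, i.e. about {probability:.0%}")
```

Note that only the ratio between the two estimates matters here: 8% against 4% would give the same 2:1.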

What about the more complex version?

There's another key factor to explain: 'prior probability'.

This is the probability we estimate of a theory being true, even before considering the specific, detailed evidence.  That is, how probable we think it is based just on what sort of thing it proposes.  Prior probability is the reason why 'extraordinary claims require extraordinary evidence'.  In other words, the less inherently probable a claim or theory, the stronger is the specific evidence required to overcome this inherent improbability.  The existence of magic, for example, would require extremely strong evidence to overcome the initial improbability of phenomena existing that violate known laws of physics.

Thinking in historical terms, a theory that proposes that Martin Luther wrote friendly letters to the Pope has a low prior probability, in other words is inherently unlikely, because it goes against everything we expect based on our general knowledge about Luther.  Historians would demand extremely strong evidence before accepting this theory.  On the other hand, a theory that proposes that Luther sometimes caught a cold has a very high prior probability, since it is just the sort of thing that we know tends to happen to people, Luther included, based on our general knowledge.  We would not need much evidence at all to accept this theory.

So any theory has to be given both a prior probability based on general knowledge, and a conditional probability based on the specific evidence of the case in hand.

Can you show me how the formula represents this method of reasoning?

I sure can:

I tweaked this image from first publication.
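For readers who cannot see the image, a standard long-form statement of the theorem is written out below.  The notation is an assumption rather than a transcription of the picture: h is the theory, \neg h is its negation (~h), e is the specific evidence and b is our background knowledge.

```latex
P(h \mid e, b) =
  \frac{P(h \mid b)\, P(e \mid h, b)}
       {P(h \mid b)\, P(e \mid h, b) + P(\neg h \mid b)\, P(e \mid \neg h, b)}

% Equivalent odds form: posterior odds = prior odds x ratio of the
% probabilities of the evidence on each theory.
\frac{P(h \mid e, b)}{P(\neg h \mid e, b)} =
  \frac{P(h \mid b)}{P(\neg h \mid b)} \times
  \frac{P(e \mid h, b)}{P(e \mid \neg h, b)}
```

P(h | b) is the prior probability of the theory, and P(e | h, b) and P(e | \neg h, b) are the probabilities of the evidence if the theory is true and if it is not true.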

Why should I concern myself with how likely the evidence was to be the outcome of events if my theory was not true?

You need to be aware that the evidence that makes sense on your theory might also make sense on different theories too.  For example, the evidence of a broken window on your house might suggest your house has been burgled.  The evidence makes sense on your theory.  It matches.  But, of course, it might just be that some kids kicked a football against the window.  So just because your theory explains the evidence, doesn't mean it is the only or the best explanation.  That's why Bayes' Theorem considers how likely the evidence was, even if your theory was wrong.
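The broken-window example can be sketched with numbers.  Every figure below is invented purely for illustration, as is the function name; the point is only that evidence which is also fairly likely without a burglary moves the probability much less than evidence that would be nearly impossible without one.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' Theorem for one theory against its negation."""
    true_branch = prior * p_evidence_if_true
    false_branch = (1 - prior) * p_evidence_if_false
    return true_branch / (true_branch + false_branch)


# All numbers are hypothetical.
prior_burglary = 0.5           # indifferent prior, assumed for simplicity
p_window_if_burglary = 0.3     # chance of a broken window if burgled
p_window_if_football = 0.2     # chance of a broken window if not burgled

weak = posterior(prior_burglary, p_window_if_burglary, p_window_if_football)

# The same evidence, if it were almost impossible without a burglary:
strong = posterior(prior_burglary, p_window_if_burglary, 0.003)

print(f"window also explained by football: {weak:.0%}")
print(f"window hardly explicable otherwise: {strong:.0%}")
```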

How do you take into account all the different pieces of evidence?

We estimate the probability of each piece of evidence existing on a given theory, then multiply all those probabilities together for an overview of the probability of all the evidence existing on that theory.
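Here is a rough sketch of that multiplication.  The three pairs of estimates are hypothetical, and the sketch assumes for simplicity that the pieces of evidence are independent of one another once the theory is taken as given; where they are not, each estimate has to be made in light of the evidence already considered.

```python
from math import prod

# Hypothetical estimates for three pieces of evidence:
# (probability if the theory is true, probability if the theory is not true)
evidence = [
    (0.9, 0.3),
    (0.5, 0.4),
    (0.2, 0.6),  # this piece counts against the theory
]

# Multiply down each column to get the probability of ALL the evidence
# on each theory.
p_all_if_true = prod(p_true for p_true, _ in evidence)
p_all_if_false = prod(p_false for _, p_false in evidence)
combined_ratio = p_all_if_true / p_all_if_false

print(f"all evidence if theory true:     {p_all_if_true:.3f}")
print(f"all evidence if theory not true: {p_all_if_false:.3f}")
print(f"combined ratio: {combined_ratio:.2f}:1")
```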

Don't historians already have a logical method for considering the merits of different theories in that manner: Argument to the Best Explanation (ABE)?

They do.  It's laid out here.  The thing is, when you analyse it, this method is completely represented by Bayes' Theorem—and improved too!

Let's see how this is so:

  • 'The statement, together with other statements already held to be true, must imply yet other statements describing present, observable data.'
    This is what Bayes' Theorem is all about: using evidence to assess theories.
  • 'The hypothesis must be of greater explanatory scope than any other incompatible hypothesis about the same subject; that is, it must imply a greater variety of observation statements.'
    The more evidence that is explained only by your theory, the higher its ratio of probability will turn out.
  • 'The hypothesis must be of greater explanatory power than any other incompatible hypothesis about the same subject; that is, it must make the observation statements it implies more probable than any other.'
    The more probable the evidence is on your theory than on another, the higher your theory's probability will turn out, again.
  • 'The hypothesis must be more plausible than any other incompatible hypothesis about the same subject; that is, it must be implied to some degree by a greater variety of accepted truths than any other, and be implied more strongly than any other; and its probable negation must be implied by fewer beliefs, and implied less strongly than any other.'
    This is synonymous with requiring a high prior probability.
  • 'The hypothesis must be less ad hoc than any other incompatible hypothesis about the same subject; that is, it must include fewer new suppositions about the past which are not already implied to some extent by existing beliefs.'
    The more ad hoc assumptions you have to make to keep your theory alive, the less probable it will turn out, because each uncertain assumption you add to your theory reduces the theory's probability.  It's just the 'and' rule of multiplying probabilities, and is accounted for by a reduced prior probability.
  • 'It must be disconfirmed by fewer accepted beliefs than any other incompatible hypothesis about the same subject; that is, when conjoined with accepted truths it must imply fewer observation statements and other statements which are believed to be false.'
    This will also be accounted for by the prior probability, or plausibility, of your theory.
  • 'It must exceed other incompatible hypotheses about the same subject by so much, in characteristics 2 to 6, that there is little chance of an incompatible hypothesis, after further investigation, soon exceeding it in these respects.'
    This is represented by the relative consequent probabilities of the various theories: is your theory by far and away the most likely explanation of the evidence?
So, you see, historians who use a logical method of historical reasoning already use Bayesian logic without realising.  The only bit they don't do is the arithmetic.

If ABE logically reduces to Bayes' Theorem, then why bother using the theorem?

If you do the verbal reasoning logically, but don't assign probabilities quantitatively and do the maths, then you risk failing to combine logically the results of your consideration of each separate piece of evidence.  You might be biased by the tendency of the evidence you considered first.  Or you might allow one piece of evidence to overrule another when their relative strengths do not justify this.  You might just fail to take all the evidence into account, especially weak pieces of evidence that nevertheless multiply up to strong evidence when taken together.

Since you are dealing with probabilities anyway, you need to use the logic of probabilities.  In the words of this article defending the use of Bayes' Theorem in court:
Bayes theorem is a basic rule, akin to any other proven maths theorem, for updating the probability of a hypothesis given evidence. Probabilities are either combined by this rule, or they are combined wrongly.
So to refuse to use Bayesian reasoning is a refusal to think logically.

Aren't your prior and conditional probability estimates just subjective opinions?

They may indeed be.  If you do not have lots of objective data for making probability estimates, then your conclusions will unavoidably be unreliable and unscientific.  But this is not the fault of Bayes' Theorem or Bayesian reasoning.  Subjectivity and uncertainty will mar the results of Plain English verbal reasoning just as badly.  Plus verbal reasoning has the added disadvantage of forgoing a logical method for combining your consideration of all the evidence.

How do you make probability estimates about historical events?

Good question!  It's a lot harder than estimating the probability of getting a 6 off the roll of a die!

Let's say you want to know the probability that a certain general rose through the ranks of the Spartan army.  You could consider all the Spartan generals whose biographies you know, and calculate what proportion of them rose through the ranks.  That might give you a first approximation of the prior probability of this having happened to the general you are interested in.

If you are unable to come up with a convincing, objective estimate using such methods, even when you consider everything you know and all the evidence, then you will just have to accept that you will never have an objective, scientific estimate of the strength of your theory that your general did or did not rise through the ranks.

My point is, it isn't a valid objection to using Bayes' Theorem to say the data aren't scientific.  Junk in, junk out!

How can I responsibly use Bayes' Theorem if I don't have good data?

It would be a problem, if people looked at your use of numbers and assumed your results were scientific.  Bayes' Theorem might look 'sciencey' to people who are unfamiliar with it.  So you need to warn them that your results are only as objective and scientific as the data you used to make your estimates.

If you are unable to come up with defensible estimates of exact probabilities, you might at least be able to come up with reasonable maximum and/or minimum values.  This is what Richard Carrier does in his book on Jesus: he uses maximum values that allow Jesus' existence to be as likely as he thinks is reasonably feasible—then still finds his existence highly unlikely.  That way, his argument is proven true a fortiori.  In other words, his theory is at least as likely as he estimates it minimally to be.
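The idea of arguing from maximum and minimum values can be sketched as follows.  The ranges are invented for illustration and are not figures from Carrier's book; the function name is likewise hypothetical.  Because the result rises with both the prior and the ratio, feeding in the least favourable ends of both ranges gives a floor for the theory's probability.

```python
def posterior_from_ratio(prior, ratio):
    """Posterior probability from a prior and a likelihood ratio (for:against)."""
    odds = prior / (1 - prior) * ratio
    return odds / (1 + odds)


# Hypothetical ranges of defensible estimates.
prior_low, prior_high = 0.3, 0.5    # range for the prior probability
ratio_low, ratio_high = 4.0, 10.0   # range for the ratio from the evidence

least_favourable = posterior_from_ratio(prior_low, ratio_low)
most_favourable = posterior_from_ratio(prior_high, ratio_high)

print(f"theory is between {least_favourable:.0%} and {most_favourable:.0%} likely")
```

If the theory still comes out as probable on the least favourable estimates, the conclusion holds a fortiori for every other estimate within the ranges.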

Doesn't multiplying probabilities together spoil the historian's sense of how the evidence fits together as a whole?


Multiplying probabilities is the logical way to combine the evidence.  The part where you get to come up with a sense of the evidence as a whole is when you contrive your theory for explaining it all.  Then you get to test your theory against the evidence using a logically valid method.

You're boring me.  Sum up.

So, historians thinking logically are already using Bayesian reasoning.  They are already considering how well different theories explain the evidence.  What they are not doing is using maths to combine the evidence together logically.

Assigning probability values is often subjective and unscientific.  But forcing yourself to try to assign such values will be helpful in exposing to view just how subjective and unscientific your assumptions are.  It will help you to see what data you need to search for to make your premises more objective.  It should also push you to use maximum and minimum estimates so that your conclusions show the range of likelihoods that your theory might have.

As long as you remember that the validity of your results still depends on the objectivity of your data and the logic of your reasoning about probabilities, and that a 'sciencey' formula won't do the hard work for you, you can only make your conclusions more logical by using Bayes' Theorem!

Thursday, 23 October 2014

Did the 12 Angry Men let a murderer go free?


A couple of years ago AVclub.com published an article about the film 12 Angry Men.  It argued that juror no. 8, who convinces the other, initially pro-guilt, jurors to vote not-guilty in a murder case, persuaded them by fallacious logic and ensured they came to the wrong verdict:
Rose [Reginald, the screenwriter], an expert at dramatic construction, has his hero, Juror No. 8 (Fonda in the movie), undermine each of these pieces of evidence individually, assisted along the way by those who’ve defected to the Not Guilty camp...

None of this ultimately matters, however, because determining whether a defendant should be convicted or acquitted isn’t—or at least shouldn’t be—a matter of examining each piece of evidence in a vacuum. “Well, there’s some bit of doubt attached to all of them, so I guess that adds up to reasonable doubt.” No. What ensures The Kid’s guilt for practical purposes, though neither the prosecutor nor any of the jurors ever mentions it (and Rose apparently never considered it), is the sheer improbability that all the evidence is erroneous. You’d have to be the jurisprudential inverse of a national lottery winner to face so many apparently damning coincidences and misidentifications. Or you’d have to be framed... But there’s no reason offered in 12 Angry Men for why, say, the police would be planting switchblades.
We know what the logic is for combining separate items of probabilistic evidence into an overall estimate of the probability of guilt: Bayes' Theorem.  It's explained here and here.  I've used it a couple of times, to analyse the Lockerbie and Pistorius cases.  In the words of this article defending the use of Bayes' Theorem in court:
Bayes theorem is a basic rule, akin to any other proven maths theorem, for updating the probability of a hypothesis given evidence. Probabilities are either combined by this rule, or they are combined wrongly.
We have a theory to test: the defendant is guilty.  Call this theory h for 'hypothesis'.

What we do, for each item of evidence, is to estimate the probability of the evidence being the outcome of events if h is true.

Then we estimate the probability of the same item of evidence being the outcome of events if h is not true (i.e. if ~h is true).

We put the probability of each piece of evidence on h and ~h into a ratio.

We multiply the ratios together into a total conditional probability ratio for each theory, guilty and not-guilty.

Voilà, in that ratio we have the estimated probability of the defendant being guilty.

(For present purposes, I will ignore 'prior probability'.  This is the probability we estimate of a theory being true, even before considering the specific, detailed evidence, based just on what sort of thing it proposes.  Prior probability is the reason why 'extraordinary claims require extraordinary evidence': the less inherently probable a claim or theory, the better specific evidence is required to overcome this inherent improbability.  The existence of magic, for example, would require extremely strong evidence to overcome the initial improbability of phenomena existing that violate known laws of physics.  Since we do not know anything about the world of the film, e.g. how many murder defendants brought to trial are in fact guilty, I will assume 'indifferent' priors, i.e. a 50% chance of guilt.)


The evidence (taken from the screenplay)

  1.  The old man in the apartment below the crime-scene heard loud noises through his open window at 12:10 a.m. that he said sounded like a fight.  He heard the defendant shout, 'I'm gonna kill you', then heard a body fall.  He ran to look outside and saw the defendant running down the stairs and away.  He called the police, who found the defendant's father knifed to death with the knife in his chest.  The old man picked out the defendant's voice by hearing alone from among four others in court.  He knows the defendant well.  However, juror 8 proposes that the El train was roaring by as the murder took place (as per point 6), and so the old man could not have heard, or heard clearly, what was going on upstairs.  He came into court in dilapidated clothes, and appeared to juror 9 to be hiding his limp out of shame; juror 9 suggested he exaggerated his testimony for the sake of having a moment in the limelight.
  2. Juror 8 simulates the old man limping from his bed to the window, taking 42 seconds to do so, suggesting that if he heard the body fall while in bed then heard the murderer running down the stairs 15 seconds later, then he could not in fact have seen the murderer out the window.
  3. The coroner determined the time of death as around 12 a.m.
  4. The defendant claimed to have been at the cinema at 12 a.m., yet failed to remember what films he saw.
  5. There is no witness to the defendant entering or exiting the cinema.
  6. The woman across the street looked through the window onto the crime scene; she said she saw the defendant stab his father to death sixty feet away just as she looked out.  However, she only saw the vital moments through the windows of a passing, darkened El train.  Famously, juror 8 recalls that she had indentations on her nose due to habitually wearing glasses; he argues that if she saw the murder just as she looked out the window while tossing and turning in bed, then she could not have been wearing her glasses.
  7. There were witnesses by hearing to the defendant and his father arguing at 8 p.m.  They heard the father hit the boy twice, and saw the boy walk out of the building in an angry mood.
  8. The defendant has been regularly beaten by his father growing up.
  9. The defendant has several violent crimes on his record (is the jury allowed to know this?)
  10. The murder weapon was a distinctive kind of knife known to be owned by the defendant.  He bought it shortly after leaving the house, witnessed by the shopkeeper.  Witnesses saw it in his possession at 9:45 p.m.  The defendant arrived home at 10 p.m.  He claims his knife slipped through a hole in his pocket between then and returning from the cinema at 3:15 a.m.  The 8th juror shows the others that he has procured exactly the same sort of knife for himself from a shop near the crime-scene, showing that it is not unique or unavailable (surely grounds for a mistrial, as the jury is considering evidence not presented in court?)
  11. Juror 5 says that people handy with switch-blades, like the defendant, would stab with an underhand grip, but the victim was stabbed overhand, to judge from the coroner's assessment of the wound.
  12. The father was a tough man and compulsive gambler, known for a propensity to get into bar-fights, particularly over women.
  13. The defendant returned home at 3:15 a.m., where he was arrested by police.

Probability analysis

My estimates of probabilities are just that: my subjective ideas about what is likely or unlikely.  This is hardly scientific, since we do not have the objective data for that.  It is fair to criticise Bayesian reasoning for the uncertainty and subjectivity of the estimates used, as long as the critic understands that we cannot escape these problems simply by forswearing the use of numbers and going back to vague words.  If subjectivity causes a problem for Bayesian reasoning, then it will cause the same problem for reasoning from evidence in general, including the sort of reasoning that jurors have to perform.  On the positive side, using Bayes' Theorem will at least ensure that, whatever estimates are made regarding the separate pieces of evidence, they are logically combined into a view of the evidence as a whole.  It should be noted, then, that conclusions derived by Bayesian reasoning are not better than verbal reasoning by virtue of a more scientific appreciation of the premises, but may be more logical in drawing conclusions from those premises, valid or invalid as they may be.  Perhaps a juror's decision is not so tricky, since they can use the inherent uncertainty to justify the default not-guilty vote whenever guilt is not rigorously proven.  I discuss why we should use Bayes' Theorem to do History here.

For each piece of evidence, 1-13, I will assign a probability that it would be the outcome of events on the theory of the defendant being guilty (h) or not-guilty (~h), then express the two probabilities as a ratio.  Then when we multiply the ratios together, we will have a conditional probability of guilt.  (N.B. I'm unsure about points 7 and 8 below; I'd like someone experienced with Bayesian statistics to tell me if I've got them right or wrong.)
  1. I will allow that the old man could not have clearly heard events on the floor above due to the noise of the train.  It would therefore seem that he did indeed invent or exaggerate his evidence, a finding corroborated by his inability to get to the window quickly enough to see the murderer flee.  I will therefore assign equal probabilities to the old man's evidence on either theory; in other words, his evidence is worthless.
  2. See (1).
  3. Time of death needs to correspond with the defendant being at home.
  4. The defendant cannot remember when interrogated the films he says he saw.  Allowing for a possible defect of memory or attention, I will allow a 50% probability of this happening even if he really saw them.  The failure is 100% expected if he were the murderer, and thus not at the cinema at all.  h: 100%, ~h: 50%, 2:1 for h.
  5. No alibi witness for the cinema.  Certain if he were the murderer, but possible if he were there but simply forgotten or not noticed.  h: 100%, ~h: 50%, 2:1 for h.
  6. The woman who saw the murder may need glasses due to being long-sighted rather than short-sighted.  Or she may habitually wear sunglasses.  I'll allow a generous 80% chance that she could not see what happened clearly but testified to it anyway.  h: 100%, ~h: 80%, 5:4 for h.
  7. There was an earlier argument that angered the boy, in which his father hit him twice, if we can trust the witnesses' hearing.  This supplies a possible motive, which makes him more likely to be the murderer than someone who had not argued.  But, given how many arguments take place, even when somebody is hit, that do not lead to revenge murders, this argument having happened does not greatly increase the chance of the defendant being the murderer in absolute terms.  h: 5%, ~h: 4%, 5:4 for h.  The right statistical thought-process here is not to ask how likely it is that an argument is followed by a murder, but rather how likely it is that a murder is preceded by an argument.  So, on the assumption of h, how predictable was it that the defendant would turn out to have argued with and been hit by the victim, or in some other way been given cause for violence?  Highly likely: let's say 90%.  Whereas, on the assumption of ~h, how likely was the defendant to have argued with and been hit by his father, while not being the murderer?  This depends on how regularly such an event happens.  Given that (8) tells us such paternal violence was a regular occurrence, let's guess at it happening once a week, giving a probability of 1 in 7 on any given day, or about 14%.  So: h: 90%, ~h: 14%, 6.4:1 for h.
  8. Similar issue to (7).  I'll allow more significance to regular beating as a probable background factor than a one-off argument and a couple of hits.  Let's say: h: 10%, ~h: 8%, 5:4 for h.  Again, the question should be: how likely was it that the murder would turn out to be preceded by a history of regular beatings of the defendant, if the defendant was the murderer?  Probably quite high, since violence begets violence, and murderers are more likely than the average, I suppose, to have been subjected to violence.  But probably not as high as the probability of a recent bout of violence having occasioned a murder.  Let's say 50%.  And how likely was it that the defendant would turn out to have been regularly beaten, if he were not the murderer?  That would be the general rate at which non-murdering young men are subject to childhoods full of beatings.  Let's say, for 1957, 1 in 10, so 10%.  Thus: h: 50%, ~h: 10%, 5:1 for h.
  9. The jury should not take his past record into consideration. 
  10. What is the chance of the defendant buying a replica of the future murder weapon shortly before the murder, then losing it, while somebody else commits the murder with such a weapon?  This would be very unlucky!  This is just the sort of excuse the defendant would have to make up if he were the murderer.  Let's say: h: 95%, ~h: 5%, 19:1 for h.
  11. I'll allow this assessment of the evidence: it was improbable the defendant would stab with this technique.  Let's say: h: 10%, ~h: 50%, 1:5 against h.
  12. Other people may have had a motive to kill the father.  What are the chances of this being the case if the defendant were guilty?  Obviously higher than for most people, given the father's behaviour.  What are the chances if the defendant were innocent?  A little bit higher still, since, on the hypothesis that the defendant was not guilty, somebody else did in fact commit the murder.  On the other hand, there are motiveless murders.  The fact that there were alternative potential murderers does not help the defendant much unless it was more likely that one of them would be the murderer, i.e. unless a plausible alternative culprit and series of events could be suggested by the defence.  But it helps a little to have unspecified alternatives.  h: 40%, ~h: 50%, 4:5 against h.
  13. The defendant returning home looks good for his innocence: had he not really been out at the cinema, and thus ignorant of what had occurred, he would have expected to be arrested on his return.  Or he might have returned to retrieve the murder weapon.  I would say the former argument is stronger: h: 10%, ~h: 100%, 1:10 against h.


Conclusion

Now we multiply the ratios for each piece of significant evidence together:

(2x2x5x6.4x5x19x1x4x1) / (1x1x4x1x1x1x5x5x10) = 48,640 / 1,000 = 48.6 / 1.

Thus I estimate the defendant was over 48 times more likely to be guilty than innocent.

Expressed as a percentage, I rate him as 98% likely to be guilty.
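The arithmetic can be checked with a short script.  The ratios are the ones estimated in the probability analysis; the variable names are just illustrative, and the prior odds of 1:1 reflect the 'indifferent' 50% prior assumed earlier.

```python
from math import prod

# Ratios (for h, for ~h) from the probability analysis, keyed by evidence
# number.  Items 1, 2, 3 and 9 were not given a ratio that shifts the odds,
# so they are left out.
ratios = {
    4: (2, 1),
    5: (2, 1),
    6: (5, 4),
    7: (6.4, 1),
    8: (5, 1),
    10: (19, 1),
    11: (1, 5),
    12: (4, 5),
    13: (1, 10),
}

prior_odds = 1.0  # indifferent prior: 50% guilty, 50% not guilty
combined_ratio = prod(h / not_h for h, not_h in ratios.values())
posterior_odds = prior_odds * combined_ratio
p_guilty = posterior_odds / (1 + posterior_odds)

print(f"odds of guilt: {posterior_odds:.1f}:1")
print(f"probability of guilt: {p_guilty:.0%}")
```

Changing any single ratio in the table and re-running shows how sensitive the verdict is to that one estimate.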

So even once you throw out the old man's evidence as false witness, there really is a case beyond reasonable doubt.

So the 12 Angry Men were probably wrong to let the defendant go free!