The statistical study of data deals with two fundamental questions: How can we describe and understand a situation when we have all the pertinent data about it? How can we infer features of all the data when we know only some of the data?
The first three rules of statistics should be: Draw a picture, draw a picture, draw a picture. A visual representation of data reveals patterns and relationships, for example, the distribution of one variable, or an association between two variables.
The logic of statistical inference is to compare data that we collect to expectations about what the data would be if the world were random in some particular respect. Randomness and probability are the cornerstones of all methods for testing hypotheses.
This lecture defines and explores standard deviation, which measures how widely data are spread from the mean. Different measures of dispersion have different properties, and those properties determine which measure is best to use in a given setting.
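As a concrete illustration (the numbers below are invented, and the lecture itself may compute this differently), here is a minimal Python sketch of the sample standard deviation:

```python
import math

data = [4, 8, 6, 5, 3, 7]  # hypothetical data set

mean = sum(data) / len(data)
# Sample variance: average squared deviation from the mean,
# divided by n - 1 (Bessel's correction).
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}, standard deviation = {std_dev:.2f}")
```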
A curve of virtually any shape can model a data set. This lecture looks at skewed and bimodal shapes, and describes other characteristically shaped classes of distributions, including exponential and Poisson. Each shape arises naturally in specific settings.
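A sketch of two of these shapes, assuming only Python's standard library and arbitrary parameters of my own choosing:

```python
import random
from collections import Counter

random.seed(1)

# Exponential: waiting times between random events (mean 2.0 here).
waits = [random.expovariate(1 / 2.0) for _ in range(10_000)]
print(f"exponential sample mean ~ {sum(waits) / len(waits):.2f}")

# Poisson: counts of events per unit interval. The standard library
# has no Poisson generator, so count exponential waits fitting in one unit.
def poisson_draw(rate):
    count, total = 0, random.expovariate(rate)
    while total < 1.0:
        count += 1
        total += random.expovariate(rate)
    return count

counts = Counter(poisson_draw(3.0) for _ in range(10_000))
for k in sorted(counts):
    print(k, "#" * (counts[k] // 200))  # crude text histogram
```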
The most famous shape of distributions is the bell-shaped curve, also called a normal curve or a Gaussian distribution. This lecture explores its properties and why it arises so frequently, as in the central limit theorem, one of the core insights on which statistical inference is based.
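A minimal simulation sketch of the central limit theorem (my illustration, not the lecture's): averages of many draws from a decidedly non-bell-shaped distribution still pile up in a bell shape around the true mean.

```python
import random
from collections import Counter

random.seed(0)

# Roll a fair die (uniform, not bell-shaped) and average 50 rolls;
# repeat many times and look at how the averages distribute.
def sample_mean(n_rolls=50):
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls

means = [sample_mean() for _ in range(20_000)]

# Crude text histogram: the averages cluster symmetrically near 3.5.
bins = Counter(round(m, 1) for m in means)
for b in sorted(bins):
    print(f"{b:3.1f} {'#' * (bins[b] // 100)}")
```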
One way we attempt to understand the world is to identify cases of cause and effect. In statistics, the challenge is to describe and measure the relationship between two variables, for example, incoming SAT scores and college grade point averages.
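To make the measurement concrete, here is a sketch of the Pearson correlation coefficient, the standard measure of linear association; the SAT and GPA figures are invented for illustration:

```python
import math

# Hypothetical paired data: SAT scores and college GPAs.
sat = [1100, 1250, 1320, 1400, 1180, 1500]
gpa = [2.8, 3.1, 3.4, 3.6, 3.0, 3.9]

n = len(sat)
mx, my = sum(sat) / n, sum(gpa) / n

# Pearson r: covariance divided by the product of the spreads.
cov = sum((x - mx) * (y - my) for x, y in zip(sat, gpa))
sx = math.sqrt(sum((x - mx) ** 2 for x in sat))
sy = math.sqrt(sum((y - my) ** 2 for y in gpa))

r = cov / (sx * sy)
print(f"correlation r = {r:.3f}")  # near +1: strong positive association
```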
Probability accomplishes the seemingly impossible feat of putting a useful, numerical value on the likelihood of random events. Our intuition about what to expect from randomness is often far from accurate. This lecture looks at several examples that place intuition and reality far apart.
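One classic case of intuition and reality far apart (chosen here for illustration; the lecture's own examples may differ) is the birthday problem: among just 23 people, a shared birthday is more likely than not. A quick simulation:

```python
import random

random.seed(42)

def has_shared_birthday(group_size=23):
    # Assign each person a random birthday (ignoring leap years).
    birthdays = [random.randint(1, 365) for _ in range(group_size)]
    return len(set(birthdays)) < group_size

trials = 100_000
hits = sum(has_shared_birthday() for _ in range(trials))
print(f"P(shared birthday among 23) ~ {hits / trials:.3f}")  # about 0.507
```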
Sampling is a technique for inferring features of a whole population from information about some of its members. A familiar example is a political poll. Interesting issues and problems arise in taking and using samples. Examples of potential pitfalls are explored.
This lecture introduces a fundamental strategy of statistical inference called hypothesis testing. The method involves assessing whether observed data are consistent with a claim about the population in order to determine whether the claim might be false. Drug testing is a common application.
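A minimal sketch of the logic, using an invented drug trial: under the null hypothesis the drug does nothing, and we ask how surprising the observed data would be if only chance were at work.

```python
from math import comb

# Hypothetical trial: 100 patients, 62 improve. Null hypothesis:
# the drug is useless and each patient improves with probability 0.5.
n, observed = 100, 62

# One-sided p-value: probability of seeing 62 or more improvements
# by chance alone under the null (binomial tail probability).
p_value = sum(comb(n, k) * 0.5 ** n for k in range(observed, n + 1))
print(f"p-value = {p_value:.4f}")  # roughly 0.01: surprising if the drug does nothing
```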
Headlines at election time frequently trumpet statistics such as: "Candidate A will receive 59 percent of the vote, with a margin of error of plus or minus 3 percent." This lecture investigates what this "margin of error" statement means and why it is incomplete as written.
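What the headline omits is the confidence level. A sketch of the standard calculation for a proportion, assuming a simple random sample and the usual unstated default of 95 percent confidence (sample size invented):

```python
import math

p_hat = 0.59   # reported sample proportion
n = 1067       # hypothetical sample size

# 95% margin of error for a proportion: z * sqrt(p(1-p)/n), z ~ 1.96.
z = 1.96
moe = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"margin of error = +/- {moe * 100:.1f} percentage points")
# The "+/- 3 percent" claim is incomplete without the confidence
# level: even at 95% confidence, about 1 poll in 20 will still miss
# by more than the stated margin.
```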
To gather data from which conclusions can be drawn confidently, it's important to think ahead. Double-blind experiments and other strategies can help meet the goal of good experimental design.
Opening the second part of the course, which deals with applying statistics, this lecture focuses on two examples of courtroom drama: a hit-and-run accident and a gender-discrimination case. In both, the analysis of statistics aids in reaching a fair verdict.
An election assembles individual opinions into one societal decision. This lecture considers a surprising reality about elections: The outcome may have less to do with voters' preferences than with the voting method used, especially when three or more candidates are involved.
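A small sketch with an invented preference profile shows how the same ballots can crown different winners under different counting methods:

```python
from collections import Counter

# Invented profile: (ranking, number of voters); first name is first choice.
ballots = [
    (("A", "B", "C"), 8),   # 8 voters rank A first, then B, then C
    (("B", "C", "A"), 7),
    (("C", "B", "A"), 6),
]

# Plurality: count only first choices.
plurality = Counter()
for ranking, voters in ballots:
    plurality[ranking[0]] += voters
print("plurality winner:", plurality.most_common(1)[0][0])  # A

# Borda count: 2 points for 1st place, 1 for 2nd, 0 for 3rd.
borda = Counter()
for ranking, voters in ballots:
    for points, candidate in zip((2, 1, 0), ranking):
        borda[candidate] += points * voters
print("Borda winner:", borda.most_common(1)[0][0])  # B
```

With these same ballots, B also beats both A and C in head-to-head contests, so the plurality winner A would lose every pairwise matchup: the "will of the voters" depends on how the votes are counted.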
The challenge of choosing an election winner can be thought of as taking voters' rank orderings of candidates and returning a societal rank ordering. A mathematically similar situation occurs when trying to determine what type of engine lasts longest among competing versions.
Analyzing statistical data in sports is a sport of its own. This lecture asks, "Who is the best hitter in baseball history?" The question presents statistical challenges in comparing performances in different eras. Another mystery is also probed: "Is the 'hot hand' phenomenon real, or is it random?"
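On the "hot hand" question, a simulation sketch (my illustration): even a purely random shooter who makes 50 percent of shots produces streaks that look hot.

```python
import random

random.seed(7)

# A "shooter" who makes each shot independently with probability 0.5.
shots = [random.random() < 0.5 for _ in range(1000)]

# Longest run of consecutive makes in a purely random sequence.
longest = current = 0
for made in shots:
    current = current + 1 if made else 0
    longest = max(longest, current)

print(f"longest streak of makes in 1000 random shots: {longest}")
# Typically around 9 or 10: streaks alone are not evidence of a hot hand.
```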
A discussion of strategies for estimating the number of Mark V tanks produced by the Germans in World War II brings up the idea of expected value, a central concept in the risky business of buying and selling insurance.
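This estimation puzzle is often called the German tank problem. One standard estimator (not necessarily the one featured in the lecture) scales up the largest serial number observed; a sketch with a simulated fleet:

```python
import random

random.seed(3)

N = 1000                       # true (unknown) number of tanks
k = 15                         # tanks captured
serials = random.sample(range(1, N + 1), k)

# Estimate: largest serial seen, plus the average gap between serials,
# N_hat = m * (1 + 1/k) - 1, where m is the maximum observed.
m = max(serials)
n_hat = m * (1 + 1 / k) - 1
print(f"largest serial = {m}, estimate = {n_hat:.0f} (true N = {N})")
```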
Tax authorities often need to set valuations for every house in a tax district. The challenge is to use the data about recently sold houses to assess the values of all the houses. This classic example of statistical inference introduces the idea of multiple linear regression.
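A minimal sketch of the idea, with invented house data and numpy's least-squares solver standing in for whatever software the lecture uses:

```python
import numpy as np

# Invented data: [square feet, bedrooms] -> sale price (thousands).
X = np.array([[1500, 3], [2100, 4], [900, 2], [1800, 3], [1200, 2]],
             dtype=float)
y = np.array([250, 340, 160, 290, 200], dtype=float)

# Add an intercept column, then solve for the coefficients that
# minimize the sum of squared residuals.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, per_sqft, per_bedroom = coef

# Assess an unsold 1600 sq ft, 3-bedroom house from the fitted model.
price = intercept + per_sqft * 1600 + per_bedroom * 3
print(f"predicted price: {price:.0f} thousand")
```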
Statistics can be used to deceive as well as enlighten. This lecture explores deceptive practices such as concealing lurking variables, using biased samples, focusing on rare events, reporting handpicked data, extrapolating trends unrealistically, and confusing correlation with causation.
This lecture addresses two topics that come up when applying statistics to social sciences: factor analysis, which seeks to identify underlying factors that explain correlation among a larger group of measured quantities, and possible limitations of hypothesis testing.
Medical treatments are commonly based on statistical studies. Aspects to consider in contemplating treatment include the characteristics of the study group and the difference between correlation and causation. Another statistical concept, regression to the mean, explains why quack medicines can appear to work.
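A simulation sketch of regression to the mean (my illustration): patients enroll when they feel worst, and on average they improve afterward even with no treatment at all.

```python
import random

random.seed(5)

# Each "patient" has a stable baseline plus day-to-day random noise.
def symptom_score(baseline):
    return baseline + random.gauss(0, 10)

patients = [random.gauss(50, 5) for _ in range(10_000)]  # baselines

# Enroll only those whose score today is extreme (feeling worst).
enrolled = [b for b in patients if symptom_score(b) > 70]

# Measure again later, with no treatment given in between.
after = sum(symptom_score(b) for b in enrolled) / len(enrolled)
print(f"average score falls from >70 at enrollment to about {after:.0f}")
# Scores drift back toward each patient's baseline, so an ineffective
# remedy taken in the meantime will look like it worked.
```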
Economics relies on a wealth of statistical data, including income levels, the balance of trade, the deficit, the stock market, and the consumer price index. A surprising result of such data is that the leading digits of numbers do not occur with equal frequency, and that provides a statistical method for detecting fraud.
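The pattern in question is Benford's law: in many naturally occurring data sets, the leading digit d appears with probability log10(1 + 1/d), so 1 leads about 30 percent of the time. A sketch comparing the law to a simple multiplicative growth process (parameters invented):

```python
import math
import random

# Benford's law: expected frequency of each leading digit 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Quantities that grow multiplicatively (prices, populations) tend to
# follow it; simulate compounding growth and tally first digits.
random.seed(11)
tallies = {d: 0 for d in range(1, 10)}
value = 1.0
for _ in range(100_000):
    value *= 1 + random.uniform(0.0, 0.2)   # random growth step
    if value > 1e12:
        value /= 1e12                        # rescale to avoid overflow
    tallies[int(str(value)[0])] += 1

for d in range(1, 10):
    print(d, f"benford {benford[d]:.3f}", f"simulated {tallies[d] / 100_000:.3f}")
```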
Statistics is essential in sciences from weather forecasting to quantum physics. This lecture discusses the statistics-based research of Johannes Kepler, Edwin Hubble, and Gregor Mendel. In Mendel's case, statisticians have looked at his studies of the genetics of pea plants and discovered data that are too good to be true.
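The "too good to be true" judgment rests on a goodness-of-fit idea: Mendel's counts sit closer to the predicted 3:1 ratios than chance should allow. A sketch of the chi-square statistic for one hypothetical cross (the counts below are invented, not Mendel's):

```python
# Hypothetical counts for a 3:1 cross of 1000 plants:
# 748 dominant and 252 recessive, versus expected 750 and 250.
observed = [748, 252]
expected = [750, 250]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi-square = {chi_sq:.3f}")
# A very small statistic, repeated across many experiments, is itself
# suspicious: real random samples should stray from expectation more.
```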
The importance of statistics will only increase as greater computer speed and capacity make dealing with ever-larger data sets possible. Statistics has limits that need to be respected, but its potential for helping us find meaning in our data-driven world is enormous and growing.