Monday, October 23, 2006

statistics chapter 3 & 4

Statistics -- Chapter 3

summarizing data

Parameter -- a descriptive measure of a population

statistic -- a descriptive measure of a sample

arithmetic mean -- the arithmetic mean of a variable is computed by determining the sum of all the values of the variable in the data set and dividing by the number of observations.

Population arithmetic mean (called mu) -- is computed using all the individuals in a population. The population mean is a parameter

sample arithmetic mean (called x-bar) -- is computed using sample data. The sample mean is a statistic

median -- the median of a variable is the value that lies in the middle of the data when arranged in ascending order. Half the data are below the median and half the data are above the median. We use M to represent the median

mode -- the mode of a variable is the most frequent observation of the variable that occurs in the data set

bimodal -- the data set has two modes
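As a quick check on the three definitions above, here is a minimal sketch using Python's standard statistics module. The data values are made up for illustration and do not come from the chapter.

```python
from statistics import mean, median, multimode

# Made-up sample data, chosen only for illustration.
scores = [82, 77, 90, 71, 62, 68, 74, 84, 94, 88]

# Sample mean (x-bar): sum of the values divided by the number of observations.
x_bar = mean(scores)

# Median (M): middle value of the sorted data; with an even number of
# observations it is the average of the two middle values.
M = median(scores)

# multimode returns every value tied for most frequent, so a bimodal
# data set comes back as a list with two entries.
shoe_sizes = [8, 9, 9, 10, 10, 11]
modes = multimode(shoe_sizes)

print(x_bar, M, modes)
```

Here shoe_sizes is bimodal because 9 and 10 each occur twice.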

compute the range of the variable from raw data

the simplest measure of dispersion is the range. To compute the range, the data must be quantitative.

The range, R, of a variable is the difference between the largest data value and the smallest data value.

Range = R = largest data value - smallest data value
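The range formula is a one-liner in code. A small sketch with made-up data:

```python
# Made-up quantitative data, for illustration only.
data = [82, 77, 90, 71, 62, 68, 74, 84, 94, 88]

# Range = largest data value - smallest data value
R = max(data) - min(data)

print(R)
```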

Population variance -- the population variance of a variable is the sum of the squared deviations about the population mean divided by the number of observations in the population

computational formula -- an algebraically equivalent form of the population variance formula that is easier to compute by hand
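The sketch below computes the population variance two ways: from the definition, and from the usual computational form, (sum of x squared - (sum of x) squared / N) / N. The data are made up, and the variable names are my own choices rather than notation from the chapter.

```python
from math import sqrt
from statistics import pvariance

# Made-up data, treated here as an entire population.
population = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(population)
mu = sum(population) / N  # population mean

# Definition: sum of squared deviations about the mean, divided by N.
var_definition = sum((x - mu) ** 2 for x in population) / N

# Computational form: (sum of x^2 - (sum of x)^2 / N) / N.
var_computational = (sum(x * x for x in population) - sum(population) ** 2 / N) / N

# The population standard deviation is the square root of the variance.
sigma = sqrt(var_definition)

# The standard library's pvariance should agree with both.
print(var_definition, var_computational, pvariance(population), sigma)
```

Both forms give the same number; the computational one just avoids working out each deviation separately.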

bias -- a statistic is biased when it consistently overestimates or underestimates a parameter

weighted mean -- the weighted mean of the variable is found by multiplying each value of the variable by its corresponding weight, summing these products, and dividing the result by the sum of the weights
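A small sketch of the weighted mean, using a grade point average as the example. The grade points and credit hours are invented, and weighted_mean is a helper name of my own.

```python
def weighted_mean(values, weights):
    """Multiply each value by its weight, sum the products, divide by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical example: grade points earned in three courses,
# weighted by the credit hours of each course.
grade_points = [4, 3, 2]
credit_hours = [3, 4, 3]

gpa = weighted_mean(grade_points, credit_hours)
print(gpa)
```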

z-score -- represents the distance that a data value is from the mean in terms of the number of standard deviations. It is obtained by subtracting the mean from the data value and dividing this result by the standard deviation
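The z-score calculation as a short sketch. The mean of 70 and standard deviation of 10 are assumed values picked to keep the arithmetic easy.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Assumed values for illustration: mean 70, standard deviation 10.
above = z_score(85, 70, 10)  # 1.5 standard deviations above the mean
below = z_score(60, 70, 10)  # 1 standard deviation below the mean

print(above, below)
```

A positive z-score means the value is above the mean, and a negative one means it is below.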

Summary

Measures of central tendency are used to indicate the typical value of a distribution. The mean measures the center of gravity of the distribution. The median separates the bottom 50% of the data from the top 50%. Both measures require that the data be quantitative. The mode measures the most frequent observation. The data can be either quantitative or qualitative to compute the mode. The median is resistant to extreme values, while the mean is not. A comparison between the median and the mean can help determine the shape of the distribution.

Measures of dispersion describe the spread in the data. The range is the difference between the highest and lowest data value. The variance measures the average squared deviation about the mean. The standard deviation is the square root of the variance. The mean and standard deviation are used in many types of statistical inference.

The mean, median, and mode can be approximated from grouped data. The variance and standard deviation can also be approximated from grouped data.

We can determine the relative position of any observation in a data set using z-scores and percentiles. Z-scores denote how many standard deviations an observation is from the mean. Percentiles determine the percent of observations that lie above and below an observation. The upper and lower fences can be used to identify potential outliers. Any potential outlier must be investigated to determine whether it was the result of a data entry error, some other error in the data collection process, or an unusual value in the data set.

The interquartile range is also a measure of dispersion. The five-number summary provides an idea about the center and spread of a data set, through the median and the interquartile range. The length of the tails in the distribution can be determined from the smallest and largest data values. The five-number summary is used to construct box plots. Box plots can be used to describe the shape of the distribution.
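Here is a rough sketch of the five-number summary, the interquartile range, and the fences. It assumes the quartiles are the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd) and that the fences sit 1.5 times the interquartile range beyond the quartiles. Other quartile conventions exist and can give slightly different numbers. The data and the function name are made up.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # values below the middle
    upper = xs[half + 1:] if n % 2 else xs[half:]   # values above the middle
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

# Made-up data with one suspiciously large value.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15, 40]

smallest, q1, M, q3, largest = five_number_summary(data)
iqr = q3 - q1

# Fences: anything outside them is only a potential outlier to investigate.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print((smallest, q1, M, q3, largest), iqr, (lower_fence, upper_fence), outliers)
```

In this example the value 40 falls above the upper fence, so it is flagged for a closer look rather than thrown out automatically.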

Statistics -- Chapter 4

relations between variables

Response variable -- the variable whose value can be explained by the value of the explanatory or predictor variable

scatter diagram -- graph that shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram. The explanatory variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis. Do not connect the points when drawing a scatter diagram

least squares regression criterion -- the least squares regression line is the one that minimizes the sum of the squared errors (or residuals). It is the line that minimizes the sum of the squared vertical distances between the observed values and those predicted by the line

coefficient of determination -- measures the percentage of total variation in the response variable that is explained by the least squares regression line
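To make the last two definitions concrete, the sketch below fits a least squares line by hand and then computes the coefficient of determination as 1 minus (sum of squared residuals / total variation in the response). The data points and function names are my own, not from the chapter.

```python
def least_squares(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Proportion of the variation in ys explained by the fitted line."""
    y_bar = sum(ys) / len(ys)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Made-up data: xs is the explanatory variable, ys the response variable.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)

# Predicted response for a given value of the explanatory variable.
predicted = slope * 3.5 + intercept

print(round(slope, 4), round(intercept, 4), round(r2, 4), round(predicted, 4))
```

For these points the line works out to roughly y = 0.6x + 2.2, and about 60% of the variation in the response is explained by the line.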

Summary

The first step in identifying the type of relation that might exist is to draw a scatter diagram. The explanatory variable is plotted on the horizontal axis and the corresponding response variable on the vertical axis. The scatter diagram can be used to discover whether the relation between the explanatory and the response variables is linear. In addition, for linear relations, we can judge whether the association is positive or negative.

A numerical measure for the strength of linear relation between two quantitative variables is the linear correlation coefficient. It is a number between -1 and 1, inclusive. Values of the correlation coefficient near -1 are indicative of a negative linear relation between the two variables. Values of the correlation coefficient near +1 indicate a positive linear relation between the two variables. If the correlation coefficient is near zero, then there is little linear relation between the two variables.
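A minimal sketch of the linear correlation coefficient, written out by hand so the pieces are visible. The data are made up and linear_correlation is a name of my own.

```python
from math import sqrt

def linear_correlation(xs, ys):
    """Linear correlation coefficient r, a number between -1 and 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data for illustration.
hours = [1, 2, 3, 4, 5]

r_positive = linear_correlation(hours, [2, 4, 5, 4, 5])      # points trend upward
r_negative = linear_correlation(hours, [10, 8, 6, 4, 2])     # points fall exactly on a downward line

print(round(r_positive, 4), round(r_negative, 4))
```

The first pair of lists gives r of about 0.77, a fairly strong positive linear relation. The second gives -1 because the points lie exactly on a line with negative slope.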

Once a linear relation between the two variables has been discovered, we describe the relation by finding the least squares regression line. This line best describes the linear relation between the explanatory and the response variables. We can use the least squares regression line to predict a value of the response variable for a given value of the explanatory variable.

Whenever a least squares regression line is obtained, certain diagnostics must be performed. These include verifying that the residuals have constant variance, verifying that the linear model is appropriate, and checking for outliers and influential observations.

What is worth mentioning again is that a researcher should never claim causation between two variables in a study unless the data are experimental. Observational data allow us to say that two variables might be associated, but we cannot claim causation.