Saturday, June 09, 2007

frequency distribution, cross tabulation, and hypothesis testing

Chapter 15

Basic data analysis provides valuable insights and guides the rest of the data analysis as well as the interpretation of the results. A frequency distribution should be obtained for each variable in the data. This analysis produces a table of frequency counts, percentages, and cumulative percentages for all the values associated with that variable. It indicates the extent of out-of-range, missing, or extreme values. The mean, mode, and median of a frequency distribution are measures of central tendency. The variability of the distribution is described by the range, the variance or standard deviation, the coefficient of variation, and the interquartile range. Skewness and kurtosis provide an idea of the shape of the distribution.
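As a rough illustration of the frequency table and summary statistics described above, here is a minimal sketch using only the Python standard library. The `responses` data and the `frequency_table` helper are made up for this example; they are not from the chapter. The variance shown is the sample variance (dividing by n - 1), and quartile conventions differ between packages, so the interquartile range may not match other software exactly.

```python
from collections import Counter
import statistics

# Hypothetical responses on a 1-to-5 rating scale (illustrative data only).
responses = [1, 2, 3, 3, 4, 4, 4, 4, 5, 5]


def frequency_table(values):
    """Return (value, count, percent, cumulative percent) rows, sorted by value."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        percent = 100.0 * counts[value] / n
        cumulative += percent
        rows.append((value, counts[value], percent, cumulative))
    return rows


table = frequency_table(responses)
print("value  count  percent  cum. percent")
for value, count, percent, cumulative in table:
    print(f"{value:>5}  {count:>5}  {percent:>7.1f}  {cumulative:>12.1f}")

# Measures of central tendency
mean = statistics.mean(responses)
median = statistics.median(responses)
modes = statistics.multimode(responses)  # a list, because modes can tie

# Measures of variability
value_range = max(responses) - min(responses)
variance = statistics.variance(responses)  # sample variance, divides by n - 1
std_dev = statistics.stdev(responses)
coeff_of_variation = std_dev / mean        # standard deviation relative to the mean
q1, _, q3 = statistics.quantiles(responses, n=4)
interquartile_range = q3 - q1              # spread of the middle 50% of observations

print("mean:", mean, "median:", median, "mode(s):", modes)
print("range:", value_range, "variance:", round(variance, 3),
      "std dev:", round(std_dev, 3), "CV:", round(coeff_of_variation, 3),
      "IQR:", interquartile_range)
```

A value that shows up in the table but lies outside the 1-to-5 scale would be an out-of-range response, which is one of the things the frequency table is meant to expose.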

Cross tabulations are tables that reflect the joint distribution of two or more variables. In cross tabulation, the percentages can be computed either column wise, based on column totals, or row wise, based on row totals. The general rule is to compute the percentages in the direction of the independent variable, across the dependent variable. Often the introduction of a third variable can provide additional insights. The chi-square statistic provides a test of the statistical significance of the observed association in a cross tabulation. The phi coefficient, contingency coefficient, Cramer's V, and the lambda coefficient provide measures of the strength of association between the variables.
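The percentage rule and the chi-square statistic can be sketched for a small table as follows. The counts in `observed` are hypothetical, and the `chi_square_summary` helper is an illustrative name, not something from the chapter. The sketch assumes the independent variable defines the columns, so percentages are computed on column totals. It stops at the statistic and its degrees of freedom; comparing against a critical value or computing a p-value would need a chi-square table or a statistics package.

```python
import math

# Hypothetical 2 x 2 cross tabulation (illustrative counts only).
# Rows: categories of the dependent variable.
# Columns: categories of the independent variable.
observed = [
    [30, 10],
    [20, 40],
]


def chi_square_summary(table):
    """Return column percentages, chi-square, degrees of freedom, and Cramer's V."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Percentages in the direction of the independent (column) variable.
    col_pct = [
        [100.0 * cell / col_totals[j] for j, cell in enumerate(row)]
        for row in table
    ]

    # Expected counts if the two variables were independent.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    chi_square = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    n_rows, n_cols = len(table), len(table[0])
    dof = (n_rows - 1) * (n_cols - 1)

    # Cramer's V; for a 2 x 2 table this equals the absolute phi coefficient.
    cramers_v = math.sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))
    return col_pct, chi_square, dof, cramers_v


col_pct, chi_square, dof, cramers_v = chi_square_summary(observed)
print("column percentages:", col_pct)
print("chi-square:", round(chi_square, 3), "df:", dof, "Cramer's V:", round(cramers_v, 3))
```

Reading down each column of `col_pct` shows how the dependent variable is distributed within each category of the independent variable, which is the comparison the rule above is meant to support.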

Parametric and non-parametric tests are available for testing hypotheses related to differences. In the parametric case, the t test is used to examine hypotheses related to the population mean. Different forms of the t test are suitable for testing hypotheses based on one sample, two independent samples, or paired samples. In the nonparametric case, popular one-sample tests include the Kolmogorov-Smirnov test, the chi-square test, the runs test, and the binomial test. For two independent nonparametric samples, the Mann-Whitney U test, the median test, and the Kolmogorov-Smirnov test can be used. For paired samples, the Wilcoxon matched-pairs signed-ranks test and the sign test are useful for examining hypotheses related to measures of location.
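The three forms of the t test mentioned above differ mainly in how the standard error and degrees of freedom are computed. The sketch below is a minimal standard-library illustration with made-up data. The function names are hypothetical, and the two-sample version assumes equal population variances (pooled variance), which is only one of the possible variants. Each function returns the t statistic and its degrees of freedom; a p-value would come from a t table or a statistics package.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t statistic for H0: population mean equals mu0."""
    n = len(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / std_error, n - 1


def independent_samples_t(a, b):
    """t statistic for two independent samples, assuming equal variances."""
    n1, n2 = len(a), len(b)
    pooled_var = (
        (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    ) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(a) - statistics.mean(b)) / std_error, n1 + n2 - 2


def paired_samples_t(first, second):
    """t statistic for paired samples: a one-sample test on the differences."""
    diffs = [x - y for x, y in zip(first, second)]
    return one_sample_t(diffs, 0.0)


# Hypothetical data for illustration only.
group_a = [4, 5, 6, 7, 8]
group_b = [2, 3, 4, 5, 6]
before = [10, 12, 9, 11]
after = [8, 11, 7, 10]

t_one, df_one = one_sample_t(group_a, 5)
t_ind, df_ind = independent_samples_t(group_a, group_b)
t_pair, df_pair = paired_samples_t(before, after)

print("one sample:", round(t_one, 3), "df", df_one)
print("independent samples:", round(t_ind, 3), "df", df_ind)
print("paired samples:", round(t_pair, 3), "df", df_pair)
```

The paired version reuses the one-sample function because pairing reduces the problem to a single sample of differences tested against zero.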

frequency distribution -- a mathematical distribution whose objective is to obtain a count of the number of responses associated with different values of one variable and to express these counts in percentage terms
measures of location -- a statistic that describes a location within a data set. Measures of central tendency describe the center of the distribution
mean -- the average; that value obtained by summing all elements in a set and dividing by the number of elements
mode -- a measure of central tendency given as the value that occurs the most in a sample distribution
median -- a measure of central tendency given as the value above which half of the values fall and below which half of the values fall
measures of variability -- a statistic that indicates the distribution's dispersion
range -- the difference between the largest and smallest values of a distribution
interquartile range -- the range of a distribution encompassing the middle 50% of the observations
variance -- the mean squared deviation of all the values from the mean
standard deviation -- the square root of the variance
coefficient of variation -- a useful expression in sampling theory for the standard deviation as a percentage of the mean
skewness -- a characteristic of a distribution that assesses its symmetry about the mean
kurtosis -- a measure of the relative peakedness or flatness of the curve defined by the frequency distribution
null hypothesis -- a statement in which no difference or effect is expected. If the null hypothesis is not rejected, no changes will be made
alternative hypothesis -- a statement that some difference or effect is expected. Accepting the alternative hypothesis will lead to changes in opinions or actions
one tailed test -- a test of the null hypothesis where the alternative hypothesis is expressed directionally
two tailed test -- a test of the null hypothesis where the alternative hypothesis is not expressed directionally
test statistic -- a measure of how close the sample has come to the null hypothesis. It often follows a well-known distribution, such as the normal, t, or chi-square distribution
type I error -- also known as alpha error, occurs when the sample results lead to the rejection of a null hypothesis that is in fact true
level of significance -- the probability of making a type I error
type II error -- also known as beta error, occurs when the sample results lead to the non-rejection of a null hypothesis that is in fact false
power of a test -- the probability of rejecting the null hypothesis when it is in fact false and should be rejected
Cross tabulation -- a statistical technique that describes two or more variables simultaneously and results in tables that reflect the joint distribution of two or more variables that have a limited number of categories or distinct values
contingency table -- a cross tabulation table. It contains a cell for every combination of categories of the two variables
chi-square statistic -- the statistic used to test the statistical significance of the observed association in a cross tabulation. It assists us in determining whether a systematic association exists between the two variables
chi-square distribution -- a skewed distribution whose shape depends solely on the number of degrees of freedom. As the number of degrees of freedom increases, the chi-square distribution becomes more symmetrical
phi coefficient -- a measure of the strength of association in the special case of a table with two rows and two columns
contingency coefficient (C) -- a measure of the strength of association in a table of any size
Cramer's V -- a measure of the strength of association used in tables larger than 2 x 2
asymmetric lambda -- a measure of the percentage improvement in predicting the value of the dependent variable, given the value of the independent variable in contingency table analysis. Lambda also varies between zero and one
symmetric lambda -- the symmetric lambda does not make an assumption about which variable is dependent. It measures the overall improvement when prediction is done in both directions
tau b -- test statistic that measures the association between two ordinal-level variables. It makes adjustment for ties and is most appropriate when the table of variables is square
tau c -- test statistic that measures the association between two ordinal-level variables. It makes adjustment for ties and is most appropriate when the table of variables is rectangular rather than square
Gamma -- test statistic that measures the association between two ordinal-level variables. It does not make an adjustment for ties
parametric tests -- hypothesis testing procedures that assume that the variables of interest are measured on at least an interval scale
non-parametric tests -- hypothesis testing procedures that assume that the variables are measured on a nominal or ordinal scale
t test -- a univariate hypothesis test using the t distribution, which is used when the standard deviation is unknown and the sample size is small
t statistic -- a statistic that assumes that the variable has a symmetric bell-shaped distribution, the mean is known (or assumed to be known), and the population variance is estimated from the sample
t distribution -- symmetric bell shaped distribution that is useful for small sample testing
z test -- a univariate hypothesis test using the standard normal distribution
independent samples -- two samples that are not experimentally related. The measurement of one sample has no effect on the values of the second sample
F test -- a statistical test of the equality of the variances of two populations
F statistic -- the F statistic is computed as the ratio of two sample variances
F distribution -- a frequency distribution that depends on two sets of degrees of freedom -- the degrees of freedom in the numerator and the degrees of freedom in the denominator
paired samples -- in hypothesis testing, the observations are paired so that two sets of observations relate to the same respondents
paired samples t test -- a test for differences in the means of paired samples
Kolmogorov-Smirnov one-sample test -- a one-sample nonparametric goodness of fit test that compares the cumulative distribution function for a variable with a specified distribution
runs test -- a test of randomness for a dichotomous variable
binomial test -- a goodness of fit statistical test for dichotomous variables. It tests the goodness of fit of the observed number of observations in each category to the number expected under a specified binomial distribution
Mann-Whitney U test -- a statistical test for a variable measured on an ordinal scale comparing the difference in the location of two populations based on observations from two independent samples
two-sample median test -- non-parametric test statistic that determines whether two groups are drawn from populations with the same median. This test is not as powerful as the Mann-Whitney U
Kolmogorov-Smirnov two-sample test -- nonparametric test statistic that determines whether two distributions are the same. It takes into account any differences in the two distributions including median, dispersion, and skewness
Wilcoxon matched-pairs signed-ranks test -- a nonparametric test that analyzes the differences between the paired observations, taking into account the magnitude of the differences
sign test -- a nonparametric test for examining differences in the location of two populations, based on paired observations, that compares only the signs of the differences between pairs of variables without taking into account the magnitude of the differences
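
To make two of the nonparametric definitions above more concrete, here is a minimal standard-library sketch of the sign test and the Mann-Whitney U statistic. The data and function names are hypothetical. The sign test version drops tied pairs and uses an exact two-sided binomial probability, which is one common convention; the Mann-Whitney function only computes the U statistics and leaves the significance lookup to a table or a statistics package.

```python
import math


def sign_test(first, second):
    """Sign test on paired observations; uses only the signs of the differences."""
    diffs = [x - y for x, y in zip(first, second)]
    positives = sum(1 for d in diffs if d > 0)
    negatives = sum(1 for d in diffs if d < 0)
    n = positives + negatives  # tied pairs (difference of zero) are dropped
    k = min(positives, negatives)
    # Exact binomial tail under H0: each sign is equally likely (p = 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2.0 * tail)  # two-sided
    return positives, negatives, p_value


def mann_whitney_u(a, b):
    """U statistics for two independent samples; ties count as one half."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


# Hypothetical data for illustration only.
before = [10, 12, 9, 11, 14, 13]
after = [8, 11, 7, 10, 15, 12]
positives, negatives, p_value = sign_test(before, after)
print("sign test:", positives, "positive,", negatives, "negative, p =", p_value)

sample_a = [3, 5, 7]
sample_b = [1, 4, 6]
u_a, u_b = mann_whitney_u(sample_a, sample_b)
print("Mann-Whitney U:", u_a, u_b)
```

Because the sign test ignores how large each difference is, it generally has less power than the Wilcoxon matched-pairs signed-ranks test, which also ranks the magnitudes.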