What are the two main types of variables?
Categorical Variables and Continuous Variables.
What are categorical variables?
For a categorical variable the response can be categorised into a number of distinct groups.
What is a binary categorical variable?
A binary categorical variable is a categorical variable for which there are only two possible responses.
What is an ordinal categorical variable?
An ordinal categorical variable is a categorical variable for which there are 3 or more categories and those categories have a logical order.
What is a nominal categorical variable?
A nominal categorical variable is a categorical variable for which there are 3 or more categories and there is no logical ordering to those categories.
What are continuous variables?
Continuous variables are those for which the responses are numerical and may take any value on a well-defined continuous scale, e.g. height or weight.
How might you recode a continuous variable?
It is possible to recode continuous data as an ordered categorical variable. For example, weight may be recoded as an ordered categorical variable (underweight, optimal weight, pre-obese or obese) or as a binary variable (underweight or not). However, this results in a loss of information and may restrict the statistical tests that can be carried out on the new variable.
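As a rough illustration of this recoding, the sketch below maps a continuous measurement onto the four ordered categories and then onto a binary variable. It assumes the measurement is body mass index in kg/m^2 and uses cut-offs of 18.5, 25 and 30; both the choice of BMI and the cut-offs are assumptions made for the example, not values given in these notes.

```python
from bisect import bisect_right

# Assumed cut-offs (BMI in kg/m^2); illustrative only, not taken from the notes.
CUTOFFS = [18.5, 25.0, 30.0]
LABELS = ["underweight", "optimal weight", "pre-obese", "obese"]

def recode_ordinal(bmi):
    """Map a continuous BMI value to one of four ordered categories."""
    # bisect_right counts how many cut-offs are <= bmi, which is the label index.
    return LABELS[bisect_right(CUTOFFS, bmi)]

def recode_binary(bmi):
    """Collapse further to a binary variable: underweight or not."""
    return "underweight" if bmi < CUTOFFS[0] else "not underweight"

bmis = [17.2, 22.4, 27.9, 33.1]
ordinal = [recode_ordinal(b) for b in bmis]
binary = [recode_binary(b) for b in bmis]
print(ordinal)
print(binary)
```

Note that 22.4, 27.9 and 33.1 all become "not underweight": the binary version can no longer tell them apart, which is the loss of information described above.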
Describe how we can summarise categorical variables.
To summarise categorical data (including binary data), we simply count up the number of observations in each category; these counts are called frequencies. We usually express these as proportions or percentages of the total number of individuals. We can then present these numbers either in table format or graphically.
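A minimal sketch of this counting step, using made-up survey responses (the data and variable names are hypothetical):

```python
from collections import Counter

# Hypothetical responses to a three-category question.
responses = ["yes", "no", "yes", "yes", "no", "unsure", "yes", "no"]

# Frequencies: the number of observations in each category.
frequencies = Counter(responses)
total = sum(frequencies.values())

# Percentages of the total number of individuals.
percentages = {cat: 100 * count / total for cat, count in frequencies.items()}

# Present the result as a simple table, most common category first.
for category, count in frequencies.most_common():
    print(f"{category:<8}{count:>3}{percentages[category]:>8.1f}%")
```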
Describe how we can summarise continuous variables. What summary measures can we compute for continuous variables?
For continuous variables, we can summarise the data graphically using a histogram or a box plot.
For continuous variables we can also compute summary measures of data location (mean, median etc) and spread (range, standard deviation etc).
Describe how you would go about creating a histogram representative of a continuous variable.
To produce a histogram, we first group the data into ranges and then count the number of observations in each group. Together, these counts form the frequency distribution. Identifying the lowest and highest values first helps you decide how the data should be grouped. Remember, too few groups will mean detail is lost, but too many groups will leave hardly any observations in each group. Having formed the frequency distribution, we plot the number in each range to get a histogram. In a histogram the bars touch each other (unlike a bar chart) to indicate that the data are continuous.
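The grouping and counting steps can be sketched as follows. The function name, the choice of equal-width groups and the height data are all assumptions made for illustration; the sketch also assumes the values are not all identical, since the group width would then be zero.

```python
def frequency_distribution(values, n_groups):
    """Group values into n_groups equal-width ranges and count each group."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_groups
    counts = [0] * n_groups
    for v in values:
        # The highest value is placed in the last group rather than a new one.
        index = min(int((v - lowest) / width), n_groups - 1)
        counts[index] += 1
    edges = [lowest + i * width for i in range(n_groups + 1)]
    return edges, counts

# Hypothetical heights in cm.
heights = [150, 152, 155, 158, 160, 161, 163, 165, 165, 168, 170, 172, 175, 180, 190]
edges, counts = frequency_distribution(heights, 4)

# A text histogram: one row per range, one '#' per observation.
for i, count in enumerate(counts):
    print(f"{edges[i]:.0f}-{edges[i + 1]:.0f}  " + "#" * count)
```

Rerunning with a much larger or smaller `n_groups` shows the trade-off described above between losing detail and having too few observations per group.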
What are the characteristics of a normal distribution as seen on a histogram?
What would a positively skewed distribution look like?
A positively skewed distribution would have a longer tail to the right.
What would a negatively skewed distribution look like?
A negatively skewed distribution would have a longer tail on the left.
Other than drawing a histogram how else may we summarise a continuous variable?
By computing summary measures, namely a measure of where the data is located, and a measure of the spread of the data.
What are the 3 summary measures of data location for continuous variables?
What are the 3 measures of spread or variation that can be calculated for continuous variables?
How do you work out the interquartile range for a data set of continuous data?
Lower quartile (QL) = the 1/4 (n+1)th value of the ordered data
Upper quartile (QU) = the 3/4 (n+1)th value of the ordered data
IQR = QU - QL
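A short sketch of this calculation is given below. The notes do not say what to do when 1/4 (n+1) is not a whole number, so the linear interpolation between neighbouring values used here is an assumption, as are the function names and the data.

```python
def value_at_position(sorted_values, position):
    """Return the value at a 1-based, possibly fractional, position.

    Fractional positions are handled by linear interpolation between
    the two neighbouring values (an assumption, see the lead-in).
    """
    n = len(sorted_values)
    position = max(1.0, min(float(n), position))  # keep within 1..n
    lower = int(position)                         # whole part, 1-based
    fraction = position - lower
    if lower >= n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)

def quartiles(values):
    """Return (QL, QU, IQR) using the (n+1) positions."""
    ordered = sorted(values)
    n = len(ordered)
    ql = value_at_position(ordered, (n + 1) / 4)
    qu = value_at_position(ordered, 3 * (n + 1) / 4)
    return ql, qu, qu - ql

data = [2, 4, 5, 7, 8, 10, 12, 13, 15, 18, 20]  # n = 11
ql, qu, iqr = quartiles(data)
print(ql, qu, iqr)
```

With n = 11 the positions are the 3rd and 9th values, so no interpolation is needed for this data set. Python's `statistics.quantiles(data, n=4)` with its default "exclusive" method should give matching quartiles here.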
How do you work out the standard deviation for a data set of continuous data?
What is the formula for calculating standard deviation?
sd = sqrt( [sum of (measurement - mean)^2] / [number of measurements - 1] )
What is the formula for calculating variance?
variance = [sum of (measurement - mean)^2] / [number of measurements - 1]
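The two formulas can be sketched in Python as below, with the standard deviation taken as the square root of the variance. The function names and the data are made up for illustration.

```python
from math import sqrt

def sample_variance(measurements):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(measurements)
    mean = sum(measurements) / n
    return sum((x - mean) ** 2 for x in measurements) / (n - 1)

def sample_sd(measurements):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(measurements))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32
variance = sample_variance(data)
sd = sample_sd(data)
print(f"variance = {variance:.3f}, sd = {sd:.3f}")
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same (n - 1) divisor and can be used to check the results.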
For a symmetrical distribution what would you expect of the mean and the median values?
You would expect the mean and median values to be similar.
What do the following values for skewness indicate:
a. 0
b. 1
c. -1
a. a value of 0 for skewness indicates a symmetrical distribution, such as a normal distribution.
b. a positive value for skewness indicates a pile up of scores on the left of the distribution with a long right tail - positive skewing.
c. a negative value for skewness indicates a pile up of scores on the right of the distribution with a long left tail - negative skewing.
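To make the sign of the skewness concrete, the sketch below computes it for three small made-up data sets. It uses the moment coefficient of skewness (the third central moment divided by the second central moment raised to the power 1.5). This is one common definition and is an assumption here; statistical packages may apply a sample-size correction, so their values can differ slightly.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]          # no tail on either side
right_tailed = [1, 1, 2, 2, 3, 10]   # pile-up on the left, long right tail
left_tailed = [1, 8, 9, 9, 10, 10]   # pile-up on the right, long left tail

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    print(f"{name:<13}{skewness(data):>7.2f}")
```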
When the data are normally distributed what are the most appropriate measures of location and spread?
When the data are normally distributed the mean and standard deviation are the most appropriate measures of location and spread.
The standard deviation tells us about the spread of the data because, if the data are normally distributed, approximately 68% of the readings fall within 1 SD either side of the mean and approximately 95% of readings lie within 2 SDs of the mean.
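These proportions can be checked against the standard normal distribution using Python's `statistics.NormalDist` (available from Python 3.8). The variable names are chosen for the example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of readings within 1 SD and within 2 SDs of the mean.
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(f"within 1 SD: {within_1_sd:.1%}")  # roughly 68%
print(f"within 2 SD: {within_2_sd:.1%}")  # roughly 95%
```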
When data are not normally distributed (skewed) what are the most appropriate measures of location and spread?
If data are skewed, we usually present the median and IQR as they are less influenced by very high and very low values.
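A small made-up example shows why: adding a single very high value shifts the mean considerably but barely moves the median. The data and variable names are hypothetical.

```python
from statistics import mean, median

# Hypothetical values, roughly symmetrical.
values = [21, 23, 24, 25, 26, 27, 28]

# The same data with one very high value added, giving a long right tail.
with_outlier = values + [250]

print(f"without outlier: mean = {mean(values):.1f}, median = {median(values)}")
print(f"with outlier:    mean = {mean(with_outlier):.1f}, median = {median(with_outlier)}")
```

The mean more than doubles once the outlier is included, while the median changes only from 25 to 25.5.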