Unit III - Module 6 - Basic Data Analysis Principles Flashcards

(103 cards)

1
Q

After you have collected and normalized your data, the next step is analyzing it

A

This module
will focus on the basic principles and tools of data analysis, both numerical and graphical, the
application of which is a crucial first step in developing a cost estimate or risk analysis.

2
Q

Data Analysis Overview
Key Ideas

Visual Display of Information
Central Tendency of Data
Dispersion (Spread) of Data
Data accumulation
Outliers

A
3
Q

Data Analysis Overview
Analytical Constructs

Descriptive statistics
* Mean, median, mode
* Variance, std deviation, CV
* Functional forms

A
4
Q

Data Analysis Overview
Practical Applications

Making sense of your data

A
5
Q

Data Analysis Overview
Related topics

Parametric
Distributions
Probability and Statistics

A
6
Q

Data Analysis Within The Cost Estimating Framework

Past: Understanding your historical data

Present: Developing estimating tools
(Average cost; Mean = $34.19)

Future: Estimating the new system
(Confidence Intervals; Confidence Interval = +/- $5.76)

A

7
Q

Here we present univariate cost data, in this case, monthly natural gas bills for a condominium over
about a six-year period.

The past data that have accumulated are displayed as a histogram, which is
a common way to show the density of univariate data. (This and other methods, such as the box
plots and stem-and-leaf graphs shown in the Related and Advanced Topics section, are all
essentially variants of plotting numbers on a number line.) We will revisit this particular graph, but
the standard Excel labels for histograms indicate that the data points in each bar are less than the
value shown on the x-axis, so that the bar labeled 30 gives the frequency of gas bills that are greater
than or equal to $15.00 but less than $30.00.

A

In the present, we develop estimating tools, in this case, a simple average: about $34.19 per month.
To apply this average for estimating or budgeting purposes, we would like to get a sense of its
precision, which we can obtain by calculating a confidence interval.

In this case, we see that the true
mean of future monthly gas bills is very likely to be within plus or minus $5.76 of the calculated
sample mean. Note, however, that most individual observations are outside this interval: the hot
summer months are much cheaper, and the cold winter months are much more expensive. Thus,
knowing the average precisely helps us (for household budgeting purposes, say) only if we have a
cost-smoothing deal with our utility company. If we want to accurately predict individual bills from
month to month, we need to seek out a cost-driver variable (such as mean monthly temperature).

This would lead to a set of bivariate graphs such as those shown on the Cost Estimating Framework
slide in Module 8 Regression Analysis.
Note that estimating tasks are by definition always conducted in the present. We don’t travel back in
time to collect data, nor forward in time to make our estimates. This “triptych” Cost Estimating
Framework is simply meant to remind us that data, the basis for our estimates, always originates in the past.

8
Q

Data Analysis Outline

Core Knowledge
* Types of Data
* Univariate Data Analysis
* Scatter Plots
* Variables
* Axes and Function Types
* Data Validation
* Descriptive Statistics
* Outliers
* Rules of Thumb

Summary

Resources

Related and Advanced Topics

A

This module will cover various types of data sets and the functional relationships that may be present
therein; how to uncover these relationships by scatter plotting the correct variables on the correct set
of axes; and how to perform elementary data validation by examining descriptive statistics,
appropriately treating potential outliers, and applying rules of thumb. As with all modules, we’ll
conclude the Core Knowledge section with a summary and present resources for reference and
further study.

9
Q

Types of Data

Univariate
Bivariate
Multivariate
Time Series

The first step in data analysis is to think about what type of data you have. While we present this as
a linear process, you’ll need to revisit this step each time you get new data. We’ll cover univariate,
bivariate, multivariate, and time series data sets.

A

Univariate
‐ Single variable
‐ Use descriptive and inferential statistics

Bivariate
‐ One independent variable and one dependent
variable (i.e., y is a function of x)
‐ Use descriptive and inferential statistics

Multivariate
‐ Several independent variables and one dependent
variable (i.e., y is a function of x1, x2, and x3)
‐ Use descriptive and inferential statistics

10
Q

Univariate data consists of a single variable, such as cost data for a single element or a set of
historical cost growth factors for various programs in a given phase. It can be displayed graphically
using histograms (shown), stem-and-leaf plots, or boxplots.

A

In this case, descriptive statistics (mean,
median, standard deviation, Coefficient of Variation (CV), etc.) should be used to find the central
tendency and dispersion of the data.

Inferential statistics are not as frequently used with univariate data, but can be used to assess whether the data set seems to match a certain mean, variance, or
distribution.

Note that true univariate data sets are quite rare in cost estimating, but we start with them in our first-principles approach. As soon as you ascribe categories to data or add other metadata (such as the year of the costs), you are arguably in the realm of bivariate or multivariate
data.

11
Q

Bivariate data has one independent variable and one dependent variable. For example, software
development cost as a function of the number of lines of code. It is generally displayed using a
scatter plot (shown).

A

In this case, both descriptive and inferential statistics (regression, t and F statistics, etc.) should be used. Descriptive statistics are calculated to find the central tendency and dispersion of the dependent variable.

Then, inferential statistics are used to test the relationship between the independent and dependent variable, e.g., is the number of lines of code a good
predictor of software development cost?

12
Q

Multivariate data has several independent variables and one dependent variable. For example, the
cost of supplies as a function of both crew size and hours underway. It can be displayed using a 3-D
plot (shown) or pairwise plots of the dependent variable against the various independent variables.
In this case, both descriptive and inferential statistics should be used.

A

As with the other data types,
descriptive statistics give an idea of the central tendency and dispersion of the dependent variable.
Inferential statistics – such as multiple regression – are used to test the relationship between the
independent variables and the dependent variable, e.g., do crew size and hours underway together
give a good prediction of supplies cost?

13
Q

Types of Data

Time as the independent variable
* Interval matters! Make sure you use an XY (Scatter) and
not a Line Chart in Excel unless intervals are equally
spaced

Smooth trends are rarely found in time series

A

Possible rare exceptions (e.g., corrosion over time)

“Standard” trends such as investment and inflation
Look for paradigm shifts, cycles, autocorrelation

Use moving averages, divide data into groups and
compare descriptive statistics

Regression is often not useful, as it only picks up
smooth trends (unless AR1/ARIMA models are used)

ANOVA and mean comparisons are more useful

14
Q

Time series data are quite different from univariate, bivariate, and multivariate data, and thus, a
somewhat different approach must be used.

Time series data are generally bivariate data with time
(in the form of year, quarter, month, etc.) as the independent variable: for example, cost growth as a
function of the year of program initiation, or worker productivity measured by quarter.

(For examples
like the latter, tracking production metrics over time, see Module 11 Manufacturing Cost Estimating.)

Time series data can also be used to check CERs by including a time variable; time data can reflect
different points in the life cycle of a system. As with any bivariate data set, you can plot it on a standard
Cartesian xy-plane, but the time intervals must be plotted correctly on the x-axis.

In Excel, you should generally choose the XY (Scatter) chart type instead of Line; the latter may treat
the x-axis as Category instead of Time-scale, which forces even spacing of your data points. If your
data are equally spaced, you may choose to use a line chart.

A

Unlike other data types, we rarely see smooth trends in time series data. Something like corrosion
over time may be an example of a rare exception. There are also expected “standard” trends, such
as the fact that investments and inflation generally follow an exponential function over time due to the
compounding of rates (and one would hope that the former outpaces the latter!). Instead, we usually
see evidence of paradigm shifts, cycles, or autocorrelation. A paradigm shift would show a marked
change in the nature of the data occurring at some point or over some period. An example of a
paradigm shift is finding lower cost growth in programs entering production after 1986 (at the end of
the Reagan ramp-up period) than before 1986.

Since regression only picks up smooth trends, it will not detect shifts and cycles and therefore is
often not a useful tool in time series data. Instead, we should use scatter plots and moving averages
to look for possible paradigm shifts and cycles. In addition, we can divide the data into subgroups
(e.g., 1980-85, 1985-90, 1990-95, 1995-2000, and 2000-05) and compare descriptive statistics.
Analysis of Variance (ANOVA) can be used to test for significant differences between subgroups.
AR1/ARIMA (AutoRegressive Integrated Moving Average) models forecast a value in a time series
as a linear combination of past values, past errors, and current and past values of other time series.
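A rough illustration of the moving-average and subgroup approach described above (not from the original module; the data below are synthetic, and the series was constructed with a notional shift built in):

```python
import numpy as np

# Synthetic yearly cost-growth factors, 1980-2005, with a notional
# "paradigm shift" after 1986 (illustrative values only)
rng = np.random.default_rng(0)
years = np.arange(1980, 2006)
cgf = 1.25 + 0.08 * rng.standard_normal(years.size)
cgf[years > 1986] -= 0.15

# A 5-year moving average smooths the noise so shifts and cycles stand out
window = 5
moving_avg = np.convolve(cgf, np.ones(window) / window, mode="valid")

# Divide the data into 5-year subgroups and compare descriptive statistics
for start in range(1980, 2005, 5):
    grp = cgf[(years >= start) & (years < start + 5)]
    print(f"{start}-{start + 4}: mean={grp.mean():.2f}, sd={grp.std(ddof=1):.2f}")
```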

15
Q

Univariate Data Analysis

Visual Display of Information
* Histogram, stem-and-leaf, box plot
What does it look like?

Measures of Central Tendency
* Mean (or median or mode)
What’s your best guess?

Measures of Variability
* Standard deviation (or variance),
coefficient of variation (CV)
How much remains unexplained?

Measures of Uncertainty
* Confidence Interval (CI)
How precise are you?

Statistical Tests
* t test, chi square test, Kolmogorov-Smirnov (K-S) test
How can you be sure?

A

This analysis framework is mirrored in
bivariate and multivariate analysis.

When conducting univariate data analysis, the following questions should be addressed (preferably in
the provided order). We will address methods used to answer these questions in the first section of
this module.

*What does it look like? Some useful visual displays in univariate data analysis are histograms,
stem-and-leaf plots, and box plots. Histograms, the most common graphical tool in univariate data
analysis, are discussed in the slides that follow; an explanation of stem-and-leaf plots and box plots
can be found in the Related and Advanced Topics section of this module.

*What’s your best guess? Measures of Central Tendency, such as the mean, median, or mode, are
useful when describing your “best guess” or probable outcome of a dataset. These measures are
single points used to represent the total data set.

*How much remains unexplained? You must also address the variability around your point
estimate (i.e., your mean). Common tools for measuring the variability include the standard deviation
(or variance) and the coefficient of variation.

*How precise are you? Confidence intervals are used to measure the certainty (or uncertainty)
around your point estimate. Confidence intervals are introduced in this module, and are discussed in
depth in Module 10 Probability and Statistics.

*How can you be sure? Statistical tests can be conducted with univariate data sets. Some tests are
introduced in this module; the “how” is left for Module 10 Probability and Statistics.

16
Q

Visual Display - Histograms

Histograms should be used to give an idea of the distribution of the data

Skew-right distribution, possibly
Exponential, Triangular, or Lognormal

Tip: Create the histogram manually using chart type
Column so that results update when the data change

A

One useful type of graph is the histogram. Here we have graphed six years’ worth of personal
monthly gas bills, measured in dollars (as first seen on the Cost Estimating Framework slide earlier).
This data, and the corresponding data on gas used per month (measured in units called therms), will
show up occasionally throughout this module. Though we would not use this specific data in day-to-
day cost estimating, we find it helpful here, since characteristics of these data can be used to
illustrate several of the topics we will explore (for example, time series and identification of outliers).

Histograms group data into several “bins” and plot the bins on the horizontal axis with the frequency
(or relative frequency) on the vertical axis. That is, the vertical column displays the number of
observations (or percentage of observations) that fall within that bin. By convention, the bin labels
indicate the upper end of the bin, so that the first bar represents the number of monthly bills less than
$15.00, the second bar between $15.00 and $30.00, and so on. Histograms give a good sense of the
distribution of the data, since they are essentially depictions of an empirical probability density
function (pdf), and can be useful in identifying potential outliers.

In the above histogram, a skew-right distribution is evident, leading us to investigate fitting a
triangular or lognormal (or even exponential) distribution to the data. The data points on the far right
are possible outliers.

Histograms are the primary means for cost estimators to visually display univariate data (and, as we
shall continue to emphasize throughout this module, it is extremely important to look at your data).
There are other possible methods, such as the stem-and-leaf plots and boxplots that are shown in
the Related and Advanced Topics section. Remember, of course, that monthly data is, strictly
speaking, not univariate – it has cost and month!
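A quick sketch of how such a histogram could be tabulated outside of Excel (synthetic skew-right data standing in for the gas bills; the bin width and values are illustrative only):

```python
import numpy as np

# Synthetic skew-right "monthly gas bill" data (74 months, illustrative only)
rng = np.random.default_rng(1)
bills = rng.lognormal(mean=3.4, sigma=0.5, size=74)

# Excel-style bins: each bin is labeled by its upper end, width $15
edges = np.arange(0, bills.max() + 15, 15)
counts, _ = np.histogram(bills, bins=edges)
for upper, count in zip(edges[1:], counts):
    print(f"< ${upper:5.0f}: {'#' * count}")
```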

17
Q

Visual Display – Histograms and their bins

  • It is important to carefully consider the number of bins used in a histogram
  • Experiment with intervals to be sure you understand the data
A

It is necessary to choose bins carefully. The two histograms shown use the same data set, but
different bin sizes. The histogram on the left allows Excel to automatically choose the number and
size of bins. Almost all of the data ends up in one bin, so we do not get a good idea of the
distribution.

In the histogram on the right, the analyst specified the number and size of bins to be
used. Here, the distribution is clearly skewed right with a potential outlier. In this case, however, the
lowest bin specified is $15, so we lose the fact that there is only one bill less than $11.15, and it is no
longer clear that the distribution has a “hump.” The bins on the right hide or distort, because they
create the impression that the distribution is exponential, when in fact, the data show a void under
some value … in other words, there are few-to-no near-zero months. If the distribution were truly
exponential, as the right histogram suggests, lower values around zero would be the most common.
As demonstrated in this example, poor choices of histogram bins can hide important information!
“Sometimes a beautiful graph is an orchid, sometimes it’s a Venus Flytrap.”

Because statistical samples have an inescapable random element, there is always a tradeoff in the
numbers of bins between texture (more bins) and smoothness (fewer bins). In general, you should
play around with the intervals for your bins to get a sense of the data before settling on the final
display. Follow the link to the Related and Advanced Topics section for some possible rules for
determining number of bins and bin width.
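Two commonly cited rules of thumb for the number of bins (these may or may not be the ones given in the course's Related and Advanced Topics section) are Sturges' rule and the square-root rule:

```python
import numpy as np

n = 74  # sample size, matching the gas-bill example
sturges = int(np.ceil(1 + np.log2(n)))   # Sturges' rule: 8 bins here
sqrt_rule = int(np.ceil(np.sqrt(n)))     # square-root rule: 9 bins here
```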

18
Q

Central Tendency - Mean

The mean is the Expected Value of a random variable

In Excel, use the “AVERAGE( )” function

Means of example data sets:
Gas bill (74 months), $26.52
Therms used (74 months), 14.8

A

Throughout our discussion of descriptive statistics, we will refer to sample statistics (here, we
calculate the sample mean). Module 10 Probability and Statistics will further discuss the relationship
between your sample and the population from which it comes. Because we can never have perfect
knowledge of a population, our statistics generally represent best estimates of population parameters
based on our sample.

The arithmetic mean or average of a data set is simply the sum of the data values divided by the
number of data points. In this case, we add up the costs of all the bills (and corresponding therms
used) and divide by the number of bills, 74.

You can use the =AVERAGE() function in Excel, which
automatically skips blanks in a range of cells.

Follow the link to the Related and Advanced Topics section if you want to learn a mental math trick
for more easily calculating means (or at least approximations) in your head.

Recall that the arithmetic mean (AM) is distinct from the geometric mean (GM) and harmonic mean
(HM) introduced in Module 5 Inflation and Index Numbers.

19
Q

Central Tendency - Median

The sample median is the “middle” data point, with 50% of the remaining observations falling below that
point, and 50% above
If a data set has an odd number of points, the middle value is the median
* The median of the data set {2,5,7,9,25} is 7
If a data set has an even number of points, the two middle values are averaged
* The median of the set {3, 6, 8, 11, 13, 30} is 9.5 (average of 8 and 11)

A

In general, the kth percentile is the point with k% of the data below and (100-k)% of the data above
* Quartiles (25, 50, 75), deciles (10, 20, …, 80, 90), icosatiles (5, 10, 15, …, 95)

When there are extreme data points, the median may be more representative than the mean, because outliers impact the mean more than the median (the median is more robust)
* “Representative” is a descriptive term, not a mathematical term
* There are many mathematical reasons to prefer the mean over the median
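The examples above can be checked directly (a minimal sketch; NumPy's default percentile method uses linear interpolation):

```python
import numpy as np

print(np.median([2, 5, 7, 9, 25]))          # 7.0 (odd count: middle value)
print(np.median([3, 6, 8, 11, 13, 30]))     # 9.5 (even count: mean of 8 and 11)
print(np.percentile([2, 5, 7, 9, 25], 25))  # 5.0, the first quartile
```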

20
Q

Mean, Median, and Skew

The mean and the median are equal if the distribution is symmetric

Unequal means and medians are an indication of skewness

Median < Mean
Skew(ed) Right

Median = Mean
Symmetric

Median > Mean
Skew(ed) Left

A

If the distribution or data set in question is symmetric, the mean and the median will be equal (as
illustrated by the normal distribution in the center of the slide). While the other two relationships
illustrated above are not unfailingly true, they generally hold for the “regular” types of distributions
common in cost estimating and risk analysis. (Check out the Wikipedia article on skewness for the
“legal disclaimer.”) Inequality of mean and median is a general indication that the distribution or data
set is skewed.

If the median is lower than (to the left of) the mean, as illustrated by the lognormal
distribution on the left, this is an indication that the distribution or data set is skewed right or skew
right, since it stretches out to the right.

A median higher than (to the right of) the mean, as illustrated
by the beta distribution on the right, is an indication that the distribution or data set is skewed left or
skew left. Notice that the direction of skew follows the “tail” of the data and not the “hump”.
In these continuous distributions, the blue median line splits the area under the curve exactly in half.
The mean, shown in red, is the point at which the x-axis would balance if the pdf indicated its linear
density.

The mode, which we’ll introduce next, is the x-value at which the peak of the distribution
occurs. Note that for unimodal distributions such as the ones shown, the mode falls on the opposite
side of the median from the mean.

21
Q

Central Tendency - Mode

The sample mode is the most frequent point to occur in a data set
* The mode of a distribution is its peak
* Value with the greatest probability mass (or density)

The mode of the set {2,4,4,7,9,9,9} is 9

The mode is a descriptive metric answering the question “what happens most frequently?”
* It can help give a visual idea of what the distribution looks like
* Most useful in discrete data

A

The mode is defined as the most frequently occurring point. This is the least used of the three
measures of central tendency. The measure is used to answer the question “what happens most
frequently?” so it’s only useful when the mode is fairly common. Note that the mode is simply a
plurality, not a majority. If the mode is a point that only occurs three times in a set of 100 data points,
it’s not really common enough for us to expect it to happen. More often we look at the modal bin for a
histogram, indicating the most likely range of values when compared with other adjacent non-
overlapping intervals of equal width.

The mode of a distribution is its peak, or the value where it attains its greatest probability mass (in the
discrete case) or density (in the continuous case). More on that in Module 10 Probability and
Statistics.

The mode is often most useful in discrete sets, particularly qualitative or categorical ones. For
example, the “mean color” of cows is useless, but the mode may be black-and-white piebald. This is
a most-frequent and discrete-case example, which makes the mode a good measure to use. A
quantitative discrete example would be a roll of a pair of dice, where the sum of seven (7) appears
most frequently, a key fact for you craps players out there! In continuous distributions, since the
probability of any point is zero, the “most common” idea is less useful: what you “expect” out of a
continuous random variable is the mean (the “expected value”), not the mode (the humped part of the
pdf).

The mode can be a good parameter to describe a distribution; it helps to get a sense of what the
picture looks like. The bottom line is that the mode is a visual parameter, one way or the other, not a
mathematically useful parameter. It is most useful in discrete data, and most useful when the modal
value occurs highly frequently.
22
Q

Variability –
Variance / Standard Deviation

The sample variance measures the deviation of the data points from their mean

In Excel, use the “VAR( )” function

The sample standard deviation is simply the square root of the variance

The standard deviation is expressed in the same units as the original data
* In Excel, use the “STDEV( )” function

A

The variance is the average squared distance of the data points from their mean; it is a measure of
the spread of a distribution. A lower variance indicates less dispersion (tighter data).

The standard deviation of a distribution is simply the square root of the variance and measures the absolute
distance of the data points from their mean. When the exact population distribution is not known,
which is always the case in practical applications, you can find the variance of the sample; the
sample standard deviation is again the square root of the variance.

When we find the sample variance, we divide the sum of the squares of the distances from the mean by (n-1) instead of n. This
is because the variance and standard deviation measure the distance from the mean, which is itself
calculated from the data; if we have collected n data points, the number of data points that can vary
independently from the mean is (n-1), since the nth data point is exactly determined by the mean and
the values of the other data points. This idea is called “degrees of freedom.” This denominator also
ensures that the sample variance is an unbiased estimator of the population variance. In
particular, because the sample mean minimizes the sum of squared deviations (an exercise for the
student!), the numerator is almost certainly smaller than if the true population mean were used, so
the denominator needs to be adjusted accordingly.
Note that while the formula on the left is easier to remember – it is essentially the definition – the
formula on the right is easier to calculate, since it involves fewer computational operations (squaring
each data point instead of having to take the delta from the mean first and then square it). In fact,
you can compute it with a simple two-column table, with the xs and the x squareds, presaging the
four-column table that we’ll see in Module 8 Regression Analysis. We recommend that you know
both formulae (and be able to derive the latter from the former in a pinch!).
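The two formulas referenced above are not reproduced in this text; the standard definitional form ("the formula on the left") and computational form ("the formula on the right") of the sample variance are:

$$
s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2
\;=\; \frac{\sum_{i=1}^{n}x_i^2 \;-\; n\bar{x}^2}{n-1}
$$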

The units of measure for variance are squared units, which is not a useful measure. Because of this,
the standard deviation is often the statistic reported as the measure of the spread of a data set. In
our earlier example (therms of natural gas used), the variance would be reported in squared therms;
the standard deviation would be reported in therms.
Variance and standard deviation will be revisited in Module 10 Probability and Statistics. For now, try
to remember that sample statistics, like s, are generally the estimators of the corresponding
population parameters, usually denoted by the counterpart Greek letter. In this case, the sample
standard deviation, s, is the estimate for the population standard deviation, sigma (σ).
23
Q

Variability - Coefficient of Variation

The Coefficient of Variation (CV) expresses the standard deviation as a percent of the mean

Tip: Low CV indicates less
dispersion, i.e., tighter data.
15% or less is desired

Large CVs indicate that the mean is a poor estimator
* Consider regression on cost drivers
* Examine data for multiple populations (outliers)

CVs of example data sets:
* Gas bill, 74.4% (69.2%)
* Therms used, 104.2% (102.5%)

Note that sums and averages tend to have smaller variances

A

The coefficient of variation (CV), usually expressed as a percentage, is a measure of the size of the
standard deviation relative to the mean. This descriptive statistic is unit-less and therefore allows an
analyst to compare the variability across distributions.

In practice, a low coefficient of variation (say, 5%) would indicate that the average (mean) of the cost
data is a useful description of the data set. On the other hand, if the CV is much higher (say, greater
than 15%), there should be a cost driver in the data set that causes the cost to vary. This should
prompt the analyst to develop CERs in order to find the cost driver. If after running CERs the
coefficient of variation is not significantly reduced, you may have incorrectly identified the cost driver.

It is important to keep in mind that some data are inherently noisier than others, so the lack of noise
reduction may be the fault of the data rather than a misidentification of the cost driver. The analogous
calculation of the CV for CERs is presented in Module 8 Regression Analysis.

The CVs for the sample data set are shown, with the first number being the CV of all the individual
months and the second number (in parentheses) being the CV of the 12 monthly averages. This
illustrates the principle that sums and averages tend to have relatively smaller variances, as indicated
by the CV. In this case, the CVs aren’t decreased much, since it is the month-to-month variation in
temperatures that is driving the change in demand for gas, not the year-to-year variations for each
given month.

It is also interesting to note that the variation in demand (therms) is greater than the variation in cost.
This implies that the unit price for gas must be fluctuating in opposition to demand (negative
correlation) to provide a damping effect of sorts. This runs contrary to what we learned in Econ 101:
increased demand is supposed to drive prices up, not down!
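A small synthetic demonstration of the principle that averages have smaller CVs than the individual observations they summarize (the data and magnitudes here are made up, not the module's gas data, so the shrinkage is larger than in the card's example):

```python
import numpy as np

rng = np.random.default_rng(2)
monthly = rng.lognormal(mean=3.4, sigma=0.5, size=(6, 12))  # 6 years x 12 months

def cv(a):
    return a.std(ddof=1) / a.mean()

print(f"CV of all individual months: {cv(monthly.ravel()):.1%}")
print(f"CV of the 12 monthly averages: {cv(monthly.mean(axis=0)):.1%}")
```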

24
Q

Dispersion and CV

These two data sets have the same mean, but different standard deviations

This data has a lower CV (17%)
and is more tightly distributed

This data has a higher CV
(38%) and has more
dispersion

A

It is always important to look at the dispersion present in the data. Two different data sets may have
the same mean, but much different spreads. The data sets shown here illustrate this point. Both
sets have the same mean. However, the data on the left is more tightly distributed around the mean,
while the data on the right shows more dispersion. While both data sets have the same expected
value (mean), we would expect a significantly wider range when predicting based on the right-hand
data.

25
Confidence Interval Illustration

A confidence interval (CI) suggests to us that we are (1-α)·100% confident that the true parameter value is contained within the calculated range

A confidence interval or CI tells us, roughly, that we are (1-α)·100% confident that the true parameter value is contained within the calculated range. The end points of the interval are equivalent to the critical values calculated for a two-tailed hypothesis test. There is a one minus alpha probability of the true parameter being within the two critical values and an alpha chance of being above or below them. For example, a 95% confidence interval has a 95% chance of containing the true value of the parameter, and a 2.5% chance each of being too low or too high.

Two-tailed confidence intervals center on the best estimate of a parameter – in this case, the mean cost of a data set – and follow the general form of mean plus or minus a defined number of standard errors. The number of standard errors is driven by the desired confidence level (i.e., whether you are 90% confident, 95% confident, etc.). In this case, the best guess for the population mean is the sample mean, x-bar, and the standard error of this estimate is the sample standard deviation divided by the square root of the number of data points (because we always know the mean more precisely than we know any given data point). These are both called out on the graph. The number of standard errors is given by the critical value from a t distribution with n-1 degrees of freedom. In the illustrated case, we had to go a little bit farther than three standard deviations out into the tails of the bell-shaped t to capture 95% of the probability (alpha = 0.05).

Confidence intervals can be computed for many different parameters and distributions. Confidence Intervals will be revisited in Module 8 Regression Analysis and Module 10 Probability and Statistics.
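Written out, the general form described above (a standard result; the slide's own notation is not reproduced here) is:

$$
\bar{x} \;\pm\; t_{\alpha/2,\;n-1}\cdot\frac{s}{\sqrt{n}}
$$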
26
Sample Sizes - Sufficiently Large n

In general, we prefer n to be large … how large is a function of our tolerance for error
* The 68.3% CI for the mean is roughly CV/√n

So, for CVs ranging around 30%, we get the following 68.3% confidence intervals:

n      +/-
> 4    15%
> 9    10%
> 16   8%
> 25   6%
> 36   5%

If we would like to be able to make judgments within about 5 percentage points with a CV of 30%, we need n ≈ 36
* We may have no choice but to deal with small n
* In any case, we can calculate the range of the estimated mean
It is very important to look at the size of your data set. Statistically significant differences among data sets cannot be shown with very small sample sizes. The sample size needed is a function of the underlying dispersion present in the population and of our tolerance for error.

The Central Limit Theorem (informally) states that, for a random sample of size n, the sampling distribution of a sample statistic (most notably, the sample mean) can be approximated by a normal distribution as the sample size becomes large. From a practical standpoint, the natural question is “how large?” You may have heard before that your sample size must be of at least size 30. It is important to recognize that 30 is not a “magic number”; it is simply the sample size at which point most sampling distributions approximate a normal distribution (as determined by statistical research). Populations with small tails or little skewness won't require a sample that large; sampling distributions from highly skewed or heterogeneous populations require a greater sample. The Central Limit Theorem is discussed further in Module 10 Probability and Statistics.

The mean of a sample of size n from a normally-distributed population is normally distributed with the same mean and a standard deviation equal to that of the population divided by the square root of n. Since the normal distribution has 68.3% of the probability density within one standard deviation of the mean, this leads to a nice rule of thumb: if your data is roughly normally distributed, an approximate 68.3% Confidence Interval (CI) for the mean, expressed as a percent, is the coefficient of variation divided by the square root of the sample size (CV/√n).

Suppose we have a CV of around 30%. Then a sample size of 4 would give a 68.3% confidence interval of +/- 15%. That is, 68.3% of the time you will be within 15% of the true population mean. If we would like to make judgments within about +/- 5%, say, with a CV of 30%, we would need a sample size of about 36. In practice, we may be forced to deal with smaller data sets – for certain commodities there may not be that many different kinds of analogous systems in the universe! – and in many cases the data set will be small and out of the analyst’s control. In any case, we must always be aware of the range of error (cf. Confidence and Prediction Intervals in Module 8 Regression Analysis).
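The rule of thumb can also be inverted to size a sample: since the approximate 68.3% CI is CV/√n, the n needed for a target half-width is

$$
\frac{\mathrm{CV}}{\sqrt{n}} \le \text{target}
\quad\Longrightarrow\quad
n \ge \left(\frac{\mathrm{CV}}{\text{target}}\right)^{2},
\qquad \text{e.g., } \left(\frac{30\%}{5\%}\right)^{2} = 36.
$$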
27
Prediction Intervals

The previous confidence interval illustration gives the true average cost within a certain range
If we want to know the predicted cost of a new item within a certain range, we need a prediction interval
The PI suggests to us that we are (1-α)·100% confident that the next observation will be contained within the calculated range
The larger standard error in the PI accounts for both the uncertainty in the mean (captured by the CI) and the uncertainty in individual observations
A prediction interval or PI is used to predict the cost of a new item within a certain range. A prediction interval is a type of confidence interval that predicts the distribution of individual future points. The formula shown on this slide is used to calculate prediction intervals.

Note that if we square the standard error from the CI (s/√n), add it to the sample variance (s²), and take the square root, we get the standard error for the PI! (This sort of Pythagorean relationship is common in calculations related to statistical variation.) The first term accounts for our uncertainty in estimating the mean, and the second accounts for the natural variation in observations.

Prediction Intervals will be revisited in Module 8 Regression Analysis.
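The "Pythagorean" combination described above, written out (a standard form, not copied from the slide):

$$
\bar{x} \;\pm\; t_{\alpha/2,\;n-1}\cdot\sqrt{\frac{s^2}{n}+s^2}
\;=\; \bar{x} \;\pm\; t_{\alpha/2,\;n-1}\cdot s\sqrt{1+\frac{1}{n}}
$$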
28
Statistical Tests

t test for mean
* Is the Cost Growth Factor (CGF) for NAVAIR programs different than 1.0?

Chi square test for variance
* Is 30% a reasonable CV to use for this variable? Should a t test for equal means assume equal variances?

Chi square test for distribution
* Are Line-Replaceable Unit (LRU) failures uniform across all deployed units?

Kolmogorov-Smirnov test for distribution
* Is the normal distribution appropriate for modeling uncertainty in design weight?
Statistical tests may be performed on univariate data. To determine the type of test to perform, you should first consider the question you are trying to answer. For example:

*Is the Cost Growth Factor (CGF) for NAVAIR programs different than 1.0? Here, you should conduct a (one-sample) t test.

*Is 30% a reasonable CV to use for this variable? A chi square test for variance would be appropriate here.

*Are Line-Replaceable Unit (LRU) failures uniform across all deployed units? A chi square test for distribution (goodness of fit test) would be appropriate to determine the fit of the observed data distribution to a specified data distribution, often a uniform one.

*Is the normal distribution appropriate for modeling uncertainty in design weight? You could use a Kolmogorov-Smirnov (K-S) test to determine whether or not your data are normally distributed. The K-S test is considered to be a more robust test than the chi square goodness of fit test.
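A minimal sketch of two of these tests in Python (the data are synthetic, SciPy is assumed available, and the fitted-parameter K-S test shown is only approximate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
cgf = rng.normal(loc=1.15, scale=0.2, size=20)  # hypothetical cost growth factors

# One-sample t test: is the mean CGF different from 1.0?
t_stat, p_value = stats.ttest_1samp(cgf, popmean=1.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# K-S test against a normal fitted to the sample (approximate, since the
# parameters were estimated from the same data being tested)
ks_stat, ks_p = stats.kstest(cgf, "norm", args=(cgf.mean(), cgf.std(ddof=1)))
print(f"KS = {ks_stat:.2f}, p = {ks_p:.3f}")
```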
29
Scatter Plots

Variables
Actions
Function Types
After you have characterized your data set, the next step in data analysis is to scatter plot. Scatter plots provide the analyst with uncanny insight and a wealth of important information. In this section, we’ll discuss which variables you should plot on which kind of axes, and what types of functional relationships you might hope to uncover.
30
Scatter Plots

A picture is worth a thousand words!
* A scatter plot can reveal a wealth of information about relationships present in the data

Create scatter plots in Excel by using the Chart Wizard – XY (Scatter)

Add a trend line in Excel by right clicking the plotted data and choosing Add Trendline
* Helps link graph and equation
* Look at inferential statistics later

Scatter plots are the single most useful tool in all of analysis … they are “the gift of sight” to the analyst
Scatter-plotting is a crucial step in data analysis. After thinking about what type of data you have (univariate, bivariate, multivariate, time series), you should scatter plot the data. For all but univariate data, scatter plots are essential for getting a visual idea of what relationships are present in the data. These plots can help to detect trends in the data, possible shifts, potential outliers, etc.

Scatter plots can easily be created in Excel by using the Chart Wizard and choosing XY (Scatter). Trend lines can be added to help link the graph and the regression equation. When you add your trend line, be sure to choose the options to display the R² and regression equation. Remember that relationships must be tested for statistical significance, but this can wait until a bit later on. For now, the scatter plot will give us “the gift of sight.”

Always do “meta analysis” (analysis of analysis). Many beautiful facts emerge from meta analyses, as well as many error discoveries. Meta analysis is one of the hallmarks of excellence. Graphics are the best form of both analysis and meta analysis.
31
Scatter Plots - Variables

Plot cost (or other variable of interest, e.g., hours) as the dependent variable

Look at a variety of different independent variables
* Technical parameters such as weight, lines of code, etc.
* Performance parameters such as speed, accuracy, etc.
* Operational parameters such as crew size, flying hours, etc.
* Cost of another element

Think about which variables you believe should drive cost and collect that data!
What types of variables should be used when creating scatter plots? Generally, the cost of the element you are estimating should be treated as the dependent variable and graphed on the y-axis. You should consider a variety of different independent variables (graphed on the x-axis). For example, technical parameters that describe the system; operational parameters that indicate how the system will be operated; or the cost of another element that may influence or be correlated to this one. Thus, in a series of two-dimensional (2-D) graphs, the cost of the element to be estimated may be plotted against several independent variables. Resist the temptation to do this willy-nilly, ad infinitum! For now, look at as many independent variables as you believe to be possible cost drivers. When you are choosing a Cost Estimating Relationship (CER), you should pick relationships that are statistically significant (see Module 8 Regression Analysis) and make good engineering sense.
32
Scatter Plots – Cost Drivers

Scatter plots can help identify cost drivers

R² interpretation: % of variation in y explained (linearly) by variation in x
Scatter plots are an important first step in identifying the most relevant cost drivers. (Cost drivers were first discussed in Module 3 Parametric Estimating.) The three scatter plots shown here graph cost as a function of three different variables. Cost and Variable 1 have the strongest correlation, so Variable 1 is a potential cost driver. Cost and Variable 2 are weakly correlated, and cost and Variable 3 are uncorrelated. Keep in mind that the r-squared value is merely an indicator of the relationship between two variables – t and F statistics must be checked to determine the statistical significance of the relationship (see Module 8 Regression Analysis).

You should be aware that there is no magical “threshold” for r-squared. People will often espouse levels for r-squared above which a regression is useful, but in truth, if a regression is statistically significant, then any r-squared is of some help, because it represents a reduction in the variability with which we can predict the dependent variable. (It turns out there actually is a seldom-used beta test for r-squared in OLS regression!) As we’ll see later, r-squared can also be affected by (apparent) outliers, so be on the lookout for “barbells” on your graph. The bottom line is that r-squared is not the single best indicator of a good CER or cost driver relationship.

You should also know that r-squared values tend to be higher or lower in different fields of endeavor. For example, in medical research and behavioral sciences, r-squared tends to be low because of the irreducible variability (beyond a certain point) of organisms and behavior, whereas in physics and other sciences – where variability can be mostly limited to measurement error, with other effects assumed to be negligible (friction, wind resistance, the curvature of the earth, the gravitational attraction of Pluto, …) – variability can be quite small. Cost estimation is a middle case between the other two cited. Thus engineers are apt to expect high r-squareds even though cost estimation does not usually yield such values.
33
Scatter Plots – Unit Space

Data should first be plotted in unit space
* x is plotted on the horizontal axis (x-axis) and y is plotted on the vertical axis (y-axis)

If the data have a non-linear relationship when plotted in unit space, investigate how the data can be “made” linear
* Non-linear relationships can often be transformed to appear linear through the use of natural logs
* Transformed data can then be regressed linearly
* Before the widespread use of computers, non-linear data was graphed on semi-log or log-log paper
There are several different axes that can be used when scatter plotting. All data can be plotted in unit space (the standard Cartesian xy-plane), and this should always be your first step. In this case, x is the independent variable plotted on the horizontal axis, and y is the dependent variable plotted on the vertical axis. If your data are linear, you can perform all of your analysis in unit space. If your data are non-linear, it is easy to see in unit space that the data do not approximate a line; it is much more difficult to assume non-linear, look at the data in transformed space, and then determine that the data are linear.

We would like to have linear data so that we can use linear regression to find cost estimating relationships (CERs). However, many relationships are non-linear. Fortunately, in many cases, transformations can be made to make the data linear. Then, linear regressions can be performed on the transformed data, and the results transformed back to unit space (see Module 8 Regression Analysis for more detail). Before computers made transformation and graphing quick and easy, non-linear data was usually plotted on semi-log or log-log graph paper.

When exploring data that appear to be non-linear, it helps to understand the types of functions that may be underlying the data. We will focus on power and exponential functions, but other possibilities exist, such as logarithmic and polynomial functions. Even trigonometric functions may prove useful for modeling seasonal or other cyclical effects (cf. the earlier discussion on time series). The appropriate non-linear function may be identified by examining a scatter plot, but it also helps to know why a particular type of function might be used. For example, it is known from marine engineering that the energy consumption of a ship is directly proportional to the cube of its velocity.
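A short sketch of the log transformation idea (the true a and b below are chosen arbitrarily): a power function y = a·x^b becomes the straight line ln y = ln a + b·ln x, so a linear fit in log-log space recovers the exponent.

```python
import numpy as np

a_true, b_true = 3.0, 0.8
x = np.linspace(1, 100, 50)
rng = np.random.default_rng(4)
y = a_true * x**b_true * rng.lognormal(0, 0.05, x.size)  # multiplicative noise

# Linear fit in log-log space: slope estimates b, intercept estimates ln(a)
b_hat, ln_a_hat = np.polyfit(np.log(x), np.log(y), 1)
print(f"b = {b_hat:.2f}, a = {np.exp(ln_a_hat):.2f}")
```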
34
Removing Outliers

Do not remove an outlier from the data without a good reason!
* Doing so removes some of the variation present in history
* Doing so can be a form of “cooking the data”

Good reasons for removing an outlier:
* Program was restructured or divided
* “One of these is not like the others” (e.g., a helo in a set of missile data)

Bad reasons for removing an outlier:
* “Too high”
* “2 standard deviations away from the mean” [!]
Removing outliers without adequate reason will remove some of the variation that is naturally historically present. This can be considered “cooking the data” and should not be done. (In the extreme case, if you iteratively removed the data point which was farthest from the mean, you’d end up with a data set of one with no variation at all. Reductio ad absurdum, indeed!)

However, there are good reasons to remove outliers. Programs that are restructured or divided midstream may be outliers. For example, when looking at data from Selected Acquisition Reports for the F-14A/D, one would want to exclude later reports that are F-14D only. Also, programs that do not match the characteristics of the data set may be removed, e.g., a data point representing a helicopter does not belong in a set of missile data (or, less markedly, an air-to-air missile may not belong in a data set of surface-to-air missiles). In general, outliers may be indicative of some of the “data issues,” such as incomplete data, that we discussed in Module 4 Data Collection and Normalization, and should prompt you to re-ask “What’s in the number?!”

Outliers should never be removed merely because they are judged to be “too high” or “too low.” This is subjective and often incorrect, as we will see when we discuss “Two Cautionary Tales” later in this module. In addition, points that are only two standard deviations away from the mean should not be removed, as we would expect about 5% of the points to be that far away.
35
Outlier Identification Rules

Chauvenet’s Criterion – based on normal distribution properties
Grubbs’ Test – based on normal distribution properties
Dixon’s Q Test – distributional basis unclear; will not detect two approximately equal outliers
IQR-based – can customize k based on choice of distribution, α, and n; for example, in a normal distribution, k = 3 implies that < 5% of points should fall outside the range
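A minimal sketch of the IQR-based rule from the list above (the data are made up; k = 1.5 is the conventional default, and the larger k = 3 mentioned on the card would flag only more extreme points):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k is tunable."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

costs = np.array([8.2, 9.1, 9.8, 10.4, 11.0, 11.7, 25.0])
print(costs[iqr_outliers(costs)])  # flags 25.0 for further investigation
```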
36
Rules of Thumb

Compare your descriptive statistics to historical rules of thumb

Sanity check!

Tip: Comparison to history and cross checks separates the thorough from the sloppy
You should always try to compare your results to historical data. This will add credibility and validation. Results that are vastly different from historical trends should raise a warning flag. Is there a reason that your data should be different from history? If not, you should be concerned about your results. One source of historical rules of thumb is the Standard Factors handbook available from the Naval Center for Cost Analysis (NCCA). Such factors and rules of thumb exist for various services, commodities, and organizations, and we recommend that you inquire locally. Even if rules of thumb or standard factors are not available, a good old-fashioned sanity check will often help ferret out inappropriate methodology or simply an erroneous result. A billion dollars for annual fuel costs of a single aircraft is not a reasonable number – at least not yet!
37
Data Analysis Summary

Steps of basic data analysis:
1. Scatter plot – visual depiction of the relationships in the data
2. Descriptive statistics – calculate the means and CVs
   * If the CV is under 15%, the average may be a sufficient predictor; focus more attention on elements with higher CVs
   * If the CV is over 15%, focus on this element using regression analysis to look for a better predictor than the average (CER development)
3. Look for outliers (data quality check)
4. Compare to history
After thinking about the type of data you have (univariate, bivariate, multivariate, time series), scatter plot the data. Scatter plots are “the gift of sight” to the analyst and will give a good idea of the relationships and trends present in the data. Next, calculate descriptive statistics for the variable of interest (presumably a particular cost element). For CVs under 15%, the simple average may be a good enough predictor. Leave these elements for now and focus on those with CVs of over 15%. Look for a better predictor than the average by using other cost techniques, including a parametric estimate via regression analysis (see Module 8 Regression Analysis). Then, examine the data for outliers. Only remove outliers if you have a good reason to do so! Finally, compare your descriptive statistics to standard factors, rules of thumb, or other historical data wherever possible.
38
If you are examining data to determine a CER to estimate the cost for electronics based on the number of circuits, which of the following terms best describes your data set?
A. Univariate
B. Bivariate
C. Multivariate
B. Bivariate
For each data point, you have two pieces of information: the cost for electronics and the number of circuits. The number of circuits is the independent variable; the cost is the dependent variable. Therefore, this is a bivariate data set.
39
True or False. Regression is often a useful tool in analyzing time series data.
B. False
Time series rarely demonstrate smooth trends. Regression only picks up smooth trends, and is therefore often not useful in time series analysis. With time series, moving averages may be the more useful analysis.
40
Functions of the form y = a·x^b can be plotted on what type(s) of axes?
A. Unit
B. Semi-log
C. Log-log
D. Choices A and B
E. Choices A and C
F. Choices B and C
G. Choices A, B, and C
E. Choices A and C
In unit space, the power trend would appear non-linear (a power curve). In log-log space, data following a power curve should approximate a straight line, with the slope corresponding to the exponent (b) in the original power equation, and the y-intercept corresponding to the natural logarithm of the coefficient (ln a).
41
Suppose your (univariate) data set has a CV of 40%, and you want to make a judgment within about five percentage points. Approximately how many data points do you need?
A. 4
B. 9
C. 16
D. 25
E. 36
F. 49
G. 64
G. 64
CV/√n = 5%
40%/√n = 5%
√n = 8
n = 64
42
The three histograms plot the same data with different bin sizes. All are shown on the same scale. Which histogram gives the best idea of the distribution?
This is somewhat subjective, though 1 and 3 appear to have bin sizes that are too large or too small. Histogram 2 is the "Goldilocks" graph ("just right").
43
Should Program 7 be removed from the data set as an outlier?
Yes
No
No

Average Cost: 19.97
Standard Deviation of Cost: 3.37
Delta: 7.03
Number of standard deviations Program 7 is from the mean: 2.09

As shown above, Program 7 is about 2.09 standard deviations above the mean cost. In a normal distribution, we would expect about 4.55% of all data points to be outside of two standard deviations from the mean. Based on the information presented, we should not remove Program 7 from our data set.
44
How would you treat Program 13? A. Definitely leave the program in the data set B. Definitely remove the program from the data set C. Further investigate the program
C. Further investigate the program

Average Cost: 18.95
Standard Deviation of Cost: 4.74
Delta: 17.45
Number of standard deviations Program 13 is from the mean: 3.69

Program 13 is more than 3 standard deviations from the mean, and is a possible outlier. This may be a transcription error (based on the other data, the cost may be missing a digit) and must be further investigated to determine whether it is a true outlier.
45
Which element seems to be the best cost driver?
A. Personnel
B. Weight
C. Hours of Operation
A. Personnel
Of the three, personnel has the strongest linear relationship, as indicated visually by the tight clustering of points around the predicted line and the high r-squared value. It also seems reasonable that Supplies costs would be most strongly related to the number of personnel manning a system, more so than the weight or operating hours of the system itself. Repair parts, on the other hand, would be expected to be more strongly related to system characteristics.
45
Which of the following statements about boxplots is correct?
A. A boxplot ignores potential outliers.
B. A boxplot is sometimes called a box-and-fingers plot.
C. A boxplot gives a sense of the data spread.
D. A boxplot is also called a stem-and-leaf plot.
C. A boxplot gives a sense of the data spread.
One benefit of a boxplot is that potential outliers are easily identified. The lines on the boxplot are sometimes called "whiskers," and a boxplot is therefore sometimes called a "box-and-whiskers" plot. A stem-and-leaf plot is a completely different visual display from a boxplot.
46
Which of the following elements could best be estimated using a simple average?
A. Food Cost
B. Supplies Cost
C. Repair Parts Cost
A. Food Cost

              Food    Supplies    Repair Parts
Average       9.66    9.83        10.81
Std Dev       0.88    4.02        2.96
CV            9.1%    40.9%       27.4%

The CV is lowest for food, indicating a small amount of variability. Because of this, the mean is a good estimate and a simple average may be used.
47
The linear function is a good first try when looking at new data for which of the following reason(s):
A. Nearly all relationships are linear
B. The linear function is a good approximation of most other function types
C. Many relationships are linear
D. All of the above
E. Choices A and B
F. Choices B and C
G. Choices A and C
F. Choices B and C
Many relationships are linear. In addition, though it is always best to appropriately identify the proper functional form, if the true relationship is linear and the functional form used is non-linear, the misspecification causes greater errors than if the true relationship is non-linear and is approximated by a linear form.
48
Which of the following statements about transforming a power function into log space are true?
A. The function can be plotted on a log-log graph
B. The slope of the function in log space will be equal to the exponent in unit space
C. The function can be plotted on a semi-log graph
D. Choices A and B
E. Choices B and C
F. Choices A and C
G. All of the above
D. Choices A and B
When a power function is transformed into log space, the exponent of the unit-space equation becomes the slope of the line in log space. Exponential functions, not power functions, are appropriately plotted on a semi-log graph.
49
Which of the following choices is the correct weighted average cost?
A. 17.27
B. 18.98
C. 20.13
D. 22.59
C. 20.13
The weighted average ensures that the average appropriately accounts for the quantities associated with each cost. For this data set, it is a better representation of average cost than the simple average of the cost data would indicate. SUMPRODUCT() is a handy Excel formula to use for weighted averages.

Total cost: 1328.54
Total units: 66
Average unit cost: 20.13
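The same computation outside Excel (a sketch with made-up costs and quantities, not the card's data):

```python
costs = [18.0, 20.5, 22.0]   # hypothetical unit costs per lot
quantities = [30, 20, 16]    # hypothetical units per lot

# Equivalent of SUMPRODUCT(costs, quantities) / SUM(quantities)
weighted_avg = sum(c * q for c, q in zip(costs, quantities)) / sum(quantities)
print(f"{weighted_avg:.2f}")
```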
49
Which of the following choices is the correct Coefficient of Variation (CV) of the data?
A. 10%
B. 14%
C. 20%
D. 26%
C. 20%

Mean = 18.98
Std Dev = 3.82
CV = 3.82 / 18.98 ≈ 20%

The Coefficient of Variation (CV) expresses the standard deviation as a percent of the mean.
50
Which of the following is the best measure of central tendency for a qualitative data set? A. Mean B. Median C. Mode D. A and B E. B and C F. A, B, and C G. None of the above
C. Mode The example of a qualitative or categorical data set given in the module is the color of cows. The mean or median color of a cow is meaningless. The mode - the value that occurs the most often - is a more useful measure in this case. The mode is a visual parameter, not a mathematically useful one.
50
True or False. The graph shown is skew right.
False. The direction of the skew is defined by the "tail" of the data, not the "hump," so this distribution is skew left (or skewed left), meaning the extreme values are more on the left side than the right. This is the case when the mean (red line) is to the left of the median (blue line).
51
True or False. Statistical tests may be performed on univariate data.
True. Statistical tests can be used on univariate data sets. Prime examples are the t-test for the mean, the chi-square test for variance, and the Kolmogorov-Smirnov (K-S) and chi-square tests for distribution.
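For illustration, a one-sample t-test takes only a few lines; the sample values and the hypothesized mean of 20 below are hypothetical:

```python
# One-sample t-test: is this univariate sample consistent with a hypothesized mean?
from scipy import stats

sample = [18.2, 21.5, 19.8, 23.1, 17.6, 20.4, 22.0, 19.1]  # hypothetical costs
t_stat, p_value = stats.ttest_1samp(sample, popmean=20.0)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A large p-value means the data are consistent with a true mean of 20.
```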
52
True or False. You should scatter plot and regress your cost data against anything you think of because more information is always better than less.
False. Just as important as a clear statistical relationship is that your relationships make good engineering sense. It is possible to find statistical relationships in data that are meaningless for predictive use.
53
When comparing the mean of a data set and the median of a data set, which of the following can be true? A. The median and the mean are equal. B. The median is greater than the mean. C. The median is less than the mean. D. You should never compare these measures. E. A and B F. A and C G. A, B and C
G. A, B and C If the distribution of the data is symmetric, the mean and the median will be equal. If, however, the distribution is skewed, the median can be either greater than or less than the mean, depending on the direction of the skewness.
54
True or False. Excel defaults are the best choices for graphical displays of information.
False. Understanding the presentation of the data has a great deal to do with the choices an analyst makes for the display. It is important to consider color, scale, and space when adjusting these displays.
55
This module continues the concepts introduced in Module 4 Data Collection and Normalization and presents a simplified methodology for analyzing data by:
describing data types to be analyzed, providing calculations for performing the analysis, presenting techniques for visually representing the results, and applying various machine learning algorithms for advanced analysis.
56
This module centers around making sense of data, the basis of cost estimates. As shown in figure 6.1, data originates in the past, describes the present, and estimates the future. This module focuses mainly on the present, exploring the core analytical methods of cost estimating that are the subject of Unit III. Some well-known analytical constructs included are:
measures of central tendency such as mean, median, and mode; measures of dispersion including variance, standard deviation, and coefficient of variation (CV); functional forms such as linear, power, exponential, and logarithmic equations; and supervised and unsupervised machine learning algorithms.
57
The equations presented in this module are a key part of a best-fit Cost Estimating Relationship (CER) toolkit and provide a foundation for Module 8 Regression Analysis.
Many of the concepts regarding central tendency and dispersion introduced in this module are explored further in Module 10 Probability and Statistics.
58
Data Assessment Analytical thinking requires an unbiased assessment of data and marks the first step in developing a cost or risk estimate. The result of the assessment provides defensible support for decision-making.
Exploratory Data Analysis is a term used to describe the technique of analyzing and investigating the data set and summarizing the main characteristics.
59
Data Assessment Types of Data There are several types of data sets, each of which determines the correct analytical methods.
Univariate data consists of a single variable, such as cost data for a single element or a set of historical cost growth factors for various programs in a given phase. It can be displayed graphically using histograms, bar graphs (shown in figure 6.1), or boxplots. Descriptive statistics (e.g., mean, median, standard deviation, CV) should be used to find the central tendency and dispersion of the data. Inferential statistics can be used to assess whether or not the data set seems to match a certain mean, variance, or distribution, but this is not often done with univariate data. True univariate data is quite rare in cost estimating; bivariate or multivariate data emerges when data is categorized or when other metadata is added.

Bivariate data has one independent variable and one dependent variable. An example of bivariate data is software development cost as a function of lines of code. It is generally displayed using a scatter plot. In this case, use both descriptive and inferential statistics (e.g., regression, t, and F statistics). Descriptive statistics find the central tendency and dispersion of the dependent variable, and inferential statistics test the relationship between the independent and dependent variables.

Multivariate data has several independent variables and one dependent variable. An example of multivariate data would be the cost of ship supplies as a function of two independent variables: crew size and hours underway. A 3-D plot or numerous pairwise plots of the dependent variable against the various independent variables can be used to display multivariate data.

Time series data differs from univariate, bivariate, and multivariate data and requires a different approach. Time series data is generally a bivariate data set with time as the independent variable. For example, consider cost growth as a function of years since program initiation, or worker productivity measured by quarter. As with any bivariate data set, time series data can be plotted on a standard Cartesian xy-plane with time intervals plotted on the x-axis. Use judgment to plot time scales, ensuring proper distribution of data points. Note that some software functions will force even spacing of the data points, resulting in a poor and misleading distribution. Unlike other data types, smooth trends are rarely observed in time series data. Instead, there are irregularities due to paradigm shifts, cycles, or autocorrelation. Table 6.1 defines several key terms used with time series forecasting. Regression analysis identifies smooth trends. Regression will not detect paradigm shifts and cycles in time series data, but scatter plots and moving averages can help identify them. Dividing the data into subgroups (e.g., 1980-85, 1985-90, 1990-95, 1995-2000, and 2000-05) enables comparison of descriptive statistics. Analysis of Variance (ANOVA) can be used to test for statistically significant differences between these subgroups; ANOVA is a statistical method in which the variation in a set of observations is divided into distinct components, as sketched below.
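A minimal sketch of the subgroup comparison described above, using hypothetical cost growth factors for three five-year subgroups:

```python
# One-way ANOVA across time-based subgroups of a time series.
from scipy import stats

# Hypothetical cost growth factors, grouped by five-year period.
period_1 = [1.10, 1.15, 1.08, 1.12]
period_2 = [1.22, 1.18, 1.25, 1.20]
period_3 = [1.11, 1.09, 1.14, 1.10]

f_stat, p_value = stats.f_oneway(period_1, period_2, period_3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests at least one subgroup mean differs,
# pointing to a possible paradigm shift between periods.
```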
60
Data Assessment Time series terminology Paradigm Shift
Indicates a marked change in the nature of the data occurring at some point or over some period. An example of a paradigm shift would be lower cost growth in programs entering production due to a specific change in acquisition law.
61
Data Assessment Time series terminology Cycles
Cycles are repeating periodic trends that can occur at any interval. They are often found in seasonal data (e.g., higher electricity costs in winter and summer because of increased heating and cooling costs). Maintenance actions such as ship overhauls are another example of cyclical data.
62
Data Assessment Time series terminology Autocorrelation
Autocorrelation is present when the variable (e.g., health, weather) in time t is correlated to the variable in time t − 1, which is in turn correlated to the variable in time t − 2. In other words, the value of the variable in the present is correlated to the value of the variable in the previous time period(s). Autocorrelation occurs due to dependencies within the data, usually when the data is from the same source.
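A quick check for lag-1 autocorrelation is to correlate the series with a copy of itself shifted by one period; the values below are hypothetical:

```python
# Lag-1 autocorrelation: correlate the series against itself shifted by one period.
import numpy as np

series = np.array([10.2, 10.8, 11.1, 11.9, 12.4, 12.1, 12.9, 13.5])  # hypothetical

r_lag1 = np.corrcoef(series[:-1], series[1:])[0, 1]
print(f"Lag-1 autocorrelation: {r_lag1:.2f}")
# Values near +1 mean each observation strongly echoes the previous one.
```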
63
Data Assessment Data Validation Data validation involves examining cost data descriptive statistics, assessing potential outliers, and comparing historical results Descriptive Statistics Descriptive statistics characterize and describe the data by revealing the central tendency of the data as well as the dispersion. They are calculated for each data group, especially for the cost data of the element in question.
Some important descriptive statistics include: sample size (i.e., the number of data points selected for analysis), mean, standard deviation, Coefficient of Variation (CV), and specialized averages.
64
Data Assessment Data Validation Specialized averages are either weighted averages for cost data representative of different quantities or moving averages for time series data. See Module 5 Inflation and Index Numbers and Module 11 Manufacturing Cost Estimating for more information and helpful examples of weighted and moving averages and how they are calculated.
Statistical software programs can calculate a standard group of descriptive statistics for a given data set. It is better, however, to use the formula for each statistic separately, as these individual formulas, unlike the results of a macro, will update automatically if there is a change in the data. Note that a representative sample size is needed to make confident, meaningful, comparative conclusions. Graphical displays such as bar charts and histograms provide a visual way to compare descriptive statistics.
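The core descriptive statistics named above can be computed compactly; the data values are hypothetical, and ddof=1 gives the sample (n − 1) standard deviation:

```python
# Core descriptive statistics for a univariate cost data set.
import numpy as np

costs = np.array([16.4, 18.1, 19.5, 20.2, 17.8, 21.3, 19.0])  # hypothetical

n = costs.size
mean = costs.mean()
std = costs.std(ddof=1)  # sample standard deviation (n - 1 denominator)
cv = std / mean          # Coefficient of Variation, unitless

print(f"n = {n}, mean = {mean:.2f}, std = {std:.2f}, CV = {cv:.1%}")
```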
65
Data Assessment Data Validation Outliers Outliers are data points that fall far from the central mass of the data and may distort both descriptive and inferential statistics. In normally distributed data, 4.6% of the data is more than two standard deviations from the mean, since 95.4% of probability density falls within two standard deviations of the mean (figure 6.2). This amount of variation is expected, and arbitrarily removing these points from the database may falsely tighten the distribution. See Module 10 Probability and Statistics for more detail and for alternative methods for calculating outliers in data that is non-normal.
When the data falls more than three standard deviations from the mean, it should be reviewed to determine if it should be included. Since 99.7% of probability density falls within three standard deviations of the mean for a normal distribution, data outside this range is rare. Note that in addition to this common rule, analysts use a variety of other methods to identify outliers. While the normal distribution is generally a good first approximation, these heuristics should not be used to identify potential outliers in significantly skewed data.

The presence of outliers in the data can bias the results of the regression analysis. Figure 6.3(a) demonstrates data from a database which includes a potential outlier that is 4.24 standard deviations from the mean. Since the likelihood of a point more than three standard deviations away is only about 0.3%, and four standard deviations is even rarer, this point warrants investigation as a possible outlier. Compare the result when the point is removed in figure 6.3(b): the slope of the regression line is steeper, and the R² value is higher. While it is important to be aware of the impact of outliers on the regression analysis, outliers must never be removed without justification and documentation. Analysis of outliers can provide insight into potential data issues. Furthermore, variation may be naturally present, and removing outliers without adequate reason unjustifiably alters the data. In the extreme case, iteratively removing the data points farthest from the mean can result in a data set with no variation at all.

Programs that are restructured or re-baselined midstream may be true outliers. For example, data from selected acquisition reports for the F-14A/D may contain data from the F-14D only. Removing the F-14D data is justified as there are several differences in the D variant. Also, programs that do not match the characteristics of the data set may be removed (e.g., a data point representing a helicopter does not belong in a set of missile data or, less extremely, an air-to-air missile may not belong in a data set of surface-to-air missiles). In general, outliers may be indicative of some of the data issues discussed in Module 4 Data Collection and Normalization and should prompt the analyst to explore the data set further.
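A minimal sketch of the three-standard-deviation screen described above, flagging points for review rather than deleting them; the values are hypothetical:

```python
# Flag points more than 3 sample standard deviations from the mean for review.
import numpy as np

costs = np.array([18.1, 19.4, 17.6, 20.3, 18.8, 19.9, 18.5,
                  20.1, 19.2, 18.7, 19.6, 90.0])  # hypothetical; last point suspect

z_scores = (costs - costs.mean()) / costs.std(ddof=1)

for value, z in zip(costs, z_scores):
    if abs(z) > 3:
        print(f"{value}: {z:.2f} std devs from the mean -- investigate, do not auto-delete")
```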
66
Data Assessment Data Validation Standard Factors
Compare results to historical data whenever possible to validate the data. One example of a source of historical factors available to U.S. Navy analysts is the standard factors handbook available from the Naval Center for Cost Analysis (NCCA). Such factors and heuristics exist for various services, commodities, and organizations. Doing these comparisons will add credibility and validation. If heuristics or standard factors are not available, a Subject Matter Expert (SME) or stakeholder review will often help to eliminate inappropriate methodologies or erroneous results.
67
Central Tendency and Distribution of Data Collected and normalized data provides input to powerful statistical graphics and visualization tools. Estimates of parameters can be improved by analyzing more data. Data collection can be the longest and most tedious step of the cost estimating process. Once collected and normalized, an analyst must consider:
What does it look like? Some useful visual tools for data analysis are scatter plots, histograms, stem and leaf plots, and box plots.
What is the best guess? Measures of central tendency, such as the mean, median, or mode, are useful when describing the best guess or probable outcome of a data set. These measures are single points used to represent the total data set.
How much remains unexplained? Variability, or spread, around the point estimate must be addressed. Standard deviation (the square root of variance) and the CV are common tools for measuring the variability.
How precise is it? Confidence intervals are used to measure the certainty or uncertainty around the point estimate. Confidence intervals are discussed in depth in Module 10 Probability and Statistics.
How can certainty be ensured? Basic statistical tests are introduced in this module. Advanced topics are presented in Module 10 Probability and Statistics.
This analytical framework is repeated in the topics of bivariate and multivariate analysis discussed in Module 8 Regression Analysis.
68
Measuring Central Tendency Three key measures of central tendency include mean, median, and mode.
Mean
The most popular measure of central tendency is the mean. The arithmetic mean (i.e., average) of a data set is the sum of the data values divided by the number of data points. For example, for a sample data set with n values x1, x2, ..., xn, sum the values and divide by the count. For the four values 250, 244, 221, and 280:

X̄ = (250 + 244 + 221 + 280) / 4 = 248.75

Often, the mean is not an actual value in the data set. Instead, it is a model of the data set and includes every value as part of the calculation. Because of this feature, the mean is especially sensitive to the influence of outliers. When outliers exert undue influence on the mean, consider using the median as the measure of central tendency.

Median
The median is the middle data point such that exactly half of the remaining data points are lower than the median and half are higher. When calculating the median, order the data from the lowest to the highest value. If there is an odd number of data points, the median is simply the middle value. For example, the median of the data set {2,5,7,9,25} is 7. If there is an even number of data points, the two middle values are averaged to get the median. The median of the data set {3,6,8,11,13,30} is 9.5 (i.e., the average of 8 and 11). When there are outliers, the median may be more representative of a data set than the mean. For example, the median is the most appropriate measure of central tendency in the case of reporting family income because it is not distorted by income extremes in the population. By contrast, the mean is not representative of the average family because it includes extremes (e.g., the destitute and billionaires). In reporting median income, half of all families earn more than the median income and half earn less.

Mode
The mode is the most frequently occurring point and is the least used of the three measures of central tendency. The mode is often most useful in discrete sets, particularly qualitative or categorical ones, and can be shown graphically using histograms. For example, the mean color of cows is useless, but the mode may be black-and-white piebald. This is a most-frequent, discrete-case example, which makes the mode a good measure to use. A quantitative discrete example would be a roll of a pair of dice, where the sum of seven appears most frequently. Since the mode is used to determine whether or not a value occurs frequently in a data set, if the most frequent occurrence of a data point is infrequent, the predictive value of the mode is low. For distributions, the mode of the distribution is its peak: the value where it attains its greatest probability mass in the discrete case or its greatest probability density in the continuous case.
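All three measures are available in Python's standard library; the data set below is hypothetical, with a repeated value so that a mode exists:

```python
# Mean, median, and mode with Python's standard library.
import statistics

data = [250, 244, 221, 280, 244]  # hypothetical; 244 repeats so a mode exists

print(statistics.mean(data))    # 247.8 -- sum divided by count
print(statistics.median(data))  # 244 -- middle value after sorting
print(statistics.mode(data))    # 244 -- most frequently occurring value
```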
69
Distribution Variance/Standard Deviation The variance is the average squared distance of the data points from the mean; it is a measure of the spread of a distribution. A lower variance indicates less dispersion, or spread (i.e., tighter data).
If the sample contains n data points, the number of data points that can vary independently from the mean is (n − 1), since the nth data point is exactly determined by the mean and the values of the other data points. This concept is called degrees of freedom. The (n − 1) denominator also ensures that the sample variance is an unbiased estimator of the population variance. The standard deviation of a distribution is calculated as the square root of the variance and measures the absolute distance of the data points from their mean:

s = √( Σ (xᵢ − x̄)² / (n − 1) )
70
Distribution Skew If the distribution of data in question is symmetric, the mean and the median will be equal as illustrated by the normal distribution in figure 6.4(b). While it is possible to contrive an asymmetric distribution where the mean and median are equal, asymmetric distributions like the ones shown ordinarily have unequal means and medians. If the mean and median of a distribution are not equal, it is said to be skewed.
If the median is lower than the mean, as illustrated by the lognormal distribution in figure 6.4(a), the distribution is said to be skewed right or skew right, since it stretches out to the right. A median higher than the mean, as illustrated by the beta distribution in figure 6.4(c), indicates that the distribution is skewed left or skew left. Notice that the direction of skew follows the tail of the data and not the hump.
71
Distribution Kurtosis Kurtosis is a statistical measure that describes the shape of a distribution's tails in relation to a normal distribution. Specifically, kurtosis focuses on the extremities of the distribution and tells us whether the tails contain extreme values. In statistical software, kurtosis is often reported relative to the normal distribution by subtracting three from the raw kurtosis value; the result is reported as excess kurtosis. There are three types of kurtosis (figure 6.5).
Mesokurtic: The kurtosis of a normal distribution, often set as a baseline with a kurtosis value of 0. The distribution has a normal tail that follows the shape expected by a standard bell curve. Leptokurtic: Kurtosis higher than 0, demonstrated by fatter tails and a sharper peak. More data is found in the tails, indicating a higher likelihood of extreme values (e.g., a Laplace distribution). Platykurtic: Kurtosis less than 0, demonstrated by thinner tails and a flatter peak. More data is found evenly distributed around the mean, and extreme values are less likely (e.g., the uniform distribution).
72
Distribution Coefficient of Variation The Coefficient of Variation (CV) is a measure of the size of the standard deviation relative to the mean and is expressed as a percentage. This descriptive statistic is unitless and therefore allows for comparison of the variability across distributions.
Low CV indicates less dispersion (i.e., tighter data). In cost estimating, a CV of less than 15% is desired. In practice, a low CV (e.g., 5%) would indicate that the mean of the cost data is a useful description of the data set. If the CV is much higher (e.g., greater than 15%) there could be a cost driver in the data set that causes the cost to vary. This situation should prompt the analyst to develop CERs in order to find the cost driver. If after running CERs the CV is not significantly reduced, the cost driver may have been incorrectly identified. Some data are inherently more noisy than other data, so the lack of noise reduction may be the fault of the data rather than misidentification of the cost driver.
73
Distribution Dispersion Examination of the dispersion present in the data is necessary because two different data sets may have the same mean but very different spreads (standard deviations). This examination is best done graphically using bar charts or histograms.
The data sets shown in figure 6.6 illustrate this point. Both sets have the same mean; however, the data in figure 6.6(a) is more tightly distributed around the mean, while the data on the right shows more dispersion. While both data sets have the same expected value (mean), a much wider range is expected when predicting future values based on the data in figure 6.6(b).
74
Data Visualization Data Visualization is the principle of taking a data set and displaying it in a manner that can be easily understood. This can often involve translating complex quantitative and qualitative information into easy-to-understand graphic or visual representations. Data visualization is fundamentally important throughout the data analysis process, from performing exploratory data analysis in the early stages of assessing your data set to sharing your final analysis results with decision makers.
Visualization tools reveal data distribution, patterns, trends, outliers, and missing data. Graphs can enable understanding; they make apparent what may otherwise be lost in the sea of data.
75
Data Visualization Scatter Plots Two-dimensional scatter plots are a useful graphical analytical tool for the cost estimator. Scatter plots graph two variables along two axes and can illustrate attributes such as central tendency and the dispersion. Scatter plots can:
help to detect trends in the data, possible shifts, or potential outliers;
generate a visual representation of the data using commonly available software programs; and
accommodate trend lines to help link the graph to the regression equation.
When adding a trend line, display the R² value and regression equation. Note that relationships must be tested for statistical significance further in the analysis process.
76
Data Visualization Variables In cost estimating, the cost of the element to be estimated is generally the dependent variable and is graphed on the y-axis. The independent variables are those which drive cost and are graphed on the x-axis. Examples of such variables include technical parameters that describe the system, operational parameters that indicate how the system will be operated, or the cost of another element that may influence or correlate to total cost. When looking for possible cost drivers, plot the cost of the element to be estimated against relevant independent variables in a series of 2-D graphs. When choosing a CER, select relationships that are statistically significant (see Module 8 Regression Analysis) and operationally reasonable. Heat maps (see Heat Maps) are another useful graph to consider when assessing correlation between variables.
Performing analysis with scatter plots is the first step in identifying the most relevant cost drivers. Cost drivers are discussed in depth in Module 3 Parametric Estimating. The three scatter plots shown in figure 6.7 graph cost as a function of three different variables. Cost and Variable 1 have the strongest correlation as shown in figure 6.7(a). Therefore, Variable 1 is a potential cost driver. Cost and Variable 2 shown in figure 6.7(b) are weakly correlated, and cost and Variable 3 in figure 6.7(c) are uncorrelated. Keep in mind that the R² value is merely an indicator of the correlation between two variables. The t- and F-statistics must be checked to determine the statistical significance of the relationship. These statistics are discussed in more detail in Module 8 Regression Analysis.
77
Data Visualization Axes and Function Types There are several different axes that can be used when scatter plotting.
All data can be plotted in unit space, the standard Cartesian xy-plane. Variable x is the independent variable plotted on the horizontal axis, and y is the dependent variable plotted on the vertical axis. All analysis can be done in unit space if the data is linear. Linear data is preferred to enable the use of linear regression to find CERs. However, many relationships are non-linear. If the data is non-linear, it is easy to see in unit space that the data does not approximate a line. It is much more difficult to judge by eye which non-linear form the data follows; the analyst should look at the potentially non-linear data in a transformed space to determine if the data becomes linear. In many cases of non-linear data, a transformation can be applied to make the data linear, allowing the analyst to perform linear regressions on the transformed data. The analyst can then transform the results back to unit space. Module 8 Regression Analysis discusses how to proceed with this transformation. When the correct transformations are applied, most cost driver relationships manifest themselves as a linear trend, as the sketch below illustrates for a power function. Linear functions are therefore the natural starting point when analyzing data for potential CERs.
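A minimal sketch of that transform-fit-transform-back workflow for a power function of the assumed form y = a·x^b; the driver and cost values are hypothetical:

```python
# Fit a power function y = a * x**b by linear regression in log-log space.
import numpy as np

x = np.array([100, 250, 500, 900, 1500])      # hypothetical cost driver (e.g., weight)
y = np.array([12.0, 23.5, 38.2, 58.9, 84.0])  # hypothetical cost

# In log space the model is linear: log(y) = log(a) + b * log(x).
b, log_a = np.polyfit(np.log(x), np.log(y), 1)  # slope b is the unit-space exponent
a = np.exp(log_a)

print(f"CER: y = {a:.3f} * x^{b:.3f}")
print(f"Prediction at x = 700: {a * 700**b:.1f}")  # transform back to unit space
```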
78
Data Visualization Histograms Histograms provide a common way to show the density of univariate data by grouping the data into several bins and plotting the bins on the horizontal axis with the frequency or relative frequency on the vertical axis. That is, the vertical column displays the number of observations or percentage of observations that fall within that bin. Histograms give a good sense of the distribution of the data since they are essentially depictions of empirical PDFs and can be useful in identifying potential outliers
Choose the bins that the data is separated into carefully. The two histograms shown in figure 6.11 use the same data set of univariate cost data for condominium monthly natural gas bills but with different bin sizes. The histogram in figure 6.11(a) allows spreadsheet software to automatically choose the number and size of bins. Almost all of the data ends up in one bin, which does not give a good idea of the distribution. In the histogram in figure 6.11(b), the analyst specified the number and size of bins to be used. The bar labeled 30 indicates the frequency of gas bills that are greater than or equal to $15.00 but less than $30.00. A simple average for the monthly gas bills was estimated to be $34.19. A skew-right distribution and potential outliers become evident, leading the analyst to investigate fitting a triangular, lognormal, or even exponential distribution to the data. The data points on the far right appear to be possible outliers. In this case, however, the lowest bin specified is $15, which obscures the fact that there is only one bill less than $11.15, so it is no longer clear that the distribution has a hump. The bins in figure 6.11(b) hide or distort the data because they create the impression that the distribution is exponential when, in fact, the data show a void under some value. In other words, there are few-to-no near-zero months. If the distribution were truly exponential, lower values around zero would be the most common. Thus, poor choices of histogram bins can hide important information.
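Bin choice can be explored quickly in code; the sketch below bins hypothetical monthly gas bills two ways to show how the apparent shape changes:

```python
# The same data binned two ways: bin choice changes the apparent distribution.
import numpy as np

bills = np.array([18, 22, 25, 27, 29, 31, 33, 35, 38, 42, 47, 55, 63, 78])  # hypothetical

counts_auto, edges_auto = np.histogram(bills, bins='auto')
counts_fixed, edges_fixed = np.histogram(bills, bins=np.arange(15, 91, 15))  # $15-wide bins

print("auto bins: ", edges_auto.round(1), counts_auto)
print("fixed bins:", edges_fixed, counts_fixed)
# Compare the count vectors: overly coarse bins can hide the hump or the tail.
```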
79
Data Visualization Bar Charts Bar charts present categorical data as rectangles whose height or length correspond to their associated value for each category. They are an excellent way to compare descriptive statistics among different groups.
The bar chart in figure 6.12 displays the mean Cost Growth Factor (CGF) of the Development Estimate (DE), computed as a dollar-weighted average, for work performed by several different companies. As the chart title indicates, these data come from Selected Acquisition Reports (SARs) for developmental programs with an Engineering and Manufacturing Development (EMD) phase only. See Module 9 Cost and Schedule Risk Analysis for more information on SAR data. Standard deviations are shown for each group as a dashed yellow line. Note that these are standard deviations, not confidence intervals for the means. To get confidence intervals for the mean, divide the standard deviation by the square root of n. From this graph, it is apparent that Company 1 has the highest mean cost growth. The fact that Company 4 is a marginally better supplier than Company 3 is also apparent from the graph because the two values of mean and standard deviation are combined in a meaningful way. The mean for Company 4 is less than that for Company 3, and there is less variation in Company 4's cost growth. This information is discernible from the bar chart but much harder to determine from just looking at the data. Inferential statistical tests (in this case, t-tests) can be performed (see Module 10 Probability and Statistics), but a good graphical display like figure 6.12 can be effective in convincing the viewer.
80
Data Visualization Box Plots Box plots, also called box and whisker diagrams, are a standardized way of summarizing the distribution of a data set.
Box plots display the five-number summary: minimum, first quartile, median, third quartile, and maximum. In a typical box plot like the one shown in figure 6.13, the rectangle spans the range between the first and third quartile. The median is shown inside the rectangle, and the minimum and maximum values are displayed at the ends of the whiskers. Outliers are also plotted. The box plot in figure 6.14 shows exactly the same data as the gas-bill histogram in figure 6.11. This box plot demonstrates that 25% of the time the data falls between $11 and $12. The median point is $16.50, indicating that 25% of the data is between $12 and $16.50. To the right of the median and contained within the box (from $16.50 to $37.00) is the next 25% of the data set. Finally, 25% of the data is to the right of the majority of the data set. On the far right are potential outliers. Given that this data set is the cost of monthly gas bills, is there an explanation for the potential outliers? Most December, January, and February bills are 4-5 times the cost of gas bills in the summer months; perhaps it was bitter cold during those winter months, or perhaps the owner stayed home all day and therefore consumed more. The important thing is to understand why these points appear to be outliers. Again, appropriate treatment of outliers was discussed in Outliers.
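The five-number summary behind a box plot is easy to compute directly; the bill values below are hypothetical:

```python
# Five-number summary: minimum, Q1, median, Q3, maximum.
import numpy as np

bills = np.array([11, 11.5, 12, 14, 16.5, 20, 28, 37, 45, 52])  # hypothetical

q0, q1, q2, q3, q4 = np.percentile(bills, [0, 25, 50, 75, 100])
print(f"min={q0}, Q1={q1}, median={q2}, Q3={q3}, max={q4}")
# Points beyond 1.5 * (Q3 - Q1) outside the box are typically drawn as potential outliers.
```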
81
Data Visualization Heat Maps Heat maps are an important chart type for cost analysts, most commonly used when analyzing correlation between variables. Heat maps depict values for a main variable of interest across two axis variables using a grid of colored squares. The axis variables can be divided into ranges, similar to a bar chart or histogram, where the color of each cell indicates the value of the main variable in the corresponding cell range.
In the example shown in figure 6.15, monthly precipitation for Seattle is plotted against buckets of rainfall accumulation for every day from 1998 to 2018. The darker shaded cells represent higher counts of total days with those levels of accumulation. For each month, we can quickly see which accumulations occurred more frequently. When used to illustrate correlation, a heat map can be referred to as a correlogram, which replaces each of the variables on the two axes with a list of numeric variables in the data set. Each cell depicts the relationship between the intersecting variables, such as a linear correlation. These are often used during the exploratory stage of data analysis, when assessing the correlation between independent and dependent variables, or assessing whether multicollinearity may exist between independent variables that are candidates for a parametric cost estimating relationship. In figure 6.16, the correlations between three variables for irises are shown. By looking at the chart, for example, you can see that petal width and sepal length, and petal width and petal length, are highly correlated.
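A correlogram starts from a pairwise correlation matrix, which pandas computes in one call; the iris-like values below are hypothetical stand-ins:

```python
# Pairwise linear correlations between candidate variables (the basis of a correlogram).
import pandas as pd

df = pd.DataFrame({  # hypothetical measurements
    "sepal_length": [5.1, 6.3, 5.8, 7.1, 6.5],
    "petal_length": [1.4, 4.7, 4.1, 5.9, 5.2],
    "petal_width":  [0.2, 1.4, 1.3, 2.1, 2.0],
})

print(df.corr())  # values near +/-1 flag strong correlation (and possible multicollinearity)
```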
82
Data Visualization Waterfall Chart A waterfall chart can be used to separate pieces of a stacked bar chart which allows users to focus on one element at a time, show a starting point, increases and decreases, and the resulting ending point.
This can be particularly useful in cost estimating when highlighting changes between two estimates (i.e., a baseline estimate and an update or what-if case). Figure 6.17 shows two examples of waterfall charts. The first chart (figure 6.17(a)) demonstrates the change in program Estimate at Completion (EAC) over time, providing a unique visual to supplement earned value analysis. Beginning on the left, the first column shows the EAC as estimated in September of 2022. Moving right, we can see the changes in the EAC estimate by CLIN from September 2022 to August 2024. In figure 6.17(b), the forecasted change in inventory for parts is shown each year beginning in 2021 and moving through 2032. This chart supplements failure analysis to help estimate the cost of replacement parts through the program life cycle.
83
Machine Learning Algorithms Machine Learning (ML) is a field of artificial intelligence in which algorithms learn from data to make predictions and inform decisions. These algorithms involve the application of statistical, mathematical, and numerical techniques. ML attempts to perform tasks sometimes associated with cognition. Analysts commonly use ML to perform tasks such as recommending products, facial recognition, and autonomous driving. ML is not a replacement for existing methods and applications of critical thought in cost estimation, but it offers an opportunity to enhance data analysis by providing additional tools to address specific use cases in the data analysis process.
It is important to have a basic understanding of machine learning to recognize both the opportunities and limitations associated with utilizing these tools. There are three primary techniques of machine learning which will be introduced: supervised, unsupervised, and reinforcement. A key difference between supervised and unsupervised learning is that supervised learning primarily analyzes labeled data, while unsupervised learning analyzes unlabeled data. A summary of the ML topics discussed in this module is illustrated in figure 6.18.
84
Machine Learning Algorithms Supervised Learning Supervised learning is a method of machine learning which utilizes trained algorithms to classify data or make predictions utilizing data that is properly labeled. When using this technique, there must be a clearly labeled true answer against which the machine can be scored.
Supervised learning uses a data set with input variables and output variables from which it is trained to learn the relationship between the input and output data. Then once this algorithm is given new input data, it will predict an outcome for the unknown output data. Supervised learning algorithms fall into two main categories: classification (predicting categorical values) and regression (predicting numerical values).
85
Machine Learning Algorithms Classification Classification is an ML technique used to predict the label or class of unseen input data based on training data (i.e., actuals). Similar to the existence of many options for performing regression analysis, many different algorithms exist that can perform classification. Five common algorithms will be introduced: Naïve Bayes, decision tree, random forest, k-Nearest Neighbors, and logistic regression.
Consider the dataset in figure 6.19 of spacecraft payloads for a large space program. This table provides notional information on weight, power, payload class, and cost value. The dataset has been randomly generated for ICEAA's purposes to represent a real-life dataset a cost analyst working on space systems might encounter. By plotting the data with weight on the x-axis and power on the y-axis, the difference between classes can be visualized (figure 6.20); Class A/B instruments tend to be heavier and consume more power, while Class C/D instruments are lighter and use less power. To further illustrate the concept, imagine that the dataset now contains 1000 points for each class. Now it is clearly visible in figure 6.21 that the Class A/B instruments have a different mean and standard deviation than the Class C/D instruments in both weight and power (the x and y axes).
86
Machine Learning Algorithms Naive Bayes Naïve Bayes is a statistical model based on joint probability distributions. New predictions (inferences) are given by z-score probabilities via Bayes' theorem:

P(A|B) = P(B|A) × P(A) / P(B)
where the probability of A given B is calculated as the probability of B given A times the probability of A divided by the probability of B. In this example (see table 6.2), we calculate the mean and standard deviation of each group (Class A/B instruments, and Class C/D instruments). For a known weight and power, we can predict the probability of belonging to each class based on the historical distribution of each, using Z-scores and the normal distribution.
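A minimal sketch of that class-conditional scoring, fitting a normal distribution per class and per feature; all means, standard deviations, priors, and the query point below are hypothetical:

```python
# Gaussian Naive Bayes by hand: score a new instrument against each class.
from scipy.stats import norm

# Hypothetical per-class (mean, std) for weight (kg) and power (W), plus priors.
classes = {
    "A/B": {"weight": (2000, 400), "power": (1500, 300), "prior": 0.5},
    "C/D": {"weight": (900, 250),  "power": (600, 200),  "prior": 0.5},
}

new_point = {"weight": 1700, "power": 1200}

scores = {}
for label, p in classes.items():
    # Naive assumption: features are independent given the class, so densities multiply.
    likelihood = (norm.pdf(new_point["weight"], *p["weight"]) *
                  norm.pdf(new_point["power"], *p["power"]))
    scores[label] = likelihood * p["prior"]

total = sum(scores.values())
for label, score in scores.items():
    print(f"P({label} | data) = {score / total:.3f}")
```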
87
Machine Learning Algorithms Decision Trees Decision trees are a common ML algorithm that is utilized in both classification and regression analysis. This method can work well with nonlinear data that is in groups; it creates linear branches by splitting data via yes/no questions. An application of this technique is depicted in figures 6.26 and 6.27.
This decision tree classifier has been fit to the data and generates two splitting points, which can be read as the following statement: if power < 1046, then "C/D"; else if weight < 1482, then "C/D"; else "A/B". In this case, new data points with power < 1046 Watts or weight < 1482 kg would be predicted as Class C/D, and anything at or above both 1046 Watts and 1482 kg would be predicted as Class A/B. This forms a boundary similar to the Naïve Bayes classifier, but with different methodology.
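The fitted tree reduces to two nested if-statements; the sketch below is a direct transcription of the splits quoted above:

```python
# The two-split decision tree from the text, written as plain branching logic.
def classify(power_watts: float, weight_kg: float) -> str:
    if power_watts < 1046:
        return "C/D"        # first split: low power -> Class C/D
    if weight_kg < 1482:
        return "C/D"        # second split: low weight -> Class C/D
    return "A/B"            # high power and high weight -> Class A/B

print(classify(800, 2000))   # C/D (fails the power split)
print(classify(1200, 1600))  # A/B (passes both splits)
```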
88
Machine Learning Algorithms Random Forests Random forests are ensemble models, a machine learning approach that combines multiple individual models
Ensemble models help overcome technical challenges with single estimators and can improve the overall prediction power of the model. Random forest algorithms are a specific type of ensemble model that work by creating multiple decision trees during training and then combine the outputs to make more accurate predictions. Each tree in the forest learns from a random subset of data and features. The resulting aggregated forest has better generalization and accuracy, helping to reduce model overfitting that might occur in an individual decision tree.
89
Machine Learning Algorithms Logistic Regression Logistic regression helps analysts solve binary classification tasks by estimating the probability of a certain class or event occurring.
To do so, the logistic function is applied to a linear combination of input features to transform predictions into probabilities between 0 and 1. For a cost analyst, logistic regression can help make binary cost decisions, classify cost risks, and provide feature importance information via the coefficients assigned to each variable.
90
Machine Learning Algorithms Regression Regression is a supervised learning method that is used to predict numerical outcomes (e.g., cost as a continuous variable).
Several regression methods and techniques such as linear regression, non-linear regression, and multivariate regression are explored in more detail in Module 8 Regression Analysis. In the context of regression analysis in ML, the following algorithm techniques are introduced: linear regression, decision trees, random forests, and kNN. All of these regression techniques can be used to predict a numerical outcome, and there is no single best method to apply. Each type has its own strengths and weaknesses depending on the data being assessed.
91
Machine Learning Algorithms Linear Regression While typically associated with predicting continuous values, univariate and multivariate regression can also play an important role in ML classification problems. For more information about linear regression, see Module 8 Regression Analysis.
Univariate regression for classification examines the relationship between a single predictor variable and an outcome to predict a continuous target. For classification, these continuous predictions map to class labels, generally by setting threshold values. In other words, if a prediction falls below a certain threshold, the algorithm assigns the data point to one class, and if it falls above the threshold, to a different class. Note: Logistic regression is a form of univariate regression that deals with binary classification and models probabilities rather than continuous values. In cost estimating, consider the example visualized in figure 6.33: there is a single significant predictor, project duration, that has been observed to influence project costs. The analyst uses univariate regression on project duration to predict a continuous cost variable and then classifies the predicted cost into different cost categories (low, medium, and high) based on thresholds that reflect historical cost ranges.

Multivariate regression for classification considers multiple predictor variables to estimate an outcome and then maps the continuous outputs to class labels. Techniques like Linear Discriminant Analysis (LDA), which deals specifically with continuous independent variables and a categorical dependent variable to reduce the dimensions of a data set, and polynomial regression, which models the relationship between the dependent variable and independent variables as n-degree polynomials, are special forms of multivariate regression that leverage multiple predictors for separating classes based on feature combinations. In cost estimating, consider the example visualized in figure 6.34: both project size and materials used affect the overall cost of a project, and these predictors can be used in multivariate regression with the outcomes then mapped to different categories of risk for budget overruns (low, medium, high) or into probabilities for likely to exceed budget or not likely to exceed budget (figure 6.35).
92
Machine Learning Algorithms Decision Trees for Regression Like the decision tree algorithm previously introduced for classification, a regression decision tree uses branches to split data into yes or no questions and can work well with nonlinear data in groups.
Rather than predicting a category as when used for classification, regression decision trees predict a numerical value. The example in figure 6.36 illustrates how a decision tree can predict spacecraft cost based on weight (kg) and design life (duration in years).
93
Machine Learning Algorithms Random Forests for Regression Random Forest is another common machine learning technique used for regression which combines multiple decision trees (an ensemble model) by averaging their results.
In this approach, each tree is made using a random subset of variables. In practice, this approach commonly uses 100 or more trees and is generally more accurate than using a single decision tree. However, this approach can be computationally expensive. Figure 6.37 illustrates an example of a random forest prediction.
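A minimal sketch using scikit-learn, assuming that library is available; the feature values and costs are hypothetical:

```python
# Random forest regression: an ensemble of decision trees whose predictions are averaged.
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: [weight_kg, design_life_years] -> cost ($M)
X = [[800, 3], [1200, 5], [1500, 5], [2100, 7], [2600, 10], [3000, 12]]
y = [45.0, 72.0, 88.0, 130.0, 185.0, 220.0]

model = RandomForestRegressor(n_estimators=100, random_state=0)  # 100 trees in the forest
model.fit(X, y)

print(model.predict([[1800, 6]]))  # prediction averaged across all trees
```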
94
Machine Learning Algorithms Unsupervised Learning Unsupervised learning differs from supervised learning in that it is not trying to predict a specific value (category or numeric) but rather attempts to find items that are similar.
Essentially, supervised learning leverages a labeled dataset to recognize patterns within predefined categories, while unsupervised learning analyzes unlabeled data to discover and form categories based on inherent similarities in the data. Examples of this include recommendation engines such as Netflix, Amazon, and Spotify. There is no single correct movie or song to choose for you next; the algorithm finds ones that are similar to those that you have watched or liked in the past. This recommendation engine example utilizes a common algorithm in unsupervised learning: clustering (e.g., k-means). Neural networks are another common unsupervised learning technique.
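A minimal clustering sketch using scikit-learn's k-means, assuming that library is available; the unlabeled project data below is hypothetical:

```python
# k-means clustering: group unlabeled projects by similarity, no labels required.
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: [duration_months, cost_$M]
X = [[12, 4.1], [14, 4.8], [13, 4.4], [36, 21.0], [40, 24.5], [38, 22.7]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignments discovered from the data
print(kmeans.cluster_centers_)  # centroid of each discovered group
```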
95
Machine Learning Algorithms Neural Networks Neural networks are ML algorithms used for complex, continuous predictions. These algorithms are based on computational models inspired by the way the human brain processes information. A model consists of interconnected nodes (neurons) that interact to solve complex problems. A neural network consists of an input layer where the model receives the initial information, hidden layers where computations occur, and an output layer that produces the final result. Neural networks use forward propagation to apply calculations and then backward propagation for learning.
In cost estimating, neural networks can classify projects by learning complex, non-linear relationships between variables and cost-related outcomes. The model can train on historical cost data to learn patterns between multiple features. Suppose an analyst has a dataset of project features: project duration, project size, material and labor costs, and region. Each project has a label of either within budget, over budget, or significantly over budget. A neural network can train on all project features using the known budget labels to learn output probabilities for each budget class based on project characteristics. New projects can be fed into the model, and a probability is returned for each budget class. The model then selects the highest probability for classification. Figure 6.39 shows the neural network architecture diagram for this example.
96
Machine Learning Algorithms Reinforcement Learning Reinforcement learning is an advanced ML approach that responds to an environment in real time, with a specified goal. Examples include an autonomous vehicle with the goal to safely transport individuals from one location to another, or a computer capable of competing against a human in a game of chess with the goal to win. Some differences between reinforcement learning and supervised/unsupervised learning include:
no supervisor to guide the training; no training with a large pre-collected dataset (data is provided dynamically via feedback from the real-world environment with which you are interacting); iterative decision making over a sequence of time-steps where inferences are run repeatedly, navigating through the real-world environment as you go. Reinforcement learning is most often applied to control or decision tasks in systems that interact with the real world. In cost estimating, this technique can make sequential decisions by interacting with an environment and receiving rewards or penalties based on the outcomes of its actions. An analyst can use this process to help optimize the allocation of resources to different projects. Analysts can also use reinforcement learning agents to monitor project changes and dynamically adjust cost estimates to reflect current conditions. The technical aspects of implementing this technique are considered advanced and are beyond the scope of this module.
97
Examples Validating Engineering Judgments Often, estimates are made based solely on engineering judgment, an informal SME assessment of costs based on their intuition of the program, with informal data analysis supporting the proposed cost. Validation of this informal assessment must occur before accepting an assumption that can later become a risk (e.g., expert opinion as in this case). Expert opinion carries the risks of heuristic and cognitive bias including:
availability heuristic, or overestimating events which have greater memory to the SME; conservatism, or insufficiently revising one's belief when presented with new data; Dunning-Kruger effect, where a SME may overestimate their ability; optimism bias, which leads to insufficiently assessing risk or true cost; and planning bias, or underestimating how long a task requires for completion. Without validation, data analysis, cross-checks, and documentation, the opinion holds little value to the customer. The cost analyst must make any assumptions formed from expert opinion explicit so that the estimate becomes reproducible. See more on expert opinion in Module 2 Costing Techniques. For more on this explicit documentation in the form of a Basis Of Estimate (BOE), see Module 14 Contract Pricing.
98
Examples Evaluating Outliers Incorrectly identifying outliers can cause errors in estimates, and mistakes are easy to make with inadequate data analysis. There are many advantages to applying multiple techniques which are not mutually exclusive. This example reinforces the importance of collecting, understanding, and analyzing the correct data and using multiple analytical techniques.
In this example, the data for four ships shown in figure 6.42 are analyzed for hours per ton required for each ship's shakedown (i.e., sea trial) where each ship's name corresponds to its hull tonnage (e.g., the tonnage for the hull of ship DD 963 is 963 tons). The program requires an estimate for the 5th ship, DD 967. From the historical data, the technical expert thinks that the factor 0.29 for DD 963 is too low for a first ship and discards this point as an outlier. Apply an objective method by scatter plotting the data to validate this assumption. Test the technical expert assumption by analyzing the plot in figure 6.43. Including DD 963, it is clear that all points except DD 980 fall along a straight line. Without DD 963, the trend line has such a steep slope that the resulting hours/ton from the curve is unrealistic at the 5th ship DD 967. It appears that the expert may have rejected the wrong outlier. The cost estimator should determine whether the expert may have correctly rejected the data point for DD 963 due to factors not known to the cost estimator as well as investigate the data point for DD 980. This process provides credibility to the analysis by testing the assumptions provided by the expert for validity. Outliers should not be removed unless they are truly unrepresentative of the data.
99
Summary
Good data analysis begins with good data. A successful analyst has knowledge of the data, understands the processes and techniques of analysis, and possesses the ability to think critically and objectively about results. Graphics are the best presentation form of data analysis and include scatter plots, histograms, and bar charts. Understanding the types of data available and data characteristics enable the analyst to properly select the right analytical methods. Finally, ML algorithms provide additional layers of data analysis and innovative ways of classifying and grouping project data.
100
Outlier Identification Rules
Chauvenet's Criterion assumes a normal distribution and multiplies the tail probability by the number of data points, yielding the expected number of such extreme points. If fewer than half a point is expected, or equivalently if the tail probability is less than 1/(2n), then the value may be considered an outlier. The formula may be tailored by replacing the phi function (the standard normal cdf) with the cumulative distribution function (cdf) of the desired distribution.

Grubbs' Test assumes a normal distribution and calculates a test statistic, G, as the maximum absolute deviation from the sample mean divided by the sample standard deviation (a Z-score of sorts). If that test statistic exceeds the critical value, which is computed from a t-distribution percentile, then the null hypothesis of no outliers is rejected.

Dixon's Q Test uses a test statistic equal to the gap (i.e., the distance between the prospective outlier and its nearest neighbor) divided by the range of the data set. A table of critical values must be used to evaluate the test.

The IQR-based approach is similar to that used in the boxplots presented in Box Plots. Outliers are identified as those points more than a certain multiple of the interquartile range (IQR) outside that range (i.e., below the first quartile or above the third quartile), as sketched below.
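The IQR rule is the easiest of the four to sketch in code, here using the common 1.5 multiplier; the data values below are hypothetical:

```python
# IQR-based outlier rule: flag points beyond 1.5 * IQR outside the quartiles.
import numpy as np

data = np.array([11, 12, 14, 16, 17, 19, 21, 24, 62])  # hypothetical; 62 is suspect

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < low) | (data > high)]
print(f"Fences: [{low}, {high}]  Potential outliers: {outliers}")
```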