Unit III - Module 6 - Basic Data Analysis Principles Flashcards

(103 cards)

1
Q

After you have collected and normalized your data, the next step is analyzing it

A

This module
will focus on the basic principles and tools of data analysis, both numerical and graphical, the
application of which is a crucial first step in developing a cost estimate or risk analysis.

2
Q

Data Analysis Overview
Key Ideas

Visual Display of Information
Central Tendency of Data
Dispersion (Spread) of Data
Data accumulation
Outliers

A
3
Q

Data Analysis Overview
Analytical Constructs

Descriptive statistics
* Mean, median, mode
* Variance, std deviation, CV
* Functional forms

A
4
Q

Data Analysis Overview
Practical Applications

Making sense of your data

A
5
Q

Data Analysis Overview
Related topics

Parametric
Distributions
Probability and Statistics

A
6
Q

Data Analysis Within The Cost Estimating Framework

Past: Understanding your historical data

Present: Developing estimating tools
(Average cost; Mean = $34.19)

Future: Estimating the new system
(Confidence Intervals; Confidence Interval = +/- $5.76)

A

7
Q

Here we present univariate cost data, in this case, monthly natural gas bills for a condominium over
about a six-year period.

The past data that have accumulated are displayed as a histogram, which is
a common way to show the density of univariate data. (This and other methods, such as the box
plots and stem-and-leaf graphs shown in the Related and Advanced Topics section, are all
essentially variants of plotting numbers on a number line.) We will revisit this particular graph, but
the standard Excel labels for histograms indicate that the data points in each bar are less than the
value shown on the x-axis, so that the bar labeled 30 gives the frequency of gas bills that are greater
than or equal to $15.00 but less than $30.00.

A

In the present, we develop estimating tools, in this case, a simple average: about $34.19 per month.
To apply this average for estimating or budgeting purposes, we would like to get a sense of its
precision, which we can obtain by calculating a confidence interval.

In this case, we see that the true
mean of future monthly gas bills is very likely to be within plus or minus $5.76 of the calculated
sample mean. Note, however, that most individual observations are outside this interval: the hot
summer months are much cheaper, and the cold winter months are much more expensive. Thus,
knowing the average precisely helps us (for household budgeting purposes, say) only if we have a
cost-smoothing deal with our utility company. If we want to accurately predict individual bills from
month to month, we need to seek out a cost-driver variable (such as mean monthly temperature).

This would lead to a set of bivariate graphs such as those shown on the Cost Estimating Framework
slide in Module 8 Regression Analysis.
Note that estimating tasks are by definition always conducted in the present. We don’t travel back in
time to collect data, nor forward in time to make our estimates. This “triptych” Cost Estimating
Framework is simply meant to remind us that data, the basis for our estimates, always originates in the past.

8
Q

Data Analysis Outline

Core Knowledge
* Types of Data
* Univariate Data Analysis
* Scatter Plots
* Variables
* Axes and Function Types
* Data Validation
* Descriptive Statistics
* Outliers
* Rules of Thumb

Summary

Resources

Related and Advanced Topics

A

This module will cover various types of data sets and the functional relationships that may be present
therein; how to uncover these relationships by scatter plotting the correct variables on the correct set
of axes; and how to perform elementary data validation by examining descriptive statistics,
appropriately treating potential outliers, and applying rules of thumb. As with all modules, we’ll
conclude the Core Knowledge section with a summary and present resources for reference and
further study.

9
Q

Types of Data

Univariate
Bivariate
Multivariate
Time Series

The first step in data analysis is to think about what type of data you have. While we present this as
a linear process, you’ll need to revisit this step each time you get new data. We’ll cover univariate,
bivariate, multivariate, and time series data sets.

A

Univariate
‐ Single variable
‐ Use descriptive and inferential statistics

Bivariate
‐ One independent variable and one dependent
variable (i.e., y is a function of x)
‐ Use descriptive and inferential statistics

Multivariate
‐ Several independent variables and one dependent
variable (i.e., y is a function of x1, x2, and x3)
‐ Use descriptive and inferential statistics

10
Q

Univariate data consists of a single variable, such as cost data for a single element or a set of
historical cost growth factors for various programs in a given phase. It can be displayed graphically
using histograms (shown), stem-and-leaf plots, or boxplots.

A

In this case, descriptive statistics (mean,
median, standard deviation, Coefficient of Variation (CV), etc.) should be used to find the central
tendency and dispersion of the data.

Inferential statistics are not as frequently used with univariate data, but can be used to assess whether the data set seems to match a certain mean, variance, or
distribution.

Note that true univariate data sets are quite rare in cost estimating, but we start with them in our first-principles approach. As soon as you ascribe categories to data or add other metadata (such as the year of the costs), you are arguably in the realm of bivariate or multivariate
data.

11
Q

Bivariate data has one independent variable and one dependent variable. For example, software
development cost as a function of the number of lines of code. It is generally displayed using a
scatter plot (shown).

A

In this case, both descriptive and inferential statistics (regression, t and F statistics, etc.) should be used. Descriptive statistics are calculated to find the central tendency and dispersion of the dependent variable.

Then, inferential statistics are used to test the relationship between the independent and dependent variable, e.g., is the number of lines of code a good
predictor of software development cost?

12
Q

Multivariate data has several independent variables and one dependent variable. For example, the
cost of supplies as a function of both crew size and hours underway. It can be displayed using a 3-D
plot (shown) or pairwise plots of the dependent variable against the various independent variables.
In this case, both descriptive and inferential statistics should be used.

A

As with the other data types,
descriptive statistics give an idea of the central tendency and dispersion of the dependent variable.
Inferential statistics – such as multiple regression – are used to test the relationship between the
independent variables and the dependent variable, e.g., do crew size and hours underway together
give a good prediction of supplies cost?

13
Q

Types of Data

Time as the independent variable
* Interval matters! Make sure you use an XY (Scatter) and
not a Line Chart in Excel unless intervals are equally
spaced

Smooth trends are rarely found in time series

A

Possible rare exceptions (e.g., corrosion over time)

“Standard” trends such as investment and inflation
Look for paradigm shifts, cycles, autocorrelation

Use moving averages, divide data into groups and
compare descriptive statistics

Regression is often not useful, as it only picks up
smooth trends (unless AR1/ARIMA models are used)

ANOVA and mean comparisons are more useful

14
Q

Time series data are quite different from univariate, bivariate, and multivariate data, and thus, a
somewhat different approach must be used.

Time series data are generally bivariate data with time
(in the form of year, quarter, month, etc.) as the independent variable: for example, cost growth as a
function of the year of program initiation, or worker productivity measured by quarter.

(For examples
like the latter, tracking production metrics over time, see Module 11 Manufacturing Cost Estimating.)

Time series data can also be used to check CERs by including a time variable; time data can reflect
different points in the life cycle of a system. As with any bivariate data set, you can plot it on a standard
Cartesian xy-plane, but the time intervals must be plotted correctly on the x-axis.

In Excel, you should generally choose the XY (Scatter) chart type instead of Line; the latter may treat
the x-axis as Category instead of Time-scale, which forces even spacing of your data points. If your
data are equally spaced, you may choose to use a line chart.

A

Unlike other data types, we rarely see smooth trends in time series data. Something like corrosion
over time may be an example of a rare exception. There are also expected “standard” trends, such
as the fact that investments and inflation generally follow an exponential function over time due to the
compounding of rates (and one would hope that the former outpaces the latter!). Instead, we usually
see evidence of paradigm shifts, cycles, or autocorrelation. A paradigm shift would show a marked
change in the nature of the data occurring at some point or over some period. An example of a
paradigm shift is finding lower cost growth in programs entering production after 1986 (at the end of
the Reagan ramp-up period) than before 1986.

Since regression only picks up smooth trends, it will not detect shifts and cycles and therefore is
often not a useful tool in time series data. Instead, we should use scatter plots and moving averages
to look for possible paradigm shifts and cycles. In addition, we can divide the data into subgroups
(e.g., 1980-85, 1985-90, 1990-95, 1995-2000, and 2000-05) and compare descriptive statistics.
Analysis of Variance (ANOVA) can be used to test for significant differences between subgroups.
AR1/ARIMA (AutoRegressive Integrated Moving Average) models forecast a value in a time series
as a linear combination of past values, past errors, and current and past values of other time series.
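A rough illustration of the moving-average and subgroup approach described above (not from the original module; the data below are synthetic, and the series was constructed with a notional shift built in):

```python
import numpy as np

# Synthetic yearly cost-growth factors, 1980-2005, with a notional
# "paradigm shift" after 1986 (illustrative values only)
rng = np.random.default_rng(0)
years = np.arange(1980, 2006)
cgf = 1.25 + 0.08 * rng.standard_normal(years.size)
cgf[years > 1986] -= 0.15

# A 5-year moving average smooths the noise so shifts and cycles stand out
window = 5
moving_avg = np.convolve(cgf, np.ones(window) / window, mode="valid")

# Divide the data into 5-year subgroups and compare descriptive statistics
for start in range(1980, 2005, 5):
    grp = cgf[(years >= start) & (years < start + 5)]
    print(f"{start}-{start + 4}: mean={grp.mean():.2f}, sd={grp.std(ddof=1):.2f}")
```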

15
Q

Univariate Data Analysis

Visual Display of Information
* Histogram, stem-and-leaf, box plot
What does it look like?

Measures of Central Tendency
* Mean (or median or mode)
What’s your best guess?

Measures of Variability
* Standard deviation (or variance),
coefficient of variation (CV)
How much remains unexplained?

Measures of Uncertainty
* Confidence Interval (CI)
How precise are you?

Statistical Tests
* t test, chi square test, Kolmogorov-Smirnov (K-S) test
How can you be sure?

A

This analysis framework is mirrored in
bivariate and multivariate analysis.

When conducting univariate data analysis, the following questions should be addressed (preferably in
the provided order). We will address methods used to answer these questions in the first section of
this module.

*What does it look like? Some useful visual displays in univariate data analysis are histograms,
stem-and-leaf plots, and box plots. Histograms, the most common graphical tool in univariate data
analysis, are discussed in the slides that follow; an explanation of stem-and-leaf plots and box plots
can be found in the Related and Advanced Topics section of this module.

*What’s your best guess? Measures of Central Tendency, such as the mean, median, or mode, are
useful when describing your “best guess” or probable outcome of a dataset. These measures are
single points used to represent the total data set.

*How much remains unexplained? You must also address the variability around your point
estimate (i.e., your mean). Common tools for measuring the variability include the standard deviation
(or variance) and the coefficient of variation.

*How precise are you? Confidence intervals are used to measure the certainty (or uncertainty)
around your point estimate. Confidence intervals are introduced in this module, and are discussed in
depth in Module 10 Probability and Statistics.

*How can you be sure? Statistical tests can be conducted with univariate data sets. Some tests are
introduced in this module; the “how” is left for Module 10 Probability and Statistics.

16
Q

Visual Display - Histograms

Histograms should be used to give an idea of the distribution of the data

Skew-right distribution, possibly
Exponential, Triangular, or Lognormal

Tip: Create the histogram manually using chart type
Column so that results update when the data change

A

One useful type of graph is the histogram. Here we have graphed six years’ worth of personal
monthly gas bills, measured in dollars (as first seen on the Cost Estimating Framework slide earlier).
This data, and the corresponding data on gas used per month (measured in units called therms), will
show up occasionally throughout this module. Though we would not use this specific data in day-to-
day cost estimating, we find it helpful here, since characteristics of these data can be used to
illustrate several of the topics we will explore (for example, time series and identification of outliers).

Histograms group data into several “bins” and plot the bins on the horizontal axis with the frequency
(or relative frequency) on the vertical axis. That is, the vertical column displays the number of
observations (or percentage of observations) that fall within that bin. By convention, the bin labels
indicate the upper end of the bin, so that the first bar represents the number of monthly bills less than
$15.00, the second bar between $15.00 and $30.00, and so on. Histograms give a good sense of the
distribution of the data, since they are essentially depictions of an empirical probability density
function (pdf), and can be useful in identifying potential outliers.

In the above histogram, a skew-right distribution is evident, leading us to investigate fitting a
triangular or lognormal (or even exponential) distribution to the data. The data points on the far right
are possible outliers.

Histograms are the primary means for cost estimators to visually display univariate data (and, as we
shall continue to emphasize throughout this module, it is extremely important to look at your data).
There are other possible methods, such as the stem-and-leaf plots and boxplots that are shown in
the Related and Advanced Topics section. Remember, of course, that monthly data is, strictly
speaking, not univariate – it has cost and month!
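A quick sketch of how such a histogram could be tabulated outside of Excel (synthetic skew-right data standing in for the gas bills; the bin width and values are illustrative only):

```python
import numpy as np

# Synthetic skew-right "monthly gas bill" data (74 months, illustrative only)
rng = np.random.default_rng(1)
bills = rng.lognormal(mean=3.4, sigma=0.5, size=74)

# Excel-style bins: each bin is labeled by its upper end, width $15
edges = np.arange(0, bills.max() + 15, 15)
counts, _ = np.histogram(bills, bins=edges)
for upper, count in zip(edges[1:], counts):
    print(f"< ${upper:5.0f}: {'#' * count}")
```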

17
Q

Visual Display – Histograms and their bins

  • It is important to carefully consider the number of bins used in a histogram
  • Experiment with intervals to be sure you understand the data
A

It is necessary to choose bins carefully. The two histograms shown use the same data set, but
different bin sizes. The histogram on the left allows Excel to automatically choose the number and
size of bins. Almost all of the data ends up in one bin, so we do not get a good idea of the
distribution.

In the histogram on the right, the analyst specified the number and size of bins to be
used. Here, the distribution is clearly skewed right with a potential outlier. In this case, however, the
lowest bin specified is $15, so we lose the fact that there is only one bill less than $11.15, and it is no
longer clear that the distribution has a “hump.” The bins on the right hide or distort, because they
create the impression that the distribution is exponential, when in fact, the data show a void under
some value … in other words, there are few-to-no near-zero months. If the distribution were truly
exponential, as the right histogram suggests, lower values around zero would be the most common.
As demonstrated in this example, poor choices of histogram bins can hide important information!
“Sometimes a beautiful graph is an orchid, sometimes it’s a Venus Flytrap.”

Because statistical samples have an inescapable random element, there is always a tradeoff in the
numbers of bins between texture (more bins) and smoothness (fewer bins). In general, you should
play around with the intervals for your bins to get a sense of the data before settling on the final
display. Follow the link to the Related and Advanced Topics section for some possible rules for
determining number of bins and bin width.
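Two commonly cited rules of thumb for the number of bins (these may or may not be the ones given in the course's Related and Advanced Topics section) are Sturges' rule and the square-root rule:

```python
import numpy as np

n = 74  # sample size, matching the gas-bill example
sturges = int(np.ceil(1 + np.log2(n)))   # Sturges' rule: 8 bins here
sqrt_rule = int(np.ceil(np.sqrt(n)))     # square-root rule: 9 bins here
```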

18
Q

Central Tendency - Mean

The mean is the Expected Value of a random variable

In Excel, use the “AVERAGE( )” function

Means of example data sets:
Gas bill (74 months), $26.52
Therms used (74 months), 14.8

A

Throughout our discussion of descriptive statistics, we will refer to sample statistics (here, we
calculate the sample mean). Module 10 Probability and Statistics will further discuss the relationship
between your sample and the population from which it comes. Because we can never have perfect
knowledge of a population, our statistics generally represent best estimates of population parameters
based on our sample.

The arithmetic mean or average of a data set is simply the sum of the data values divided by the
number of data points. In this case, we add up the costs of all the bills (and corresponding therms
used) and divide by the number of bills, 74.

You can use the =AVERAGE() function in Excel, which
automatically skips blanks in a range of cells.

Follow the link to the Related and Advanced Topics section if you want to learn a mental math trick
for more easily calculating means (or at least approximations) in your head.

Recall that the arithmetic mean (AM) is distinct from the geometric mean (GM) and harmonic mean
(HM) introduced in Module 5 Inflation and Index Numbers.

19
Q

Central Tendency - Median

The sample median is the “middle” data point, with 50% of the remaining observations falling below that
point, and 50% above
If a data set has an odd number of points, the middle value is the median
* The median of the data set {2,5,7,9,25} is 7
If a data set has an even number of points, the two middle values are averaged
* The median of the set {3, 6, 8, 11, 13, 30} is 9.5 (average of 8 and 11)

A

In general, the kth percentile is the point with k% of the data below and (100-k)% of the data above
* Quartiles (25, 50, 75), deciles (10, 20, …, 80, 90), icosatiles (5, 10, 15, …, 95)

When there are extreme data points, the median may be more representative than the mean, because outliers impact the mean more than the median (the median is more robust)
* “Representative” is a descriptive term, not a mathematical term
* There are many mathematical reasons to prefer the mean over the median
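The examples above can be checked directly (a minimal sketch; NumPy's default percentile method uses linear interpolation):

```python
import numpy as np

print(np.median([2, 5, 7, 9, 25]))          # 7.0 (odd count: middle value)
print(np.median([3, 6, 8, 11, 13, 30]))     # 9.5 (even count: mean of 8 and 11)
print(np.percentile([2, 5, 7, 9, 25], 25))  # 5.0, the first quartile
```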

20
Q

Mean, Median, and Skew

The mean and the median are equal if the distribution is symmetric

Unequal means and medians are an indication of skewness

Median < Mean
Skew(ed) Right

Median = Mean
Symmetric

Median > Mean
Skew(ed) Left

A

If the distribution or data set in question is symmetric, the mean and the median will be equal (as
illustrated by the normal distribution in the center of the slide). While the other two relationships
illustrated above are not unfailingly true, they generally hold for the “regular” types of distributions
common in cost estimating and risk analysis. (Check out the Wikipedia article on skewness for the
“legal disclaimer.”) Inequality of mean and median is a general indication that the distribution or data
set is skewed.

If the median is lower than (to the left of) the mean, as illustrated by the lognormal
distribution on the left, this is an indication that the distribution or data set is skewed right or skew
right, since it stretches out to the right.

A median higher than (to the right of) the mean, as illustrated
by the beta distribution on the right, is an indication that the distribution or data set is skewed left or
skew left. Notice that the direction of skew follows the “tail” of the data and not the “hump”.
In these continuous distributions, the blue median line splits the area under the curve exactly in half.
The mean, shown in red, is the point at which the x-axis would balance if the pdf indicated its linear
density.

The mode, which we’ll introduce next, is the x-value at which the peak of the distribution
occurs. Note that for unimodal distributions such as the ones shown, the mode falls on the opposite
side of the median from the mean.

21
Q

Central Tendency - Mode

The sample mode is the most frequent point to occur in a data set
* The mode of a distribution is its peak
* Value with the greatest probability mass (or density)

The mode of the set {2,4,4,7,9,9,9} is 9

The mode is a descriptive metric answering the question “what happens most frequently?”
* It can help give a visual idea of what the distribution looks like
* Most useful in discrete data

A

The mode is defined as the most frequently occurring point. This is the least used of the three
measures of central tendency. The measure is used to answer the question “what happens most
frequently?” so it’s only useful when the mode is fairly common. Note that the mode is simply a
plurality, not a majority. If the mode is a point that only occurs three times in a set of 100 data points,
it’s not really common enough for us to expect it to happen. More often we look at the modal bin for a
histogram, indicating the most likely range of values when compared with other adjacent non-
overlapping intervals of equal width.

The mode of a distribution is its peak, or the value where it attains its greatest probability mass (in the
discrete case) or density (in the continuous case). More on that in Module 10 Probability and
Statistics.

The mode is often most useful in discrete sets, particularly qualitative or categorical ones. For
example, the “mean color” of cows is useless, but the mode may be black-and-white piebald. This is
a most-frequent and discrete-case example, which makes the mode a good measure to use. A
quantitative discrete example would be a roll of a pair of dice, where the sum of seven (7) appears
most frequently, a key fact for you craps players out there! In continuous distributions, since the
probability of any point is zero, the “most common” idea is less useful: what you “expect” out of a
continuous random variable is the mean (the “expected value”), not the mode (the humped part of the
pdf).

The mode can be a good parameter to describe a distribution; it helps to get a sense of what the
picture looks like. The bottom line is that the mode is a visual parameter, one way or the other, not a
mathematically useful parameter. It is most useful in discrete data, and most useful when the modal
value occurs highly frequently.
22
Q

Variability –
Variance / Standard Deviation

The sample variance measures the deviation of the data points from their mean

In Excel, use the “VAR( )” function

The sample standard deviation is simply the square root of the variance

The standard deviation is expressed in the same units as the original data
* In Excel, use the “STDEV( )” function

A

The variance is the average squared distance of the data points from their mean; it is a measure of
the spread of a distribution. A lower variance indicates less dispersion (tighter data).

The standard deviation of a distribution is simply the square root of the variance and measures the absolute
distance of the data points from their mean. When the exact population distribution is not known,
which is always the case in practical applications, you can find the variance of the sample; the
sample standard deviation is again the square root of the variance.

When we find the sample variance, we divide the sum of the squares of the distances from the mean by (n-1) instead of n. This
is because the variance and standard deviation measure the distance from the mean, which is itself
calculated from the data; if we have collected n data points, the number of data points that can vary
independently from the mean is (n-1), since the nth data point is exactly determined by the mean and
the values of the other data points. This idea is called “degrees of freedom.” This denominator also
ensures that the sample variance is an unbiased estimator of the population variance. In
particular, because the sample mean minimizes the sum of squared deviations (an exercise for the
student!), the numerator is almost certainly smaller than if the true population mean were used, so
the denominator needs to be adjusted accordingly.
Note that while the formula on the left is easier to remember – it is essentially the definition – the
formula on the right is easier to calculate, since it involves fewer computational operations (squaring
each data point instead of having to take the delta from the mean first and then square it). In fact,
you can compute it with a simple two-column table, with the xs and the x squareds, presaging the
four-column table that we’ll see in Module 8 Regression Analysis. We recommend that you know
both formulae (and be able to derive the latter from the former in a pinch!).
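The two formulas referenced above are not reproduced in this text; the standard definitional form ("the formula on the left") and computational form ("the formula on the right") of the sample variance are:

$$
s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2
\;=\; \frac{\sum_{i=1}^{n}x_i^2 \;-\; n\bar{x}^2}{n-1}
$$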

The units of measure for variance are squared units, which is not a useful measure. Because of this,
the standard deviation is often the statistic reported as the measure of the spread of a data set. In
our earlier example (therms of natural gas used), the variance would be reported in squared therms;
the standard deviation would be reported in therms.
Variance and standard deviation will be revisited in Module 10 Probability and Statistics. For now, try
to remember that sample statistics, like s, are generally the estimators of the corresponding
population parameters, usually denoted by the counterpart Greek letter. In this case, the sample
standard deviation, s, is the estimate for the population standard deviation, sigma (σ).
23
Q

Variability - Coefficient of Variation

The Coefficient of Variation (CV) expresses the standard deviation as a percent of the mean

Tip: Low CV indicates less
dispersion, i.e., tighter data.
15% or less is desired

Large CVs indicate that the mean is a poor estimator
* Consider regression on cost drivers
* Examine data for multiple populations (outliers)

CVs of example data sets:
* Gas bill, 74.4% (69.2%)
* Therms used, 104.2% (102.5%)

Note that sums and averages tend to have smaller variances

A

The coefficient of variation (CV), usually expressed as a percentage, is a measure of the size of the
standard deviation relative to the mean. This descriptive statistic is unit-less and therefore allows an
analyst to compare the variability across distributions.

In practice, a low coefficient of variation (say, 5%) would indicate that the average (mean) of the cost
data is a useful description of the data set. On the other hand, if the CV is much higher (say, greater
than 15%), there should be a cost driver in the data set that causes the cost to vary. This should
prompt the analyst to develop CERs in order to find the cost driver. If after running CERs the
coefficient of variation is not significantly reduced, you may have incorrectly identified the cost driver.

It is important to keep in mind that some data are inherently noisier than others, so the lack of noise
reduction may be the fault of the data rather than a misidentification of the cost driver. The analogous
calculation of the CV for CERs is presented in Module 8 Regression Analysis.

The CVs for the sample data set are shown, with the first number being the CV of all the individual
months and the second number (in parentheses) being the CV of the 12 monthly averages. This
illustrates the principle that sums and averages tend to have relatively smaller variances, as indicated
by the CV. In this case, the CVs aren’t decreased much, since it is the month-to-month variation in
temperatures that is driving the change in demand for gas, not the year-to-year variations for each
given month.

It is also interesting to note that the variation in demand (therms) is greater than the variation in cost.
This implies that the unit price for gas must be fluctuating in opposition to demand (negative
correlation) to provide a damping effect of sorts. This runs contrary to what we learned in Econ 101:
increased demand is supposed to drive prices up, not down!
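A small synthetic demonstration of the principle that averages have smaller CVs than the individual observations they summarize (the data and magnitudes here are made up, not the module's gas data, so the shrinkage is larger than in the card's example):

```python
import numpy as np

rng = np.random.default_rng(2)
monthly = rng.lognormal(mean=3.4, sigma=0.5, size=(6, 12))  # 6 years x 12 months

def cv(a):
    return a.std(ddof=1) / a.mean()

print(f"CV of all individual months: {cv(monthly.ravel()):.1%}")
print(f"CV of the 12 monthly averages: {cv(monthly.mean(axis=0)):.1%}")
```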

24
Q

Dispersion and CV

These two data sets have the same mean, but different standard deviations

This data has a lower CV (17%)
and is more tightly distributed

This data has a higher CV
(38%) and has more
dispersion

A

It is always important to look at the dispersion present in the data. Two different data sets may have
the same mean, but much different spreads. The data sets shown here illustrate this point. Both
sets have the same mean. However, the data on the left is more tightly distributed around the mean,
while the data on the right shows more dispersion. While both data sets have the same expected
value (mean), we would expect a significantly wider range when predicting based on the right-hand
data.

25
Confidence Interval Illustration

A confidence interval (CI) suggests to us that we are (1-α)·100% confident that the true parameter value is contained within the calculated range

A confidence interval or CI tells us, roughly, that we are (1-α)·100% confident that the true parameter value is contained within the calculated range. The end points of the interval are equivalent to the critical values calculated for a two-tailed hypothesis test. There is a one minus alpha probability of the true parameter being within the two critical values and an alpha chance of being above or below them. For example, a 95% confidence interval has a 95% chance of containing the true value of the parameter, and a 2.5% chance each of being too low or too high.

Two-tailed confidence intervals center on the best estimate of a parameter – in this case, the mean cost of a data set – and follow the general form of mean plus or minus a defined number of standard errors. The number of standard errors is driven by the desired confidence level (i.e., whether you are 90% confident, 95% confident, etc.). In this case, the best guess for the population mean is the sample mean, x-bar, and the standard error of this estimate is the sample standard deviation divided by the square root of the number of data points (because we always know the mean more precisely than we know any given data point). These are both called out on the graph. The number of standard errors is given by the critical value from a t distribution with n-1 degrees of freedom. In the illustrated case, we had to go a little bit farther than three standard deviations out into the tails of the bell-shaped t to capture 95% of the probability (alpha = 0.05).

Confidence intervals can be computed for many different parameters and distributions. Confidence Intervals will be revisited in Module 8 Regression Analysis and Module 10 Probability and Statistics.
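Written out, the general form described above (a standard result; the slide's own notation is not reproduced here) is:

$$
\bar{x} \;\pm\; t_{\alpha/2,\;n-1}\cdot\frac{s}{\sqrt{n}}
$$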
26
Sample Sizes - Sufficiently Large n

In general, we prefer n to be large … how large is a function of our tolerance for error
* The 68.3% CI for the mean is roughly CV/√n

So, for CVs ranging around 30%, we get the following 68.3% confidence intervals:

n      +/-
> 4    15%
> 9    10%
> 16   8%
> 25   6%
> 36   5%

If we would like to be able to make judgments within about 5 percentage points with a CV of 30%, we need n ≈ 36
* We may have no choice but to deal with small n
* In any case, we can calculate the range of the estimated mean
It is very important to look at the size of your data set. Statistically significant differences among data sets cannot be shown with very small sample sizes. The sample size needed is a function of the underlying dispersion present in the population and of our tolerance for error.

The Central Limit Theorem (informally) states that, for a random sample of size n, the sampling distribution of a sample statistic (most notably, the sample mean) can be approximated by a normal distribution as the sample size becomes large. From a practical standpoint, the natural question is “how large?” You may have heard before that your sample size must be of at least size 30. It is important to recognize that 30 is not a “magic number”; it is simply the sample size at which point most sampling distributions approximate a normal distribution (as determined by statistical research). Populations with small tails or little skewness won't require a sample that large; sampling distributions from highly skewed or heterogeneous populations require a greater sample. The Central Limit Theorem is discussed further in Module 10 Probability and Statistics.

The mean of a sample of size n from a normally-distributed population is normally distributed with the same mean and a standard deviation equal to that of the population divided by the square root of n. Since the normal distribution has 68.3% of the probability density within one standard deviation of the mean, this leads to a nice rule of thumb: if your data is roughly normally distributed, an approximate 68.3% Confidence Interval (CI) for the mean, expressed as a percent, is the coefficient of variation divided by the square root of the sample size (CV/√n).

Suppose we have a CV of around 30%. Then a sample size of 4 would give a 68.3% confidence interval of +/- 15%. That is, 68.3% of the time you will be within 15% of the true population mean. If we would like to make judgments within about +/- 5%, say, with a CV of 30%, we would need a sample size of about 36. In practice, we may be forced to deal with smaller data sets – for certain commodities there may not be that many different kinds of analogous systems in the universe! – and in many cases the data set will be small and out of the analyst’s control. In any case, we must always be aware of the range of error (cf. Confidence and Prediction Intervals in Module 8 Regression Analysis).
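The rule of thumb can also be inverted to size a sample: since the approximate 68.3% CI is CV/√n, the n needed for a target half-width is

$$
\frac{\mathrm{CV}}{\sqrt{n}} \le \text{target}
\quad\Longrightarrow\quad
n \ge \left(\frac{\mathrm{CV}}{\text{target}}\right)^{2},
\qquad \text{e.g., } \left(\frac{30\%}{5\%}\right)^{2} = 36.
$$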
27
Prediction Intervals

The previous confidence interval illustration gives the true average cost within a certain range
If we want to know the predicted cost of a new item within a certain range, we need a prediction interval
The PI suggests to us that we are (1-α)·100% confident that the next observation will be contained within the calculated range
The larger standard error in the PI accounts for both the uncertainty in the mean (captured by the CI) and the uncertainty in individual observations
A prediction interval or PI is used to predict the cost of a new item within a certain range. A prediction interval is a type of confidence interval that predicts the distribution of individual future points. The formula shown on this slide is used to calculate prediction intervals.

Note that if we square the standard error from the CI (s/√n), add it to the sample variance (s²), and take the square root, we get the standard error for the PI! (This sort of Pythagorean relationship is common in calculations related to statistical variation.) The first term accounts for our uncertainty in estimating the mean, and the second accounts for the natural variation in observations.

Prediction Intervals will be revisited in Module 8 Regression Analysis.
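The "Pythagorean" combination described above, written out (a standard form, not copied from the slide):

$$
\bar{x} \;\pm\; t_{\alpha/2,\;n-1}\cdot\sqrt{\frac{s^2}{n}+s^2}
\;=\; \bar{x} \;\pm\; t_{\alpha/2,\;n-1}\cdot s\sqrt{1+\frac{1}{n}}
$$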
28
Statistical Tests

t test for mean
* Is the Cost Growth Factor (CGF) for NAVAIR programs different than 1.0?

Chi square test for variance
* Is 30% a reasonable CV to use for this variable? Should a t test for equal means assume equal variances?

Chi square test for distribution
* Are Line-Replaceable Unit (LRU) failures uniform across all deployed units?

Kolmogorov-Smirnov test for distribution
* Is the normal distribution appropriate for modeling uncertainty in design weight?
Statistical tests may be performed on univariate data. To determine the type of test to perform, you should first consider the question you are trying to answer. For example:

*Is the Cost Growth Factor (CGF) for NAVAIR programs different than 1.0? Here, you should conduct a (one-sample) t test.

*Is 30% a reasonable CV to use for this variable? A chi square test for variance would be appropriate here.

*Are Line-Replaceable Unit (LRU) failures uniform across all deployed units? A chi square test for distribution (goodness of fit test) would be appropriate to determine the fit of the observed data distribution to a specified data distribution, often a uniform one.

*Is the normal distribution appropriate for modeling uncertainty in design weight? You could use a Kolmogorov-Smirnov (K-S) test to determine whether or not your data are normally distributed. The K-S test is considered to be a more robust test than the chi square goodness of fit test.
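A minimal sketch of two of these tests in Python (the data are synthetic, SciPy is assumed available, and the fitted-parameter K-S test shown is only approximate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
cgf = rng.normal(loc=1.15, scale=0.2, size=20)  # hypothetical cost growth factors

# One-sample t test: is the mean CGF different from 1.0?
t_stat, p_value = stats.ttest_1samp(cgf, popmean=1.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# K-S test against a normal fitted to the sample (approximate, since the
# parameters were estimated from the same data being tested)
ks_stat, ks_p = stats.kstest(cgf, "norm", args=(cgf.mean(), cgf.std(ddof=1)))
print(f"KS = {ks_stat:.2f}, p = {ks_p:.3f}")
```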
29
Scatter Plots

Variables
Actions
Function Types
After you have characterized your data set, the next step in data analysis is to scatter plot. Scatter plots provide the analyst with uncanny insight and a wealth of important information. In this section, we’ll discuss which variables you should plot on which kind of axes, and what types of functional relationships you might hope to uncover.
30
Scatter Plots

A picture is worth a thousand words!
* A scatter plot can reveal a wealth of information about relationships present in the data

Create scatter plots in Excel by using the Chart Wizard – XY (Scatter)

Add a trend line in Excel by right clicking the plotted data and choosing Add Trendline
* Helps link graph and equation
* Look at inferential statistics later

Scatter plots are the single most useful tool in all of analysis … they are “the gift of sight” to the analyst
Scatter-plotting is a crucial step in data analysis. After thinking about what type of data you have (univariate, bivariate, multivariate, time series), you should scatter plot the data. For all but univariate data, scatter plots are essential for getting a visual idea of what relationships are present in the data. These plots can help to detect trends in the data, possible shifts, potential outliers, etc.

Scatter plots can easily be created in Excel by using the Chart Wizard and choosing XY (Scatter). Trend lines can be added to help link the graph and the regression equation. When you add your trend line, be sure to choose the options to display the R² and regression equation. Remember that relationships must be tested for statistical significance, but this can wait until a bit later on. For now, the scatter plot will give us “the gift of sight.”

Always do “meta analysis” (analysis of analysis). Many beautiful facts emerge from meta analyses, as well as many error discoveries. Meta analysis is one of the hallmarks of excellence. Graphics are the best form of both analysis and meta analysis.
31
Scatter Plots - Variables

Plot cost (or other variable of interest, e.g., hours) as the dependent variable

Look at a variety of different independent variables
* Technical parameters such as weight, lines of code, etc.
* Performance parameters such as speed, accuracy, etc.
* Operational parameters such as crew size, flying hours, etc.
* Cost of another element

Think about which variables you believe should drive cost and collect that data!
What types of variables should be used when creating scatter plots? Generally, the cost of the element you are estimating should be treated as the dependent variable and graphed on the y-axis. You should consider a variety of different independent variables (graphed on the x-axis). For example, technical parameters that describe the system; operational parameters that indicate how the system will be operated; or the cost of another element that may influence or be correlated to this one. Thus, in a series of two-dimensional (2-D) graphs, the cost of the element to be estimated may be plotted against several independent variables. Resist the temptation to do this willy-nilly, ad infinitum! For now, look at as many independent variables as you believe to be possible cost drivers. When you are choosing a Cost Estimating Relationship (CER), you should pick relationships that are statistically significant (see Module 8 Regression Analysis) and make good engineering sense.
32
Scatter Plots – Cost Drivers

Scatter plots can help identify cost drivers

R² interpretation: % of variation in y explained (linearly) by variation in x
Scatter plots are an important first step in identifying the most relevant cost drivers. (Cost drivers were first discussed in Module 3 Parametric Estimating.) The three scatter plots shown here graph cost as a function of three different variables. Cost and Variable 1 have the strongest correlation, so Variable 1 is a potential cost driver. Cost and Variable 2 are weakly correlated, and cost and Variable 3 are uncorrelated. Keep in mind that the r-squared value is merely an indicator of the relationship between two variables – t and F statistics must be checked to determine the statistical significance of the relationship (see Module 8 Regression Analysis).

You should be aware that there is no magical “threshold” for r-squared. People will often espouse levels for r-squared above which a regression is useful, but in truth, if a regression is statistically significant, then any r-squared is of some help, because it represents a reduction in the variability with which we can predict the dependent variable. (It turns out there actually is a seldom-used beta test for r-squared in OLS regression!) As we’ll see later, r-squared can also be affected by (apparent) outliers, so be on the lookout for “barbells” on your graph. The bottom line is that r-squared is not the single best indicator of a good CER or cost driver relationship.

You should also know that r-squared values tend to be higher or lower in different fields of endeavor. For example, in medical research and behavioral sciences, r-squared tends to be low because of the irreducible variability (beyond a certain point) of organisms and behavior, whereas in physics and other sciences – where variability can be mostly limited to measurement error, with other effects assumed to be negligible (friction, wind resistance, the curvature of the earth, the gravitational attraction of Pluto, …) – variability can be quite small. Cost estimation is a middle case between the other two cited. Thus engineers are apt to expect high r-squareds even though cost estimation does not usually yield such values.
33
Scatter Plots – Unit Space

Data should first be plotted in unit space
* x is plotted on the horizontal axis (x-axis) and y is plotted on the vertical axis (y-axis)

If the data have a non-linear relationship when plotted in unit space, investigate how the data can be “made” linear
* Non-linear relationships can often be transformed to appear linear through the use of natural logs
* Transformed data can then be regressed linearly
* Before the widespread use of computers, non-linear data was graphed on semi-log or log-log paper
There are several different axes that can be used when scatter plotting. All data can be plotted in unit space (the standard Cartesian xy-plane), and this should always be your first step. In this case, x is the independent variable plotted on the horizontal axis, and y is the dependent variable plotted on the vertical axis. If your data are linear, you can perform all of your analysis in unit space. If your data are non-linear, it is easy to see in unit space that the data do not approximate a line; it is much more difficult to assume non-linear, look at the data in transformed space, and then determine that the data are linear.

We would like to have linear data so that we can use linear regression to find cost estimating relationships (CERs). However, many relationships are non-linear. Fortunately, in many cases, transformations can be made to make the data linear. Then, linear regressions can be performed on the transformed data, and the results transformed back to unit space (see Module 8 Regression Analysis for more detail). Before computers made transformation and graphing quick and easy, non-linear data was usually plotted on semi-log or log-log graph paper.

When exploring data that appear to be non-linear, it helps to understand the types of functions that may be underlying the data. We will focus on power and exponential functions, but other possibilities exist, such as logarithmic and polynomial functions. Even trigonometric functions may prove useful for modeling seasonal or other cyclical effects (cf. the earlier discussion on time series). The appropriate non-linear function may be identified by examining a scatter plot, but it also helps to know why a particular type of function might be used. For example, it is known from marine engineering that the energy consumption of a ship is directly proportional to the cube of its velocity.
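A short sketch of the log transformation idea (the true a and b below are chosen arbitrarily): a power function y = a·x^b becomes the straight line ln y = ln a + b·ln x, so a linear fit in log-log space recovers the exponent.

```python
import numpy as np

a_true, b_true = 3.0, 0.8
x = np.linspace(1, 100, 50)
rng = np.random.default_rng(4)
y = a_true * x**b_true * rng.lognormal(0, 0.05, x.size)  # multiplicative noise

# Linear fit in log-log space: slope estimates b, intercept estimates ln(a)
b_hat, ln_a_hat = np.polyfit(np.log(x), np.log(y), 1)
print(f"b = {b_hat:.2f}, a = {np.exp(ln_a_hat):.2f}")
```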
34
Removing Outliers

Do not remove an outlier from the data without a good reason!
* Doing so removes some of the variation present in history
* Doing so can be a form of “cooking the data”

Good reasons for removing an outlier:
* Program was restructured or divided
* “One of these is not like the others” (e.g., a helo in a set of missile data)

Bad reasons for removing an outlier:
* “Too high”
* “2 standard deviations away from the mean” [!]
Removing outliers without adequate reason will remove some of the variation that is naturally historically present. This can be considered “cooking the data” and should not be done. (In the extreme case, if you iteratively removed the data point which was farthest from the mean, you’d end up with a data set of one with no variation at all. Reductio ad absurdum, indeed!)

However, there are good reasons to remove outliers. Programs that are restructured or divided midstream may be outliers. For example, when looking at data from Selected Acquisition Reports for the F-14A/D, one would want to exclude later reports that are F-14D only. Also, programs that do not match the characteristics of the data set may be removed, e.g., a data point representing a helicopter does not belong in a set of missile data (or, less markedly, an air-to-air missile may not belong in a data set of surface-to-air missiles). In general, outliers may be indicative of some of the “data issues,” such as incomplete data, that we discussed in Module 4 Data Collection and Normalization, and should prompt you to re-ask “What’s in the number?!”

Outliers should never be removed merely because they are judged to be “too high” or “too low.” This is subjective and often incorrect, as we will see when we discuss “Two Cautionary Tales” later in this module. In addition, points that are only two standard deviations away from the mean should not be removed, as we would expect about 5% of the points to be that far away.
35
Outlier Identification Rules

Chauvenet’s Criterion – based on normal distribution properties
Grubbs’ Test – based on normal distribution properties
Dixon’s Q Test – distributional basis unclear; will not detect two approximately equal outliers
IQR-based – can customize k based on choice of distribution, α, and n; for example, in a normal distribution, k = 3 implies that < 5% of points should fall outside the range
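A minimal sketch of the IQR-based rule from the list above (the data are made up; k = 1.5 is the conventional default, and the larger k = 3 mentioned on the card would flag only more extreme points):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k is tunable."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

costs = np.array([8.2, 9.1, 9.8, 10.4, 11.0, 11.7, 25.0])
print(costs[iqr_outliers(costs)])  # flags 25.0 for further investigation
```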
36
Rules of Thumb

Compare your descriptive statistics to historical rules of thumb

Sanity check!

Tip: Comparison to history and cross checks separates the thorough from the sloppy
You should always try to compare your results to historical data. This will add credibility and validation. Results that are vastly different from historical trends should raise a warning flag. Is there a reason that your data should be different from history? If not, you should be concerned about your results. One source of historical rules of thumb is the Standard Factors handbook available from the Naval Center for Cost Analysis (NCCA). Such factors and rules of thumb exist for various services, commodities, and organizations, and we recommend that you inquire locally. Even if rules of thumb or standard factors are not available, a good old-fashioned sanity check will often help ferret out inappropriate methodology or simply an erroneous result. A billion dollars for annual fuel costs of a single aircraft is not a reasonable number – at least not yet!
37
Data Analysis Summary

Steps of basic data analysis:
1. Scatter plot – visual depiction of the relationships in the data
2. Descriptive statistics – calculate the means and CVs
   * If the CV is under 15%, the average may be a sufficient predictor; focus more attention on elements with higher CVs
   * If the CV is over 15%, focus on this element using regression analysis to look for a better predictor than the average (CER development)
3. Look for outliers (data quality check)
4. Compare to history
After thinking about the type of data you have (univariate, bivariate, multivariate, time series), scatter plot the data. Scatter plots are “the gift of sight” to the analyst and will give a good idea of the relationships and trends present in the data. Next, calculate descriptive statistics for the variable of interest (presumably a particular cost element). For CVs under 15%, the simple average may be a good enough predictor. Leave these elements for now and focus on those with CVs of over 15%. Look for a better predictor than the average by using other cost techniques, including a parametric estimate via regression analysis (see Module 8 Regression Analysis). Then, examine the data for outliers. Only remove outliers if you have a good reason to do so! Finally, compare your descriptive statistics to standard factors, rules of thumb, or other historical data wherever possible.
38
If you are examining data to determine a CER to estimate the cost for electronics based on the number of circuits, which of the following terms best describes your data set?
A. Univariate
B. Bivariate
C. Multivariate
B. Bivariate
For each data point, you have two pieces of information: the cost for electronics and the number of circuits. The number of circuits is the independent variable; the cost is the dependent variable. Therefore, this is a bivariate data set.
39
True or False. Regression is often a useful tool in analyzing time series data.
B. False
Time series rarely demonstrate smooth trends. Regression only picks up smooth trends, and is therefore often not useful in time series analysis. With time series, moving averages may be the more useful analysis.
40
Functions of the form y = a·x^b can be plotted on what type(s) of axes?
A. Unit
B. Semi-log
C. Log-log
D. Choices A and B
E. Choices A and C
F. Choices B and C
G. Choices A, B, and C
E. Choices A and C
In unit space, the power trend would appear non-linear (a power curve). In log-log space, data following a power curve should approximate a straight line, with the slope corresponding to the exponent (b) in the original power equation, and the y-intercept corresponding to the natural logarithm of the coefficient (ln a).
41
Suppose your (univariate) data set has a CV of 40%, and you want to make a judgment within about five percentage points. Approximately how many data points do you need?
A. 4
B. 9
C. 16
D. 25
E. 36
F. 49
G. 64
G. 64
CV/√n = 5%
40%/√n = 5%
√n = 8
n = 64
42
The three histograms plot the same data with different bin sizes. All are shown on the same scale. Which histogram gives the best idea of the distribution?
This is somewhat subjective, though 1 and 3 appear to have bin sizes that are too large or too small. Histogram 2 is the "Goldilocks" graph ("just right").
43
Should Program 7 be removed from the data set as an outlier?
Yes
No
No

Average Cost: 19.97
Standard Deviation of Cost: 3.37
Delta: 7.03
Number of standard deviations Program 7 is from the mean: 2.09

As shown above, Program 7 is about 2.09 standard deviations above the mean cost. In a normal distribution, we would expect about 4.55% of all data points to be outside of two standard deviations from the mean. Based on the information presented, we should not remove Program 7 from our data set.
44
How would you treat Program 13? A. Definitely leave the program in the data set B. Definitely remove the program from the data set C. Further investigate the program
C. Further investigate the program

Average Cost: 18.95
Standard Deviation of Cost: 4.74
Delta: 17.45
Number of standard deviations Program 13 is from the mean: 3.69

Program 13 is more than 3 standard deviations from the mean, and is a possible outlier. This may be a transcription error (based on the other data, the cost may be missing a digit) and must be further investigated to determine whether it is a true outlier.
45
Which element seems to be the best cost driver?
A. Personnel
B. Weight
C. Hours of Operation
A. Personnel
Of the three, personnel has the strongest linear relationship, as indicated visually by the tight clustering of points around the predicted line and the high r-squared value. It also seems reasonable that Supplies costs would be most strongly related to the number of personnel manning a system, more so than the weight or operating hours of the system itself. Repair parts, on the other hand, would be expected to be more strongly related to system characteristics.
45
Which of the following statements about boxplots is correct?
A. A boxplot ignores potential outliers.
B. A boxplot is sometimes called a box-and-fingers plot.
C. A boxplot gives a sense of the data spread.
D. A boxplot is also called a stem-and-leaf plot.
C. A boxplot gives a sense of the data spread.
One benefit of a boxplot is that potential outliers are easily identified. The lines on the boxplot are sometimes called "whiskers," and a boxplot is therefore sometimes called a "box-and-whiskers" plot. A stem-and-leaf plot is a completely different visual display from a boxplot.
46
Which of the following elements could best be estimated using a simple average?
A. Food Cost
B. Supplies Cost
C. Repair Parts Cost
A. Food Cost

              Food    Supplies    Repair Parts
Average       9.66    9.83        10.81
Std Dev       0.88    4.02        2.96
CV            9.1%    40.9%       27.4%

The CV is lowest for food, indicating a small amount of variability. Because of this, the mean is a good estimate and a simple average may be used.
47
The linear function is a good first try when looking at new data for which of the following reason(s):
A. Nearly all relationships are linear
B. The linear function is a good approximation of most other function types
C. Many relationships are linear
D. All of the above
E. Choices A and B
F. Choices B and C
G. Choices A and C
F. Choices B and C
Many relationships are linear. In addition, though it is always best to appropriately identify the proper functional form, if the true relationship is linear and the functional form used is non-linear, the misspecification causes greater errors than if the true relationship is non-linear and is approximated by a linear form.
48
Which of the following statements about transforming a power function into log space are true?
A. The function can be plotted on a log-log graph
B. The slope of the function in log space will be equal to the exponent in unit space
C. The function can be plotted on a semi-log graph
D. Choices A and B
E. Choices B and C
F. Choices A and C
G. All of the above
D. Choices A and B
When a power function is transformed into log space, the exponent of the unit-space equation becomes the slope of the line in log space. Exponential functions, not power functions, are appropriately plotted on a semi-log graph.
49
Which of the following choices is the correct weighted average cost?
A. 17.27
B. 18.98
C. 20.13
D. 22.59
C. 20.13
The weighted average ensures that the average appropriately accounts for the quantities associated with each cost. For this data set, it is a better representation of average cost than the simple average of the cost data would indicate. SUMPRODUCT() is a handy Excel formula to use for weighted averages.

Total cost: 1328.54
Total units: 66
Average unit cost: 20.13
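The same computation outside Excel (a sketch with made-up costs and quantities, not the card's data):

```python
costs = [18.0, 20.5, 22.0]   # hypothetical unit costs per lot
quantities = [30, 20, 16]    # hypothetical units per lot

# Equivalent of SUMPRODUCT(costs, quantities) / SUM(quantities)
weighted_avg = sum(c * q for c, q in zip(costs, quantities)) / sum(quantities)
print(f"{weighted_avg:.2f}")
```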
49
Which of the following choices is the correct Coefficient of Variation (CV) of the data?
A. 10%
B. 14%
C. 20%
D. 26%
C. 20%

Mean = 18.98
Std Dev = 3.82
CV = 3.82 / 18.98 ≈ 20%

The Coefficient of Variation (CV) expresses the standard deviation as a percent of the mean.
50
Which of the following is the best measure of central tendency for a qualitative data set? A. Mean B. Median C. Mode D. A and B E. B and C F. A, B, and C G. None of the above
C. Mode The example of a qualitative or categorical data set given in the module is the color of cows. The mean or median color of a cow is meaningless. The mode - the value that occurs the most often - is a more useful measure in this case. The mode is a visual parameter, not a mathematically useful one.
50
True or False. The graph shown is skew right.
False. The direction of the skew is defined by the "tail" of the data, not the "hump," so this distribution is skew left (or skewed left), meaning the extreme values are more on the left side than the right. This is the case when the mean (red line) is to the left of the median (blue line).
51
True or False. Statistical tests may be performed on univariate data.
True. Statistical tests can be used on univariate data sets. Prime examples are the t-test for the mean, the chi-square test for variance, and the Kolmogorov-Smirnov (K-S) and chi-square tests for distribution.
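For illustration, a one-sample t-test takes only a few lines; the sample values and the hypothesized mean of 20 below are hypothetical:

```python
# One-sample t-test: is this univariate sample consistent with a hypothesized mean?
from scipy import stats

sample = [18.2, 21.5, 19.8, 23.1, 17.6, 20.4, 22.0, 19.1]  # hypothetical costs
t_stat, p_value = stats.ttest_1samp(sample, popmean=20.0)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A large p-value means the data are consistent with a true mean of 20.
```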
52
True or False. You should scatter plot and regress your cost data against anything you think of because more information is always better than less.
False. Just as important as a clear statistical relationship is that your relationships make good engineering sense. It is possible to find statistical relationships in data that are meaningless for predictive use.
53
When comparing the mean of a data set and the median of a data set, which of the following can be true? A. The median and the mean are equal. B. The median is greater than the mean. C. The median is less than the mean. D. You should never compare these measures. E. A and B F. A and C G. A, B and C
G. A, B and C If the distribution of the data is symmetric, the mean and the median will be equal. If, however, the distribution is skewed, the median can be either greater than or less than the mean, depending on the direction of the skewness.
54
True or False. Excel defaults are the best choices for graphical displays of information.
False. Understanding the presentation of the data has a great deal to do with the choices an analyst makes for the display. It is important to consider color, scale, and space when adjusting these displays.
55
This module continues the concepts introduced in Module 4 Data Collection and Normalization and presents a simplified methodology for analyzing data by:
describing data types to be analyzed, providing calculations for performing the analysis, presenting techniques for visually representing the results, and applying various machine learning algorithms for advanced analysis.
56
This module centers around making sense of data, the basis of cost estimates. As shown in figure 6.1, data originates in the past, describes the present, and estimates the future. This module focuses mainly on the present, exploring the core analytical methods of cost estimating that are the subject of Unit III. Some well-known analytical constructs included are:
measures of central tendency such as mean, median, and mode; measures of dispersion including variance, standard deviation, and coefficient of variation (CV); functional forms such as linear, power, exponential, and logarithmic equations; and supervised and unsupervised machine learning algorithms.
57
The equations presented in this module are a key part of a best-fit Cost Estimating Relationship (CER) toolkit and provide a foundation for Module 8 Regression Analysis.
Many of the concepts regarding central tendency and dispersion introduced in this module are explored further in Module 10 Probability and Statistics.
58
Data Assessment Analytical thinking requires an unbiased assessment of data and marks the first step in developing a cost or risk estimate. The result of the assessment provides defensible support for decision-making.
Exploratory Data Analysis is a term used to describe the technique of analyzing and investigating the data set and summarizing the main characteristics.
59
Data Assessment Types of Data There are several types of data sets, each of which determines the correct analytical methods.
Univariate data consists of a single variable, such as cost data for a single element or a set of historical cost growth factors for various programs in a given phase. It can be displayed graphically using histograms, bar graphs (shown in figure 6.1), or boxplots. Descriptive statistics (e.g., mean, median, standard deviation, CV) should be used to find the central tendency and dispersion of the data. Inferential statistics can be used to assess whether or not the data set seems to match a certain mean, variance, or distribution, but this is not often done with univariate data. True univariate data is quite rare in cost estimating; bivariate or multivariate data emerges when data is categorized or when other metadata is added.

Bivariate data has one independent variable and one dependent variable. An example of bivariate data is software development cost as a function of lines of code. It is generally displayed using a scatter plot. In this case, use both descriptive and inferential statistics (e.g., regression, t, and F statistics). Descriptive statistics find the central tendency and dispersion of the dependent variable, and inferential statistics test the relationship between the independent and dependent variables.

Multivariate data has several independent variables and one dependent variable. An example of multivariate data would be the cost of ship supplies as a function of two independent variables: crew size and hours underway. A 3-D plot or numerous pairwise plots of the dependent variable against the various independent variables can be used to display multivariate data.

Time series data differs from univariate, bivariate, and multivariate data and requires a different approach. Time series data is generally a bivariate data set with time as the independent variable. For example, consider cost growth as a function of years since program initiation, or worker productivity measured by quarter. As with any bivariate data set, time series data can be plotted on a standard Cartesian xy-plane with time intervals plotted on the x-axis. Use judgment to plot time scales, ensuring proper distribution of data points. Note that some software functions will force even spacing of the data points, resulting in a poor and misleading distribution. Unlike other data types, smooth trends are rarely observed in time series data. Instead, there are irregularities due to paradigm shifts, cycles, or autocorrelation. Table 6.1 defines several key terms used with time series forecasting. Regression analysis identifies smooth trends. Regression will not detect paradigm shifts and cycles in time series data, but scatter plots and moving averages can help identify them. Dividing the data into subgroups (e.g., 1980-85, 1985-90, 1990-95, 1995-2000, and 2000-05) enables comparison of descriptive statistics. Analysis of Variance (ANOVA) can be used to test for statistically significant differences between these subgroups; ANOVA is a statistical method in which the variation in a set of observations is divided into distinct components, as sketched below.
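A minimal sketch of the subgroup comparison described above, using hypothetical cost growth factors for three five-year subgroups:

```python
# One-way ANOVA across time-based subgroups of a time series.
from scipy import stats

# Hypothetical cost growth factors, grouped by five-year period.
period_1 = [1.10, 1.15, 1.08, 1.12]
period_2 = [1.22, 1.18, 1.25, 1.20]
period_3 = [1.11, 1.09, 1.14, 1.10]

f_stat, p_value = stats.f_oneway(period_1, period_2, period_3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests at least one subgroup mean differs,
# pointing to a possible paradigm shift between periods.
```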
60
Data Assessment Time series terminology Paradigm Shift
Indicates a marked change in the nature of the data occurring at some point or over some period. An example of a paradigm shift would be lower cost growth in programs entering production due to a specific change in acquisition law.
61
Data Assessment Time series terminology Cycles
Cycles are repeating periodic trends that can occur at any interval. They are often found in seasonal data (e.g., higher electricity costs in winter and summer because of increased heating and cooling costs). Maintenance actions such as ship overhauls are another example of cyclical data.
62
Data Assessment Time series terminology Autocorrelation
Autocorrelation is present when the variable (e.g., health, weather) in time t is correlated to the variable in time t − 1, which is in turn correlated to the variable in time t − 2. In other words, the value of the variable in the present is correlated to the value of the variable in the previous time period(s). Autocorrelation occurs due to dependencies within the data, usually when the data is from the same source.
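A quick check for lag-1 autocorrelation is to correlate the series with a copy of itself shifted by one period; the values below are hypothetical:

```python
# Lag-1 autocorrelation: correlate the series against itself shifted by one period.
import numpy as np

series = np.array([10.2, 10.8, 11.1, 11.9, 12.4, 12.1, 12.9, 13.5])  # hypothetical

r_lag1 = np.corrcoef(series[:-1], series[1:])[0, 1]
print(f"Lag-1 autocorrelation: {r_lag1:.2f}")
# Values near +1 mean each observation strongly echoes the previous one.
```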
63
Data Assessment Data Validation Data validation involves examining cost data descriptive statistics, assessing potential outliers, and comparing historical results Descriptive Statistics Descriptive statistics characterize and describe the data by revealing the central tendency of the data as well as the dispersion. They are calculated for each data group, especially for the cost data of the element in question.
Some important descriptive statistics include: sample size (i.e., the number of data points selected for analysis), mean, standard deviation, Coefficient of Variation (CV), and specialized averages.
64
Data Assessment Data Validation Specialized averages are either weighted averages for cost data representative of different quantities or moving averages for time series data. See Module 5 Inflation and Index Numbers and Module 11 Manufacturing Cost Estimating for more information and helpful examples of weighted and moving averages and how they are calculated.
Statistical software programs can calculate a standard group of descriptive statistics for a given data set. It is better, however, to use the formula for each statistic separately, as these individual formulas, unlike the results of a macro, will update automatically if there is a change in the data. Note that a representative sample size is needed to make confident, meaningful, comparative conclusions. Graphical displays such as bar charts and histograms provide a visual way to compare descriptive statistics.
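The core descriptive statistics named above can be computed compactly; the data values are hypothetical, and ddof=1 gives the sample (n − 1) standard deviation:

```python
# Core descriptive statistics for a univariate cost data set.
import numpy as np

costs = np.array([16.4, 18.1, 19.5, 20.2, 17.8, 21.3, 19.0])  # hypothetical

n = costs.size
mean = costs.mean()
std = costs.std(ddof=1)  # sample standard deviation (n - 1 denominator)
cv = std / mean          # Coefficient of Variation, unitless

print(f"n = {n}, mean = {mean:.2f}, std = {std:.2f}, CV = {cv:.1%}")
```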
65
Data Assessment Data Validation Outliers Outliers are data points that fall far from the central mass of the data and may distort both descriptive and inferential statistics. In normally distributed data, 4.6% of the data is more than two standard deviations from the mean, since 95.4% of probability density falls within two standard deviations of the mean (figure 6.2). This amount of variation is expected, and arbitrarily removing these points from the database may falsely tighten the distribution. See Module 10 Probability and Statistics for more detail and for alternative methods for calculating outliers in data that is non-normal.
When the data falls more than three standard deviations from the mean, it should be reviewed to determine if it should be included. Since 99.7% of probability density falls within three standard deviations of the mean for a normal distribution, data outside this range is rare. Note that in addition to this common rule, analysts use a variety of other methods to identify outliers. While the normal distribution is generally a good first approximation, these heuristics should not be used to identify potential outliers in significantly skewed data.

The presence of outliers in the data can bias the results of the regression analysis. Figure 6.3(a) demonstrates data from a database which includes a potential outlier that is 4.24 standard deviations from the mean. Since the likelihood of a point more than three standard deviations away is only about 0.3%, and four standard deviations is even rarer, this point warrants investigation as a possible outlier. Compare the result when the point is removed in figure 6.3(b): the slope of the regression line is steeper, and the R² value is higher. While it is important to be aware of the impact of outliers on the regression analysis, outliers must never be removed without justification and documentation. Analysis of outliers can provide insight into potential data issues. Furthermore, variation may be naturally present, and removing outliers without adequate reason unjustifiably alters the data. In the extreme case, iteratively removing the data points farthest from the mean can result in a data set with no variation at all.

Programs that are restructured or re-baselined midstream may be true outliers. For example, data from selected acquisition reports for the F-14A/D may contain data from the F-14D only. Removing the F-14D data is justified as there are several differences in the D variant. Also, programs that do not match the characteristics of the data set may be removed (e.g., a data point representing a helicopter does not belong in a set of missile data or, less extremely, an air-to-air missile may not belong in a data set of surface-to-air missiles). In general, outliers may be indicative of some of the data issues discussed in Module 4 Data Collection and Normalization and should prompt the analyst to explore the data set further.
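A minimal sketch of the three-standard-deviation screen described above, flagging points for review rather than deleting them; the values are hypothetical:

```python
# Flag points more than 3 sample standard deviations from the mean for review.
import numpy as np

costs = np.array([18.1, 19.4, 17.6, 20.3, 18.8, 19.9, 18.5,
                  20.1, 19.2, 18.7, 19.6, 90.0])  # hypothetical; last point suspect

z_scores = (costs - costs.mean()) / costs.std(ddof=1)

for value, z in zip(costs, z_scores):
    if abs(z) > 3:
        print(f"{value}: {z:.2f} std devs from the mean -- investigate, do not auto-delete")
```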
66
Data Assessment Data Validation Standard Factors
Compare results to historical data whenever possible to validate the data. One example of a source of historical factors available to U.S. Navy analysts is the standard factors handbook available from the Naval Center for Cost Analysis (NCCA). Such factors and heuristics exist for various services, commodities, and organizations. Doing these comparisons will add credibility and validation. If heuristics or standard factors are not available, a Subject Matter Expert (SME) or stakeholder review will often help to eliminate inappropriate methodologies or erroneous results.
67
Central Tendency and Distribution of Data Collected and normalized data provides input to powerful statistical graphics and visualization tools. Estimates of parameters can be improved by analyzing more data. Data collection can be the longest and most tedious step of the cost estimating process. Once collected and normalized, an analyst must consider:
What does it look like? Some useful visual tools for data analysis are scatter plots, histograms, stem and leaf plots, and box plots.
What is the best guess? Measures of central tendency, such as the mean, median, or mode, are useful when describing the best guess or probable outcome of a data set. These measures are single points used to represent the total data set.
How much remains unexplained? Variability, or spread, around the point estimate must be addressed. Standard deviation (the square root of variance) and the CV are common tools for measuring the variability.
How precise is it? Confidence intervals are used to measure the certainty or uncertainty around the point estimate. Confidence intervals are discussed in depth in Module 10 Probability and Statistics.
How can certainty be ensured? Basic statistical tests are introduced in this module. Advanced topics are presented in Module 10 Probability and Statistics.
This analytical framework is repeated in the topics of bivariate and multivariate analysis discussed in Module 8 Regression Analysis.
68
Measuring Central Tendency Three key measures of central tendency include mean, median, and mode.
Mean
The most popular measure of central tendency is the mean. The arithmetic mean (i.e., average) of a data set is the sum of the data values divided by the number of data points. For example, for a sample data set with n values x1, x2, ..., xn, sum the values and divide by the count. For the four values 250, 244, 221, and 280:

X̄ = (250 + 244 + 221 + 280) / 4 = 248.75

Often, the mean is not an actual value in the data set. Instead, it is a model of the data set and includes every value as part of the calculation. Because of this feature, the mean is especially sensitive to the influence of outliers. When outliers exert undue influence on the mean, consider using the median as the measure of central tendency.

Median
The median is the middle data point such that exactly half of the remaining data points are lower than the median and half are higher. When calculating the median, order the data from the lowest to the highest value. If there is an odd number of data points, the median is simply the middle value. For example, the median of the data set {2,5,7,9,25} is 7. If there is an even number of data points, the two middle values are averaged to get the median. The median of the data set {3,6,8,11,13,30} is 9.5 (i.e., the average of 8 and 11). When there are outliers, the median may be more representative of a data set than the mean. For example, the median is the most appropriate measure of central tendency in the case of reporting family income because it is not distorted by income extremes in the population. By contrast, the mean is not representative of the average family because it includes extremes (e.g., the destitute and billionaires). In reporting median income, half of all families earn more than the median income and half earn less.

Mode
The mode is the most frequently occurring point and is the least used of the three measures of central tendency. The mode is often most useful in discrete sets, particularly qualitative or categorical ones, and can be shown graphically using histograms. For example, the mean color of cows is useless, but the mode may be black-and-white piebald. This is a most-frequent, discrete-case example, which makes the mode a good measure to use. A quantitative discrete example would be a roll of a pair of dice, where the sum of seven appears most frequently. Since the mode is used to determine whether or not a value occurs frequently in a data set, if the most frequent occurrence of a data point is infrequent, the predictive value of the mode is low. For distributions, the mode of the distribution is its peak: the value where it attains its greatest probability mass in the discrete case or its greatest probability density in the continuous case.
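All three measures are available in Python's standard library; the data set below is hypothetical, with a repeated value so that a mode exists:

```python
# Mean, median, and mode with Python's standard library.
import statistics

data = [250, 244, 221, 280, 244]  # hypothetical; 244 repeats so a mode exists

print(statistics.mean(data))    # 247.8 -- sum divided by count
print(statistics.median(data))  # 244 -- middle value after sorting
print(statistics.mode(data))    # 244 -- most frequently occurring value
```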
69
Distribution Variance/Standard Deviation The variance is the average squared distance of the data points from the mean; it is a measure of the spread of a distribution. A lower variance indicates less dispersion, or spread (i.e., tighter data).
If the sample contains n data points, the number of data points that can vary independently from the mean is (n − 1), since the nth data point is exactly determined by the mean and the values of the other data points. This concept is called degrees of freedom. The (n − 1) denominator also ensures that the sample variance is an unbiased estimator of the population variance. The standard deviation of a distribution is calculated as the square root of the variance and measures the absolute distance of the data points from their mean:

s = √( Σ (xᵢ − x̄)² / (n − 1) )
70
Distribution Skew If the distribution of data in question is symmetric, the mean and the median will be equal as illustrated by the normal distribution in figure 6.4(b). While it is possible to contrive an asymmetric distribution where the mean and median are equal, asymmetric distributions like the ones shown ordinarily have unequal means and medians. If the mean and median of a distribution are not equal, it is said to be skewed.
If the median is lower than the mean, as illustrated by the lognormal distribution in figure 6.4(a), the distribution is said to be skewed right or skew right, since it stretches out to the right. A median higher than the mean, as illustrated by the beta distribution in figure 6.4(c), indicates that the distribution is skewed left or skew left. Notice that the direction of skew follows the tail of the data and not the hump.
71
Distribution Kurtosis Kurtosis is a statistical measure that describes the shape of a distribution's tails in relation to a normal distribution. Specifically, kurtosis focuses on the extremities of the distribution and tells us whether the tails contain extreme values. In statistical software, kurtosis is often reported relative to the normal distribution by subtracting three from the raw kurtosis value; the result is reported as excess kurtosis. There are three types of kurtosis (figure 6.5).
Mesokurtic: The kurtosis of a normal distribution, often set as a baseline with a kurtosis value of 0. The distribution has a normal tail that follows the shape expected by a standard bell curve. Leptokurtic: Kurtosis higher than 0, demonstrated by fatter tails and a sharper peak. More data is found in the tails, indicating a higher likelihood of extreme values (e.g., a Laplace distribution). Platykurtic: Kurtosis less than 0, demonstrated by thinner tails and a flatter peak. More data is found evenly distributed around the mean, and extreme values are less likely (e.g., the uniform distribution).
72
Distribution Coefficient of Variation The Coefficient of Variation (CV) is a measure of the size of the standard deviation relative to the mean and is expressed as a percentage. This descriptive statistic is unitless and therefore allows for comparison of the variability across distributions.
Low CV indicates less dispersion (i.e., tighter data). In cost estimating, a CV of less than 15% is desired. In practice, a low CV (e.g., 5%) would indicate that the mean of the cost data is a useful description of the data set. If the CV is much higher (e.g., greater than 15%) there could be a cost driver in the data set that causes the cost to vary. This situation should prompt the analyst to develop CERs in order to find the cost driver. If after running CERs the CV is not significantly reduced, the cost driver may have been incorrectly identified. Some data are inherently more noisy than other data, so the lack of noise reduction may be the fault of the data rather than misidentification of the cost driver.
73
Distribution Dispersion Examination of the dispersion present in the data is necessary because two different data sets may have the same mean but very different spreads (standard deviations). This examination is best done graphically using bar charts or histograms.
The data sets shown in figure 6.6 illustrate this point. Both sets have the same mean; however, the data in figure 6.6(a) is more tightly distributed around the mean, while the data on the right shows more dispersion. While both data sets have the same expected value (mean), a much wider range is expected when predicting future values based on the data in figure 6.6(b).
74
Data Visualization Data Visualization is the principle of taking a data set and displaying it in a manner that can be easily understood. This can often involve translating complex quantitative and qualitative information into easy-to-understand graphic or visual representations. Data visualization is fundamentally important throughout the data analysis process, from performing exploratory data analysis in the early stages of assessing your data set to sharing your final analysis results with decision makers.
Visualization tools reveal data distribution, patterns, trends, outliers, and missing data. Graphs can enable understanding; they make apparent what may otherwise be lost in the sea of data.
75
Data Visualization Scatter Plots Two-dimensional scatter plots are a useful graphical analytical tool for the cost estimator. Scatter plots graph two variables along two axes and can illustrate attributes such as central tendency and the dispersion. Scatter plots can:
help to detect trends in the data, possible shifts, or potential outliers;
generate a visual representation of the data using commonly available software programs; and
accommodate trend lines to help link the graph to the regression equation.
When adding a trend line, display the R² value and regression equation. Note that relationships must be tested for statistical significance further in the analysis process.
76
Data Visualization Variables In cost estimating, the cost of the element to be estimated is generally the dependent variable and is graphed on the y-axis. The independent variables are those which drive cost and are graphed on the x-axis. Examples of such variables include technical parameters that describe the system, operational parameters that indicate how the system will be operated, or the cost of another element that may influence or correlate to total cost. When looking for possible cost drivers, plot the cost of the element to be estimated against relevant independent variables in a series of 2-D graphs. When choosing a CER, select relationships that are statistically significant (see Module 8 Regression Analysis) and operationally reasonable. Heat maps (see Heat Maps) are another useful graph to consider when assessing correlation between variables.
Performing analysis with scatter plots is the first step in identifying the most relevant cost drivers. Cost drivers are discussed in depth in Module 3 Parametric Estimating. The three scatter plots shown in figure 6.7 graph cost as a function of three different variables. Cost and Variable 1 have the strongest correlation as shown in figure 6.7(a). Therefore, Variable 1 is a potential cost driver. Cost and Variable 2 shown in figure 6.7(b) are weakly correlated, and cost and Variable 3 in figure 6.7(c) are uncorrelated. Keep in mind that the R² value is merely an indicator of the correlation between two variables. The t- and F-statistics must be checked to determine the statistical significance of the relationship. These statistics are discussed in more detail in Module 8 Regression Analysis.
77
Data Visualization Axes and Function Types There are several different axes that can be used when scatter plotting.
All data can be plotted in unit space, the standard Cartesian xy-plane. Variable x is the independent variable plotted on the horizontal axis, and y is the dependent variable plotted on the vertical axis. All analysis can be done in unit space if the data is linear. Linear data is preferred to enable the use of linear regression to find CERs. However, many relationships are non-linear. If the data is non-linear, it is easy to see in unit space that the data does not approximate a line. It is much more difficult to judge by eye which non-linear form the data follows; the analyst should look at the potentially non-linear data in a transformed space to determine if the data becomes linear. In many cases of non-linear data, a transformation can be applied to make the data linear, allowing the analyst to perform linear regressions on the transformed data. The analyst can then transform the results back to unit space. Module 8 Regression Analysis discusses how to proceed with this transformation. When the correct transformations are applied, most cost driver relationships manifest themselves as a linear trend, as the sketch below illustrates for a power function. Linear functions are therefore the natural starting point when analyzing data for potential CERs.
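A minimal sketch of that transform-fit-transform-back workflow for a power function of the assumed form y = a·x^b; the driver and cost values are hypothetical:

```python
# Fit a power function y = a * x**b by linear regression in log-log space.
import numpy as np

x = np.array([100, 250, 500, 900, 1500])      # hypothetical cost driver (e.g., weight)
y = np.array([12.0, 23.5, 38.2, 58.9, 84.0])  # hypothetical cost

# In log space the model is linear: log(y) = log(a) + b * log(x).
b, log_a = np.polyfit(np.log(x), np.log(y), 1)  # slope b is the unit-space exponent
a = np.exp(log_a)

print(f"CER: y = {a:.3f} * x^{b:.3f}")
print(f"Prediction at x = 700: {a * 700**b:.1f}")  # transform back to unit space
```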
78
Data Visualization Histograms Histograms provide a common way to show the density of univariate data by grouping the data into several bins and plotting the bins on the horizontal axis with the frequency or relative frequency on the vertical axis. That is, the vertical column displays the number of observations or percentage of observations that fall within that bin. Histograms give a good sense of the distribution of the data since they are essentially depictions of empirical PDFs and can be useful in identifying potential outliers
Choose the bins that the data is separated into carefully. The two histograms shown in figure 6.11 use the same data set of univariate cost data for condominium monthly natural gas bills but with different bin sizes. The histogram in figure 6.11(a) allows spreadsheet software to automatically choose the number and size of bins. Almost all of the data ends up in one bin, which does not give a good idea of the distribution. In the histogram in figure 6.11(b), the analyst specified the number and size of bins to be used. The bar labeled 30 indicates the frequency of gas bills that are greater than or equal to $15.00 but less than $30.00. A simple average for the monthly gas bills was estimated to be $34.19. A skew-right distribution and potential outliers become evident, leading the analyst to investigate fitting a triangular, lognormal, or even exponential distribution to the data. The data points on the far right appear to be possible outliers. In this case, however, the lowest bin specified is $15, which obscures the fact that there is only one bill less than $11.15, so it is no longer clear that the distribution has a hump. The bins in figure 6.11(b) hide or distort the data because they create the impression that the distribution is exponential when, in fact, the data show a void under some value. In other words, there are few-to-no near-zero months. If the distribution were truly exponential, lower values around zero would be the most common. Thus, poor choices of histogram bins can hide important information.
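Bin choice can be explored quickly in code; the sketch below bins hypothetical monthly gas bills two ways to show how the apparent shape changes:

```python
# The same data binned two ways: bin choice changes the apparent distribution.
import numpy as np

bills = np.array([18, 22, 25, 27, 29, 31, 33, 35, 38, 42, 47, 55, 63, 78])  # hypothetical

counts_auto, edges_auto = np.histogram(bills, bins='auto')
counts_fixed, edges_fixed = np.histogram(bills, bins=np.arange(15, 91, 15))  # $15-wide bins

print("auto bins: ", edges_auto.round(1), counts_auto)
print("fixed bins:", edges_fixed, counts_fixed)
# Compare the count vectors: overly coarse bins can hide the hump or the tail.
```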
79
Data Visualization Bar Charts Bar charts present categorical data as rectangles whose height or length correspond to their associated value for each category. They are an excellent way to compare descriptive statistics among different groups.
The bar chart in figure 6.12 displays the mean Cost Growth Factor (CGF) of the Development Estimate (DE), computed as a dollar-weighted average, for work performed by several different companies. As the chart title indicates, these data come from Selected Acquisition Reports (SARs) for developmental programs with an Engineering and Manufacturing Development (EMD) phase only. See Module 9 Cost and Schedule Risk Analysis for more information on SAR data. Standard deviations are shown for each group as a dashed yellow line. Note that these are standard deviations, not confidence intervals for the means. To get confidence intervals for the mean, divide the standard deviation by the square root of n. From this graph, it is apparent that Company 1 has the highest mean cost growth. The fact that Company 4 is a marginally better supplier than Company 3 is also apparent from the graph because the two values of mean and standard deviation are combined in a meaningful way. The mean for Company 4 is less than that for Company 3, and there is less variation in Company 4's cost growth. This information is discernible from the bar chart but much harder to determine from just looking at the data. Inferential statistical tests (in this case, t-tests) can be performed (see Module 10 Probability and Statistics), but a good graphical display like figure 6.12 can be effective in convincing the viewer.
80
Data Visualization Box Plots Box plots, also called box and whisker diagrams, are a standardized way of summarizing the distribution of a data set.
Box plots display the five-number summary: minimum, first quartile, median, third quartile, and maximum. In a typical box plot like the one shown in figure 6.13, the rectangle spans the range between the first and third quartile. The median is shown inside the rectangle, and the minimum and maximum values are displayed at the ends of the whiskers. Outliers are also plotted. The box plot in figure 6.14 shows exactly the same data as the gas-bill histogram in figure 6.11. This box plot demonstrates that 25% of the time the data falls between $11 and $12. The median point is $16.50, indicating that 25% of the data is between $12 and $16.50. To the right of the median and contained within the box (from $16.50 to $37.00) is the next 25% of the data set. Finally, 25% of the data is to the right of the majority of the data set. On the far right are potential outliers. Given that this data set is the cost of monthly gas bills, is there an explanation for the potential outliers? Most December, January, and February bills are 4-5 times the cost of gas bills in the summer months; perhaps it was bitter cold during those winter months, or perhaps the owner stayed home all day and therefore consumed more. The important thing is to understand why these points appear to be outliers. Again, appropriate treatment of outliers was discussed in Outliers.
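The five-number summary behind a box plot is easy to compute directly; the bill values below are hypothetical:

```python
# Five-number summary: minimum, Q1, median, Q3, maximum.
import numpy as np

bills = np.array([11, 11.5, 12, 14, 16.5, 20, 28, 37, 45, 52])  # hypothetical

q0, q1, q2, q3, q4 = np.percentile(bills, [0, 25, 50, 75, 100])
print(f"min={q0}, Q1={q1}, median={q2}, Q3={q3}, max={q4}")
# Points beyond 1.5 * (Q3 - Q1) outside the box are typically drawn as potential outliers.
```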
81
Data Visualization Heat Maps Heat maps are an important chart type for cost analysts, most commonly used when analyzing correlation between variables. Heat maps depict values for a main variable of interest across two axis variables using a grid of colored squares. The axis variables can be divided into ranges, similar to a bar chart or histogram, where the color of each cell indicates the value of the main variable in the corresponding cell range.
In the example shown in figure 6.15, monthly precipitation for Seattle is plotted against buckets of rainfall accumulation for every day from 1998 to 2018. The darker shaded cells represent higher counts of total days with those levels of accumulation. For each month, we can quickly see which accumulations occurred more frequently. When used to illustrate correlation, a heat map can be referred to as a correlogram, which replaces each of the variables on the two axes with a list of numeric variables in the data set. Each cell depicts the relationship between the intersecting variables, such as a linear correlation. These are often used during the exploratory stage of data analysis, when assessing the correlation between independent and dependent variables, or assessing whether multicollinearity may exist between independent variables that are candidates for a parametric cost estimating relationship. In figure 6.16, the correlations between three variables for irises are shown. By looking at the chart, for example, you can see that petal width and sepal length, and petal width and petal length, are highly correlated.
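A correlogram starts from a pairwise correlation matrix, which pandas computes in one call; the iris-like values below are hypothetical stand-ins:

```python
# Pairwise linear correlations between candidate variables (the basis of a correlogram).
import pandas as pd

df = pd.DataFrame({  # hypothetical measurements
    "sepal_length": [5.1, 6.3, 5.8, 7.1, 6.5],
    "petal_length": [1.4, 4.7, 4.1, 5.9, 5.2],
    "petal_width":  [0.2, 1.4, 1.3, 2.1, 2.0],
})

print(df.corr())  # values near +/-1 flag strong correlation (and possible multicollinearity)
```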
82
Data Visualization Waterfall Chart A waterfall chart can be used to separate pieces of a stacked bar chart which allows users to focus on one element at a time, show a starting point, increases and decreases, and the resulting ending point.
This can be particularly useful in cost estimating when highlighting changes between two estimates (i.e., a baseline estimate and an update or what-if case). Figure 6.17 shows two examples of waterfall charts. The first chart (figure 6.17(a)) demonstrates the change in program Estimate at Completion (EAC) over time, providing a unique visual to supplement earned value analysis. Beginning on the left, the first column shows the EAC as estimated in September of 2022. Moving right, we can see the changes in the EAC estimate by CLIN from September 2022 to August 2024. In figure 6.17(b), the forecasted change in inventory for parts is shown each year beginning in 2021 and moving through 2032. This chart supplements failure analysis to help estimate the cost of replacement parts through the program life cycle.
83
Machine Learning Algorithms Machine Learning (ML) is a field of artificial intelligence in which algorithms learn from data to make predictions and inform decisions. These algorithms involve the application of statistical, mathematical, and numerical techniques. ML attempts to perform tasks sometimes associated with cognition. Analysts commonly use ML to perform tasks such as recommending products, facial recognition, and autonomous driving. ML is not a replacement for existing methods and applications of critical thought in cost estimation, but it offers an opportunity to enhance data analysis by providing additional tools to address specific use cases in the data analysis process.
It is important to have a basic understanding of machine learning to recognize both the opportunities and limitations associated with utilizing these tools. There are three primary techniques of machine learning which will be introduced: supervised, unsupervised, and reinforcement. A key difference between supervised and unsupervised learning is that supervised learning primarily analyzes labeled data, while unsupervised learning analyzes unlabeled data. A summary of the ML topics discussed in this module is illustrated in figure 6.18.
84
Machine Learning Algorithms Supervised Learning Supervised learning is a method of machine learning which utilizes trained algorithms to classify data or make predictions utilizing data that is properly labeled. When using this technique, there must be a clearly labeled true answer against which the machine can be scored.
Supervised learning uses a data set with input variables and output variables from which it is trained to learn the relationship between the input and output data. Then once this algorithm is given new input data, it will predict an outcome for the unknown output data. Supervised learning algorithms fall into two main categories: classification (predicting categorical values) and regression (predicting numerical values).
85
Machine Learning Algorithms Classification Classification is an ML technique used to predict the label or class of unseen input data based on training data (i.e., actuals). Similar to the existence of many options for performing regression analysis, many different algorithms exist that can perform classification. Five common algorithms will be introduced: Naïve Bayes, decision tree, random forest, k-Nearest Neighbors, and logistic regression.
Consider the dataset in figure 6.19 of spacecraft payloads for a large space program. This table provides notional information on weight, power, payload class, and cost value. The dataset has been randomly generated for ICEAA's purposes to represent a real-life dataset a cost analyst working on space systems might encounter. By plotting the data with weight on the x-axis and power on the y-axis, the difference between classes can be visualized (figure 6.20); Class A/B instruments tend to be heavier and consume more power, while Class C/D instruments are lighter and use less power. To further illustrate the concept, imagine that the dataset now contains 1000 points for each class. Now it is clearly visible in figure 6.21 that the Class A/B instruments have a different mean and standard deviation than the Class C/D instruments in both weight and power (the x and y axes).
86
Machine Learning Algorithms Naive Bayes Naïve Bayes is a statistical model based on joint probability distributions. New predictions (inferences) are given by z-score probabilities via Bayes' theorem:

P(A|B) = P(B|A) × P(A) / P(B)
where the probability of A given B is calculated as the probability of B given A times the probability of A divided by the probability of B. In this example (see table 6.2), we calculate the mean and standard deviation of each group (Class A/B instruments, and Class C/D instruments). For a known weight and power, we can predict the probability of belonging to each class based on the historical distribution of each, using Z-scores and the normal distribution.
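A minimal sketch of that class-conditional scoring, fitting a normal distribution per class and per feature; all means, standard deviations, priors, and the query point below are hypothetical:

```python
# Gaussian Naive Bayes by hand: score a new instrument against each class.
from scipy.stats import norm

# Hypothetical per-class (mean, std) for weight (kg) and power (W), plus priors.
classes = {
    "A/B": {"weight": (2000, 400), "power": (1500, 300), "prior": 0.5},
    "C/D": {"weight": (900, 250),  "power": (600, 200),  "prior": 0.5},
}

new_point = {"weight": 1700, "power": 1200}

scores = {}
for label, p in classes.items():
    # Naive assumption: features are independent given the class, so densities multiply.
    likelihood = (norm.pdf(new_point["weight"], *p["weight"]) *
                  norm.pdf(new_point["power"], *p["power"]))
    scores[label] = likelihood * p["prior"]

total = sum(scores.values())
for label, score in scores.items():
    print(f"P({label} | data) = {score / total:.3f}")
```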
87
Machine Learning Algorithms Decision Trees Decision trees are a common ML algorithm that is utilized in both classification and regression analysis. This method can work well with nonlinear data that is in groups; it creates linear branches by splitting data via yes/no questions. An application of this technique is depicted in figures 6.26 and 6.27.
This decision tree classifier has been fit to the data and generates two splitting points, which can be read as the following statement: if power < 1046, then "C/D"; else if weight < 1482, then "C/D"; else "A/B". In this case, new data points with power < 1046 Watts or weight < 1482 kg would be predicted as Class C/D, and anything at or above both 1046 Watts and 1482 kg would be predicted as Class A/B. This forms a boundary similar to the Naïve Bayes classifier, but with different methodology.
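The fitted tree reduces to two nested if-statements; the sketch below is a direct transcription of the splits quoted above:

```python
# The two-split decision tree from the text, written as plain branching logic.
def classify(power_watts: float, weight_kg: float) -> str:
    if power_watts < 1046:
        return "C/D"        # first split: low power -> Class C/D
    if weight_kg < 1482:
        return "C/D"        # second split: low weight -> Class C/D
    return "A/B"            # high power and high weight -> Class A/B

print(classify(800, 2000))   # C/D (fails the power split)
print(classify(1200, 1600))  # A/B (passes both splits)
```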
88
Machine Learning Algorithms Random Forests Random forests are ensemble models, a machine learning approach that combines multiple individual models
Ensemble models help overcome technical challenges with single estimators and can improve the overall prediction power of the model. Random forest algorithms are a specific type of ensemble model that work by creating multiple decision trees during training and then combine the outputs to make more accurate predictions. Each tree in the forest learns from a random subset of data and features. The resulting aggregated forest has better generalization and accuracy, helping to reduce model overfitting that might occur in an individual decision tree.
89
Machine Learning Algorithms Logistic Regression Logistic regression helps analysts solve binary classification tasks by estimating the probability of a certain class or event occurring.
To do so, the logistic function is applied to a linear combination of input features to transform predictions into probabilities between 0 and 1. For a cost analyst, logistic regression can help make binary cost decisions, classify cost risks, and provide feature importance information via the coefficients assigned to each variable.
90
Machine Learning Algorithms Regression Regression is a supervised learning method that is used to predict numerical outcomes (e.g., cost as a continuous variable).
Several regression methods and techniques such as linear regression, non-linear regression, and multivariate regression are explored in more detail in Module 8 Regression Analysis. In the context of regression analysis in ML, the following algorithm techniques are introduced: linear regression, decision trees, random forests, and kNN. All of these regression techniques can be used to predict a numerical outcome, and there is no single best method to apply. Each type has its own strengths and weaknesses depending on the data being assessed.
91
Machine Learning Algorithms Linear Regression While typically associated with predicting continuous values, univariate and multivariate regression can also play an important role in ML classification problems. For more information about linear regression, see Module 8 Regression Analysis.
Univariate regression for classification examines the relationship between a single predictor variable and an outcome to predict a continuous target. For classification, these continuous predictions map to class labels, generally by setting threshold values. In other words, if a prediction falls below a certain threshold, the algorithm assigns the data point to one class, and if it falls above the threshold, to a different class. Note: Logistic regression is a form of univariate regression that deals with binary classification and models probabilities rather than continuous values. In cost estimating, consider the example visualized in figure 6.33: there is a single significant predictor, project duration, that has been observed to influence project costs. The analyst uses univariate regression on project duration to predict a continuous cost variable and then classifies the predicted cost into different cost categories (low, medium, and high) based on thresholds that reflect historical cost ranges.

Multivariate regression for classification considers multiple predictor variables to estimate an outcome and then maps the continuous outputs to class labels. Techniques like Linear Discriminant Analysis (LDA), which deals specifically with continuous independent variables and a categorical dependent variable to reduce the dimensions of a data set, and polynomial regression, which models the relationship between the dependent variable and independent variables as n-degree polynomials, are special forms of multivariate regression that leverage multiple predictors for separating classes based on feature combinations. In cost estimating, consider the example visualized in figure 6.34: both project size and materials used affect the overall cost of a project, and these predictors can be used in multivariate regression with the outcomes then mapped to different categories of risk for budget overruns (low, medium, high) or into probabilities for likely to exceed budget or not likely to exceed budget (figure 6.35).
92
Machine Learning Algorithms Decision Trees for Regression Like the decision tree algorithm previously introduced for classification, a regression decision tree uses branches to split data into yes or no questions and can work well with nonlinear data in groups.
Rather than predicting a category as when used for classification, regression decision trees predict a numerical value. The example in figure 6.36 illustrates how a decision tree can predict spacecraft cost based on weight (kg) and design life (duration in years).
93
Machine Learning Algorithms Random Forests for Regression Random Forest is another common machine learning technique used for regression which combines multiple decision trees (an ensemble model) by averaging their results.
In this approach, each tree is made using a random subset of variables. In practice, this approach commonly uses 100 or more trees and is generally more accurate than using a single decision tree. However, this approach can be computationally expensive. Figure 6.37 illustrates an example of a random forest prediction.
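A minimal sketch using scikit-learn, assuming that library is available; the feature values and costs are hypothetical:

```python
# Random forest regression: an ensemble of decision trees whose predictions are averaged.
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: [weight_kg, design_life_years] -> cost ($M)
X = [[800, 3], [1200, 5], [1500, 5], [2100, 7], [2600, 10], [3000, 12]]
y = [45.0, 72.0, 88.0, 130.0, 185.0, 220.0]

model = RandomForestRegressor(n_estimators=100, random_state=0)  # 100 trees in the forest
model.fit(X, y)

print(model.predict([[1800, 6]]))  # prediction averaged across all trees
```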
94
Machine Learning Algorithms Unsupervised Learning Unsupervised learning differs from supervised learning in that it is not trying to predict a specific value (category or numeric) but rather attempts to find items that are similar.
Essentially, supervised learning leverages a labeled dataset to recognize patterns within predefined categories, while unsupervised learning analyzes unlabeled data to discover and form categories based on inherent similarities in the data. Examples of this include recommendation engines such as Netflix, Amazon, and Spotify. There is no single correct movie or song to choose for you next; the algorithm finds ones that are similar to those that you have watched or liked in the past. This recommendation engine example utilizes a common algorithm in unsupervised learning: clustering (e.g., k-means). Neural networks are another common unsupervised learning technique.
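A minimal clustering sketch using scikit-learn's k-means, assuming that library is available; the unlabeled project data below is hypothetical:

```python
# k-means clustering: group unlabeled projects by similarity, no labels required.
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: [duration_months, cost_$M]
X = [[12, 4.1], [14, 4.8], [13, 4.4], [36, 21.0], [40, 24.5], [38, 22.7]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignments discovered from the data
print(kmeans.cluster_centers_)  # centroid of each discovered group
```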
95
Machine Learning Algorithms Neural Networks Neural networks are ML algorithms used for complex, continuous predictions. These algorithms are based on computational models inspired by the way the human brain processes information. A model consists of interconnected nodes (neurons) that interact to solve complex problems. A neural network consists of an input layer where the model receives the initial information, hidden layers where computations occur, and an output layer that produces the final result. Neural networks use forward propagation to apply calculations and then backward propagation for learning.
In cost estimating, neural networks can classify projects by learning complex, non-linear relationships between variables and cost-related outcomes. The model can train on historical cost data to learn patterns between multiple features. Suppose an analyst has a dataset of project features: project duration, project size, material and labor costs, and region. Each project has a label of either within budget, over budget, or significantly over budget. A neural network can train on all project features using the known budget labels to learn output probabilities for each budget class based on project characteristics. New projects can be fed into the model, and a probability is returned for each budget class. The model then selects the highest probability for classification. Figure 6.39 shows the neural network architecture diagram for this example.
96
Machine Learning Algorithms Reinforcement Learning Reinforcement learning is an advanced ML approach that responds to an environment in real time, with a specified goal. Examples include an autonomous vehicle with the goal to safely transport individuals from one location to another, or a computer capable of competing against a human in a game of chess with the goal to win. Some differences between reinforcement learning and supervised/unsupervised learning include:
no supervisor to guide the training; no training with a large pre-collected dataset (data is provided dynamically via feedback from the real-world environment with which you are interacting); iterative decision making over a sequence of time-steps where inferences are run repeatedly, navigating through the real-world environment as you go. Reinforcement learning is most often applied to control or decision tasks in systems that interact with the real world. In cost estimating, this technique can make sequential decisions by interacting with an environment and receiving rewards or penalties based on the outcomes of its actions. An analyst can use this process to help optimize the allocation of resources to different projects. Analysts can also use reinforcement learning agents to monitor project changes and dynamically adjust cost estimates to reflect current conditions. The technical aspects of implementing this technique are considered advanced and are beyond the scope of this module.
97
Examples Validating Engineering Judgments Often, estimates are made based solely on engineering judgment, an informal SME assessment of costs based on their intuition of the program, with informal data analysis supporting the proposed cost. Validation of this informal assessment must occur before accepting an assumption that can later become a risk (e.g., expert opinion as in this case). Expert opinion carries the risks of heuristic and cognitive bias including:
availability heuristic, or overestimating events which have greater memory to the SME; conservatism, or insufficiently revising one's belief when presented with new data; Dunning-Kruger effect, where a SME may overestimate their ability; optimism bias, which leads to insufficiently assessing risk or true cost; and planning bias, or underestimating how long a task requires for completion. Without validation, data analysis, cross-checks, and documentation, the opinion holds little value to the customer. The cost analyst must make any assumptions formed from expert opinion explicit so that the estimate becomes reproducible. See more on expert opinion in Module 2 Costing Techniques. For more on this explicit documentation in the form of a Basis Of Estimate (BOE), see Module 14 Contract Pricing.
98
Examples Evaluating Outliers Incorrectly identifying outliers can cause errors in estimates, and mistakes are easy to make with inadequate data analysis. There are many advantages to applying multiple techniques which are not mutually exclusive. This example reinforces the importance of collecting, understanding, and analyzing the correct data and using multiple analytical techniques.
In this example, the data for four ships shown in figure 6.42 are analyzed for hours per ton required for each ship's shakedown (i.e., sea trial) where each ship's name corresponds to its hull tonnage (e.g., the tonnage for the hull of ship DD 963 is 963 tons). The program requires an estimate for the 5th ship, DD 967. From the historical data, the technical expert thinks that the factor 0.29 for DD 963 is too low for a first ship and discards this point as an outlier. Apply an objective method by scatter plotting the data to validate this assumption. Test the technical expert assumption by analyzing the plot in figure 6.43. Including DD 963, it is clear that all points except DD 980 fall along a straight line. Without DD 963, the trend line has such a steep slope that the resulting hours/ton from the curve is unrealistic at the 5th ship DD 967. It appears that the expert may have rejected the wrong outlier. The cost estimator should determine whether the expert may have correctly rejected the data point for DD 963 due to factors not known to the cost estimator as well as investigate the data point for DD 980. This process provides credibility to the analysis by testing the assumptions provided by the expert for validity. Outliers should not be removed unless they are truly unrepresentative of the data.
99
Summary
Good data analysis begins with good data. A successful analyst has knowledge of the data, understands the processes and techniques of analysis, and possesses the ability to think critically and objectively about results. Graphics are the best presentation form of data analysis and include scatter plots, histograms, and bar charts. Understanding the types of data available and data characteristics enable the analyst to properly select the right analytical methods. Finally, ML algorithms provide additional layers of data analysis and innovative ways of classifying and grouping project data.
100
Outlier Identification Rules
Chauvenet's Criterion assumes a normal distribution and multiplies the tail probability by the number of data points, yielding the expected number of such extreme points. If fewer than half a point is expected, or equivalently if the tail probability is less than 1/(2n), then the value may be considered an outlier. The formula may be tailored by replacing the phi function (the standard normal cdf) with the cumulative distribution function (cdf) of the desired distribution.

Grubbs' Test assumes a normal distribution and calculates a test statistic, G, as the maximum absolute deviation from the sample mean divided by the sample standard deviation (a Z-score of sorts). If that test statistic exceeds the critical value, which is computed from a t-distribution percentile, then the null hypothesis of no outliers is rejected.

Dixon's Q Test uses a test statistic equal to the gap (i.e., the distance between the prospective outlier and its nearest neighbor) divided by the range of the data set. A table of critical values must be used to evaluate the test.

The IQR-based approach is similar to that used in the boxplots presented in Box Plots. Outliers are identified as those points more than a certain multiple of the interquartile range (IQR) outside that range (i.e., below the first quartile or above the third quartile), as sketched below.
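The IQR rule is the easiest of the four to sketch in code, here using the common 1.5 multiplier; the data values below are hypothetical:

```python
# IQR-based outlier rule: flag points beyond 1.5 * IQR outside the quartiles.
import numpy as np

data = np.array([11, 12, 14, 16, 17, 19, 21, 24, 62])  # hypothetical; 62 is suspect

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < low) | (data > high)]
print(f"Fences: [{low}, {high}]  Potential outliers: {outliers}")
```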