Module 6 - Basic Data Analysis Principles Flashcards

(96 cards)

1
Q

List the core analytical methods of cost estimating discussed in this module.

A
  • Measures of central tendency (mean, median, mode)
  • Measures of dispersion (variance, standard deviation, CV)
  • Functional forms (linear, power, exponential, logarithmic)
  • Machine learning algorithms (supervised and unsupervised)

These methods are essential for understanding and analyzing cost data.

2
Q

What does Exploratory Data Analysis involve?

A

Analyzing and investigating the data set to summarize its main characteristics

It marks the first step in developing a cost or risk estimate.

3
Q

Define univariate data.

A

Data consisting of a single variable

Examples include cost data for a single element or historical cost growth factors.

4
Q

What is the difference between bivariate and multivariate data?

A

Bivariate data has one independent and one dependent variable; multivariate data has several independent variables and one dependent variable

Examples include software development cost (bivariate) and cost of ship supplies based on crew size and hours underway (multivariate).

5
Q

What is time series data?

A

A bivariate data set with time as the independent variable

It requires different analytical approaches compared to univariate or multivariate data.

6
Q

What is the purpose of regression analysis?

A

Identifying smooth trends in data

It does not detect paradigm shifts or cycles in time series data.

7
Q

What are outliers?

A

Data points that fall far from the central mass of the data

They can distort both descriptive and inferential statistics.

8
Q

True or false: Outliers should always be removed from data sets.

A

FALSE

Outliers must never be removed without justification and documentation.

9
Q

What is the mean?

A

The arithmetic average of a data set

It is sensitive to outliers and may not represent the data set accurately.

10
Q

How is the median calculated?

A

Order data from lowest to highest and find the middle value

If there is an even number of data points, average the two middle values.

11
Q

What is the mode?

A

The most frequently occurring value in a data set

It is least used among the measures of central tendency.
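
The three measures of central tendency from cards 9 to 11 can be checked with a short Python sketch. The cost figures are hypothetical and chosen so that a single outlier pulls the mean away from the median.

```python
import statistics

# Hypothetical unit costs in $K; the 95 is a deliberate outlier.
costs = [12, 15, 11, 15, 14, 95]

mean_cost = statistics.mean(costs)      # arithmetic average, pulled up by the outlier
median_cost = statistics.median(costs)  # even count, so the two middle values are averaged
mode_cost = statistics.mode(costs)      # most frequently occurring value

print(mean_cost, median_cost, mode_cost)
```

With these numbers the mean is 27 while the median is 14.5 and the mode is 15, which illustrates why the mean alone may not represent a data set that contains outliers.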

12
Q

What is the Coefficient of Variation (CV)?

A

A measure of dispersion that indicates the ratio of the standard deviation to the mean

It is useful for comparing the degree of variation between different data sets.

13
Q

What is the significance of standard deviation?

A

It measures the amount of variation or dispersion in a set of values

It is crucial for understanding the spread of data points around the mean.

14
Q

What does data validation involve?

A

Examining descriptive statistics, assessing outliers, and comparing historical results

It ensures the credibility of the data used in analysis.

15
Q

What is a paradigm shift in data analysis?

A

A marked change in the nature of the data occurring at some point or over some period

An example is lower cost growth in programs following changes in acquisition law.

16
Q

What are cycles in data analysis?

A

Repeating periodic trends often found in seasonal data

Examples include higher electricity costs in winter and summer.

17
Q

What is autocorrelation?

A

When a variable is correlated with its own past values

It indicates dependencies within the data.
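
A lag-based autocorrelation can be sketched in a few lines of Python. The estimator below (sum of lagged products divided by the total sum of squares) is one common convention and is an assumption here, since the card does not give a formula. Both series are made up for illustration.

```python
def autocorrelation(series, lag=1):
    """Correlation of a series with itself shifted by `lag` periods."""
    n = len(series)
    mean = sum(series) / n
    total_ss = sum((x - mean) ** 2 for x in series)
    lagged = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return lagged / total_ss

trend = [1, 2, 3, 4, 5, 6, 7, 8]              # each value follows its predecessor
alternating = [1, -1, 1, -1, 1, -1, 1, -1]    # each value flips sign

r_trend = autocorrelation(trend)
r_alternating = autocorrelation(alternating)
```

A steadily trending series yields a positive lag-1 value (0.625 here), while a series that flips every period yields a negative one (-0.875 here).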

18
Q

True or false: The mode is the most commonly used measure of central tendency.

A

FALSE

The mode is the least used of the three measures of central tendency.

19
Q

What does a lower variance indicate?

A

Less dispersion or spread of data

A lower variance means the data points are closer to the mean.

20
Q

What is the formula for sample variance?

A

s² = ∑(xi - x̄)² / (n - 1)

Dividing by n - 1 (Bessel's correction) makes the sample variance an unbiased estimator of the population variance.

21
Q

What is the standard deviation?

A

The square root of the variance

It expresses the typical distance of data points from their mean, in the same units as the data.
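
The sample variance formula from card 20 and the square-root relationship above can be verified by hand. This is a minimal sketch with hypothetical values, cross-checked against Python's `statistics` module.

```python
import math
import statistics

costs = [10, 12, 14, 16, 18]  # hypothetical observations

n = len(costs)
x_bar = sum(costs) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in costs) / (n - 1)

# Standard deviation: square root of the variance.
s = math.sqrt(s2)

print(s2, s)
```

For this data the squared deviations sum to 40, so the sample variance is 40 / 4 = 10 and the standard deviation is about 3.16. `statistics.variance` and `statistics.stdev` return the same values.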

22
Q

What does it mean if a distribution is skewed right?

A

The median is lower than the mean

This indicates that the distribution stretches out to the right.

23
Q

What is kurtosis?

A

A measure that describes the shape of a distribution’s tails

It indicates whether the tail contains extreme values.

24
Q

What does a low Coefficient of Variation (CV) indicate?

A

Less dispersion in the data

A CV of less than 15% is desired in cost estimating.
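
The CV and the 15% guideline can be illustrated with a short sketch. The labor-hour figures and the helper name `coefficient_of_variation` are hypothetical, and the sample standard deviation is assumed.

```python
import statistics

def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean."""
    return statistics.stdev(values) / statistics.mean(values)

labor_hours = [980, 1000, 1020, 1010, 990]  # hypothetical data

cv = coefficient_of_variation(labor_hours)
meets_guideline = cv < 0.15  # the 15% threshold mentioned on the card

print(f"CV = {cv:.1%}, below 15%: {meets_guideline}")
```

Because the CV is unitless, the same function can compare dispersion across data sets measured in different units or at different scales.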

25
What is **data visualization**?
The principle of displaying data sets in an easily understood manner ## Footnote It helps reveal data distribution, patterns, trends, and outliers.
26
What is the purpose of a **scatter plot**?
To graph two variables along two axes ## Footnote Scatter plots help detect trends, shifts, or potential outliers in data.
27
In cost estimating, the **dependent variable** is generally graphed on which axis?
y-axis ## Footnote The independent variables that drive cost are graphed on the x-axis.
28
What does the **R² value** indicate in regression analysis?
The strength of the relationship between two variables, as the square of the correlation coefficient in simple linear regression ## Footnote It represents the proportional reduction in variability when predicting the dependent variable.
29
What is the formula for **Pearson's median skewness**?
3 * (Mean - Median) / Standard Deviation ## Footnote This expresses the gap between the mean and the median in standard deviations, scaled by a factor of three.
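A minimal Python sketch of this formula follows. The growth factors are hypothetical, and using the sample standard deviation is an assumption, since the card does not say which version applies.

```python
import statistics

def pearson_median_skewness(values):
    """3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

# Hypothetical cost growth factors with one large value stretching the right tail.
growth = [1.0, 1.1, 1.1, 1.2, 1.3, 2.5]
skew = pearson_median_skewness(growth)

# A symmetric data set has mean == median, so the skewness is zero.
symmetric_skew = pearson_median_skewness([1, 2, 3, 4, 5])
```

A positive result indicates a right-skewed distribution, where the mean exceeds the median.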
30
What does the **R² value** indicate in cost estimation?
It indicates the proportion of variance explained by the model ## Footnote Stakeholders may expect a high R² value, but cost estimation typically does not yield such high values.
31
In scatter plotting, what is plotted on the **horizontal axis**?
Independent variable (x) ## Footnote The dependent variable (y) is plotted on the vertical axis.
32
True or false: **Linear data** is preferred for using linear regression to find CERs.
TRUE ## Footnote Linear data allows for straightforward analysis and regression.
33
What transformation can be applied to **non-linear data** to analyze it?
Transform the data to make it linear ## Footnote This allows the analyst to perform linear regressions on the transformed data.
34
What is the basic form of a **linear function** equation?
y = a + bx ## Footnote Here, 'a' is the y-intercept and 'b' is the slope.
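A least-squares fit of y = a + bx, together with R² computed as the proportion of variance explained, could be sketched as follows. The function names and data points are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

def r_squared(xs, ys, a, b):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical cost driver (x) and cost (y) observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
```

Data that lie exactly on a line give R² = 1. The slightly noisy points above give a value just below 1.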
35
What type of functions are commonly found in **cost analysis**?
* Linear functions * Power functions * Exponential functions ## Footnote Each type has specific characteristics and applications in cost estimation.
36
What does the **power function** equation look like?
y = ax^b ## Footnote This equation can be transformed to show a linear relationship in log space.
37
What is the result of transforming the **power function** equation?
ln(y) = ln(a) + b ln(x) ## Footnote In this transformation, 'b' is the slope and 'ln(a)' is the y-intercept.
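The log-space transformation can be tried out directly. This sketch generates points from a known power function (a = 3 and b = 0.8, both arbitrary choices), fits a straight line to ln(x) and ln(y), and recovers the original parameters. The helper name `fit_power` is hypothetical.

```python
import math

def fit_power(xs, ys):
    """Fit y = a * x**b by linear regression on ln(x) and ln(y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    # Slope in log space is the exponent b.
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    # Intercept in log space is ln(a); exponentiate to return to unit space.
    ln_a = my - b * mx
    return math.exp(ln_a), b

xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 0.8 for x in xs]   # exact power-law data

a, b = fit_power(xs, ys)
```

Both x and y must be strictly positive for the logarithms to exist.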
38
What is the basic form of an **exponential function** equation?
y = ae^(bx) ## Footnote This can also be expressed as y = ak^x, where k = e^b.
39
What happens when **b** in the exponential function is greater than zero?
The function is exponentially increasing ## Footnote Conversely, if b is less than zero, the function is exponentially decreasing.
40
What is the purpose of a **histogram**?
To show the density of univariate data ## Footnote Histograms group data into bins and display the frequency or relative frequency.
41
What is a key consideration when choosing **bins** for a histogram?
The number and size of bins ## Footnote Poor choices can obscure important information about the data distribution.
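The effect of bin size can be seen by counting the same data with two bin widths. This is a sketch with hypothetical costs; the helper `bin_counts` is illustrative and assumes bins of the form [k*w, (k+1)*w).

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count values per bin, keyed by each bin's left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

costs = [12, 15, 18, 22, 25, 27, 41, 95]  # hypothetical costs in $K

fine = bin_counts(costs, 10)    # narrow bins show two clusters and a far outlier
coarse = bin_counts(costs, 50)  # wide bins lump almost everything together

print(fine)
print(coarse)
```

With a width of 10 the counts are {10: 3, 20: 3, 40: 1, 90: 1}. With a width of 50 they collapse to {0: 7, 50: 1}, which hides the structure in the lower values.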
42
What do **bar charts** represent?
Categorical data as rectangles ## Footnote The height or length of each rectangle corresponds to the value for each category.
43
What is the mean **Cost Growth Factor (CGF)** computed as?
A dollar-weighted average ## Footnote This is used for comparing descriptive statistics among different groups.
44
What does **Cost Growth Factor (CGF)** represent?
A dollar-weighted average for work performed by different companies ## Footnote Computed from Selected Acquisition Reports (SARs) for developmental programs with an Engineering and Manufacturing Development (EMD) phase.
45
What are the **five number summary** components displayed in a box plot?
* Minimum * First quartile * Median * Third quartile * Maximum ## Footnote Box plots summarize the distribution of a data set and also plot outliers.
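The five number summary can be computed with Python's `statistics.quantiles`. Quartile conventions differ between tools; the `inclusive` method used here is an assumption, and the data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(scores)
print(summary)
```

For the values 1 through 9 this gives a minimum of 1, quartiles of 3, 5, and 7, and a maximum of 9.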
46
True or false: **Box plots** and histograms can represent the same data.
TRUE ## Footnote Box plots can demonstrate the same data distribution as histograms, providing a different visual perspective.
47
What is the primary use of **heat maps** in cost analysis?
To analyze correlation between variables ## Footnote Heat maps depict values for a main variable across two axis variables using a grid of colored squares.
48
What is a **waterfall chart** used for?
To separate pieces of a stacked bar chart and show changes over time ## Footnote Useful in cost estimating to highlight changes between estimates.
49
Define **Machine Learning (ML)**.
A field of artificial intelligence where algorithms learn from data to make predictions ## Footnote ML involves statistical, mathematical, and numerical techniques.
50
What are the **three primary techniques** of machine learning?
* Supervised learning * Unsupervised learning * Reinforcement learning ## Footnote These techniques differ in how they analyze data.
51
What is the difference between **supervised** and **unsupervised learning**?
Supervised learning analyzes labeled data; unsupervised learning analyzes unlabeled data ## Footnote This distinction is crucial for understanding how each method operates.
52
What are the two main categories of **supervised learning**?
* Classification * Regression ## Footnote Classification predicts categorical values, while regression predicts numerical values.
53
Name **five common algorithms** used in classification.
* Naïve Bayes * Decision tree * Random forest * k-Nearest Neighbors * Logistic regression ## Footnote These algorithms are used to predict the label or class of unseen input data.
54
What does the **Naïve Bayes** classifier rely on?
Joint probability distributions ## Footnote It uses Bayes' theorem to calculate probabilities for predictions.
55
In the context of **Naïve Bayes**, what does the formula P(A|B) represent?
The probability of A given B ## Footnote Calculated as the probability of B given A times the probability of A divided by the probability of B.
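The calculation described in the footnote fits in a one-line function. The probabilities below are invented purely for illustration (A = "program overruns", B = "requirements changed").

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical inputs:
p_a = 0.3          # P(A): prior probability of an overrun
p_b_given_a = 0.8  # P(B|A): requirements changed, given an overrun
p_b = 0.4          # P(B): requirements changed, overall

p_a_given_b = posterior(p_b_given_a, p_a, p_b)
```

Here P(A|B) works out to 0.8 * 0.3 / 0.4 = 0.6, so observing B doubles the probability of A relative to the prior of 0.3.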
56
What is the significance of **outliers** in data analysis?
They can indicate unusual data points that may require special treatment ## Footnote Understanding outliers is crucial for accurate data interpretation.
57
What is the purpose of **inferential statistical tests**?
To make inferences about populations based on sample data ## Footnote These tests help in understanding the relationships and differences within data.
58
What is the **Naïve Bayes classifier** used for?
Classification ## Footnote It computes probabilities for data points based on their features.
59
In the Naïve Bayes classifier, what is the probability of a new point being class A/B if it is 90.83% likely?
90.83% ## Footnote This indicates a strong likelihood of the point belonging to class A/B.
60
What is the **decision tree** method used for in machine learning?
Classification and regression analysis ## Footnote It creates branches based on yes/no questions to split data.
61
True or false: Decision trees can only handle linear data.
FALSE ## Footnote Decision trees work well with nonlinear data grouped in various ways.
62
What does the **depth** parameter in decision trees control?
Maximum number of branch layers ## Footnote Adjusting depth can help prevent underfitting or overfitting.
63
What are **random forests** in machine learning?
Ensemble models that combine multiple decision trees ## Footnote They improve prediction accuracy and reduce overfitting.
64
List some applications of **random forests** for cost analysts.
* Predict future cost based on historical data * Assess risk via probability scores * Rank importance of variables * Detect anomalies * Provide what-if analysis ## Footnote Random forests enhance decision-making in cost analysis.
65
What does the **k** in k-Nearest Neighbors (kNN) represent?
Number of closest neighbors considered ## Footnote This tuning parameter influences the classification outcome.
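How k influences the outcome can be shown with a small classifier. This is a sketch assuming Euclidean distance and a simple majority vote, neither of which the card specifies. The training points and labels are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (feature_1, feature_2) points with a risk label.
train = [
    ((1, 1), "low"),
    ((2, 1), "low"),
    ((1, 2), "low"),
    ((3, 3), "high"),
    ((6, 6), "high"),
]
query = (2.6, 2.6)

label_k1 = knn_classify(train, query, k=1)  # only the single closest point votes
label_k3 = knn_classify(train, query, k=3)  # the three closest points vote
```

With k = 1 the query takes the label of its closest point ("high"). With k = 3 two "low" neighbors outvote it, so the prediction becomes "low".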
66
How does **logistic regression** help analysts?
Estimates probability of binary outcomes ## Footnote It is useful for making binary cost decisions and assessing risks.
67
What is the equation for predicting probability in logistic regression?
P = 1 / (1 + e^-(β0 + β1X1 + β2X2 + ... + βnXn)) ## Footnote This equation transforms linear combinations of input features into probabilities.
68
What is the purpose of **regression** in machine learning?
Predict numerical outcomes ## Footnote It can be used for continuous variables like cost.
69
What is the difference between **univariate** and **multivariate regression**?
* Univariate: Single predictor variable * Multivariate: Multiple predictor variables ## Footnote Both can be used for classification by mapping continuous outputs to class labels.
70
What does **feature importance** indicate in logistic regression?
The significance of variables in predicting outcomes ## Footnote It is determined by the coefficients assigned to each variable.
71
What is the role of **tunable parameters** in machine learning?
Modify model complexity to fit data better ## Footnote Examples include depth in decision trees and k in kNN.
72
What does **multivariate regression for classification** consider?
* Multiple predictor variables * Estimating an outcome * Mapping continuous outputs to class labels ## Footnote Techniques include Linear Discriminant Analysis (LDA) and polynomial regression.
73
What is the role of the **Legislative Branch** in the U.S. government?
* Making laws * Consists of the House of Representatives and the Senate ## Footnote Together, they form the U.S. Congress.
74
In **cost estimating**, what two factors affect the overall cost of a project?
* Project size * Materials used ## Footnote These predictors can be used in multivariate regression.
75
What is the purpose of a **regression decision tree**?
* Predicting numerical values * Splitting data into yes or no questions ## Footnote It works well with nonlinear data in groups.
76
How does a **Random Forest** improve regression predictions?
* Combines multiple decision trees * Averages their results ## Footnote Typically uses 100 or more trees for accuracy.
77
What does the **kNN algorithm** predict in regression tasks?
* Continuous values * Based on average or weighted average of nearest neighbors ## Footnote It can also classify projects into categories based on thresholds.
78
True or false: **Unsupervised learning** tries to predict a specific value.
FALSE ## Footnote It analyzes unlabeled data to discover inherent similarities.
79
What is the main goal of **K-means clustering**?
* Finding similar data * Creating clusters from a dataset ## Footnote It minimizes the within-cluster variance, that is, the squared distance between each observation and its cluster centroid.
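The assign-then-update loop behind k-means can be sketched for one-dimensional data. This simplified version takes fixed starting centroids as an input (an assumption, since real implementations usually pick them randomly) and uses hypothetical values.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(v - centroids[i])
            )
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

values = [10, 12, 11, 50, 52, 49]  # two obvious groups
centroids, clusters = k_means_1d(values, centroids=[0, 100])
```

Starting from centroids at 0 and 100, the loop settles on the groups [10, 12, 11] and [50, 52, 49], with centroids at their means.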
80
What are the components of a **neural network**?
* Input layer * Hidden layers * Output layer ## Footnote It uses forward and backward propagation for calculations and learning.
81
What is the goal of **reinforcement learning**?
* Responding to an environment in real time * Achieving a specified goal ## Footnote Examples include autonomous vehicles and game-playing computers.
82
What is a risk associated with relying solely on **expert opinion** in cost estimating?
* Heuristic and cognitive bias ## Footnote Examples include the availability heuristic and optimism bias.
83
What is the purpose of **logistic regression** in cost analysis?
* Predicting project cost risk * Assessing whether a project is high risk (1) or low risk (0) ## Footnote It uses historical data to build a model.
84
What does a positive coefficient in a **logistic regression model** indicate?
* Higher likelihood of the outcome being 1 ## Footnote Negative coefficients decrease the likelihood of the outcome being 1.
85
Fill in the blank: **Neural networks** are based on computational models inspired by the way the human brain processes _______.
information ## Footnote They consist of interconnected nodes (neurons) that solve complex problems.
86
What does the **magnitude and sign of the coefficients** in a trained model indicate?
* Strength of feature's influence * Direction of feature's influence ## Footnote Positive coefficients increase the likelihood of outcomes being 1, while negative coefficients decrease it.
87
What is the **logistic regression model equation** used to predict the probability of a new project being high risk?
P = 1 / (1 + e^-(β₀ + β₁X₁ + β₂X₂ + ... + βnXn)) ## Footnote Where β₀ is the intercept, βi are the coefficients, and Xi are the feature values.
88
In the context of the logistic regression model, what does **β₀** represent?
Intercept ## Footnote It is the constant term in the logistic regression equation.
89
What is the significance of a **complexity score** in relation to cost risk?
Most significant positive influence on cost risk ## Footnote A higher complexity score increases the likelihood of high cost risk.
90
How does **team experience** influence cost risk according to the model?
Significant negative influence ## Footnote Experienced teams reduce the likelihood of high cost risk.
91
Calculate the **log-odds** for a new project with the following parameters: project size of $4M, duration of 10 months, team experience of 6 years, complexity of 6, and market volatility index of 65.
Log-odds = -8.0 + (0.9*4) + (0.5*10) + (-0.8*6) + (0.2*9) + (1.5*6) + (0.05*65) ## Footnote The calculation results in log-odds = 9.85.
92
What is the predicted probability **P** that the project is high risk if the log-odds is 9.85?
P = 0.99995 ## Footnote This indicates a very high likelihood that the project is high risk.
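The arithmetic in cards 91 and 92 can be reproduced with a few lines of Python. The coefficients and feature values are copied from the log-odds expression on card 91, including the 0.2*9 term, whose feature is not named in the question. Evaluating that expression term by term gives a log-odds of 9.85, which is the value consistent with P = 0.99995.

```python
import math

# Intercept and (coefficient, feature value) pairs from the card's expression.
intercept = -8.0
terms = [(0.9, 4), (0.5, 10), (-0.8, 6), (0.2, 9), (1.5, 6), (0.05, 65)]

# Linear combination of the features gives the log-odds.
log_odds = intercept + sum(beta * x for beta, x in terms)

# The logistic function maps log-odds to a probability between 0 and 1.
probability = 1 / (1 + math.exp(-log_odds))

print(round(log_odds, 2), round(probability, 5))
```

For comparison, a log-odds of 0.85 would map to a probability of only about 0.70.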
93
True or false: Good data analysis begins with **good data**.
TRUE ## Footnote A successful analyst must understand the data and the processes of analysis.
94
What are the **best presentation forms** of data analysis?
* Scatter plots * Histograms * Bar charts ## Footnote These graphics help in effectively communicating data analysis results.
95
Understanding the **types of data available** helps analysts to:
Properly select the right analytical methods ## Footnote This understanding is crucial for effective data analysis.
96
What do **ML algorithms** provide in the context of data analysis?
Additional layers of data analysis ## Footnote They offer innovative ways of classifying and grouping project data.