Bloomberg 1st Technical Flashcards

(282 cards)

1
Q

What is a hash map (dictionary) and why is it useful?

A

A hash map stores key-value pairs for fast lookup, insertion, and deletion with an average O(1) time.
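A minimal sketch of these O(1) operations using Python's built-in dict (tickers and prices are made up for illustration):

```python
# dict is Python's built-in hash map.
prices = {}
prices["AAPL"] = 189.5       # average O(1) insertion
prices["MSFT"] = 410.2
print(prices.get("AAPL"))    # average O(1) lookup
print(prices.get("GOOG"))    # missing key: get() returns None
del prices["MSFT"]           # average O(1) deletion
```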

2
Q

What is the difference between an array and a linked list?

A

Arrays store elements contiguously, giving O(1) index access but costly middle insertions; linked lists store nodes connected by pointers, giving O(1) insertion/deletion at a known node but O(n) access. Arrays are typically fixed-size; linked lists grow dynamically.

3
Q

What is the difference between a for loop and a while loop?

A

for loops iterate a fixed number of times or over a sequence; while loops continue until a condition is false

4
Q

What’s the time complexity for searching for an element in an unsorted array?

A

O(n), since every element may need to be checked

5
Q

What’s the difference between mutable and immutable data types in Python?

A

mutable types (e.g., lists, dicts) can be changed in place; immutable types (e.g., strings, tuples) cannot

6
Q

What is sampling?

A

selecting a subset of data from a population to make inferences about the whole population

7
Q

What is the central limit theorem (CLT)?

A

The CLT states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of population distribution

8
Q

What is hypothesis testing?

A

a method to test a claim about a population using sample data. Involves a null hypothesis (H_0), alternate hypothesis (H_1), and p-value.

9
Q

What is an A/B test?

A

A controlled experiment comparing two versions (A and B) to determine which performs better on a specific metric.

10
Q

What are Type I and Type II errors?

A

Type I (false positive): reject true H_0
Type II (false negative): fail to reject false H_0

11
Q

What is a p-value?

A

The probability of observing data as extreme as, or more extreme than, your sample assuming H_0 is true.

12
Q

What is the difference between descriptive and inferential statistics?

A

Descriptive summarizes data (mean, median, std). Inferential draws conclusions from the data (hypothesis tests, confidence intervals).

13
Q

What are overfitting and underfitting in ML?

A

Overfitting: model fits training data too closely, leading to poor generalization.
Underfitting: model too simple to capture patterns.

14
Q

What is bias-variance tradeoff?

A

Increasing model complexity reduces bias but increases variance; goal is to find balance for best generalization.

15
Q

What are measures of central tendency and variability?

A

Central tendency: mean, median, mode
Variability: range, variance, standard deviation

16
Q

What is data normalization?

A

Process of organizing data to reduce redundancy and improve integrity, often by splitting data into related tables with foreign keys.

17
Q

What are the normal forms in database normalization?

A

1NF - atomic columns
2NF - no partial dependency
3NF - no transitive dependency

18
Q

What is a primary key vs. a foreign key?

A

A primary key uniquely identifies a row. Foreign key references a primary key in another table to link related data.

19
Q

What is a taxonomy in data modeling?

A

A hierarchical classification system (e.g., product categories -> subcategories -> items)

20
Q

What does “combining datasets” mean?

A

Merging or joining data from multiple sources to create a unified dataset. Common operations: inner, outer, left, right joins.

21
Q

What is data optimization?

A

Process of improving data storage, retrieval, and query performance using indexing, denormalization, caching or partitioning.

22
Q

How can you iterate over both index and value in Python?

A

Use enumerate(), e.g., for i, val in enumerate(my_list):
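A minimal runnable sketch of enumerate() on a made-up list:

```python
# enumerate() yields (index, value) pairs.
my_list = ["a", "b", "c"]
pairs = [(i, val) for i, val in enumerate(my_list)]
print(pairs)  # [(0, 'a'), (1, 'b'), (2, 'c')]
```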

23
Q

How can you count occurrences efficiently in Python?

A

collections.Counter() - returns a dict-like object mapping elements to counts
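A short sketch using Counter on a made-up trade log:

```python
from collections import Counter

# Count ticker occurrences in one pass.
trades = ["AAPL", "MSFT", "AAPL", "GOOG", "AAPL"]
counts = Counter(trades)
print(counts["AAPL"])         # frequency of one element
print(counts.most_common(1))  # most frequent element with its count
```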

24
Q

What is the difference between pass, continue, and break in loops?

A

pass: does nothing, placeholder
continue: skip to next iteration
break: exit loop entirely

25
What's the difference between == and is in Python?
== checks value equality, is checks identity (whether they're the same object in memory)
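A minimal sketch of the distinction:

```python
# Two lists with equal values are equal but not identical.
a = [1, 2, 3]
b = [1, 2, 3]
c = a
print(a == b)  # True: same values
print(a is b)  # False: two distinct objects in memory
print(a is c)  # True: c references the same object as a
```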
26
What is Big O notation?
Describes the upper bound of an algorithm's growth rate as input size increases; measures efficiency and scalability.
27
What's the space vs. time complexity tradeoff?
Sometimes we use extra memory (space) to reduce computation time (e.g., caching, precomputation)
28
What is hash collision and how is it handled?
Two keys hashing to the same index; resolved via chaining (linked lists) or open addressing (probing)
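A toy sketch of chaining, assuming a fixed bucket count (class name and sizes are illustrative, not a real library API):

```python
# Colliding keys land in the same bucket's list (separate chaining).
class ChainedMap:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: update
                bucket[i] = (key, value)
                return
        bucket.append((key, value))    # colliding keys share the bucket

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return None

m = ChainedMap()
m.put("a", 1)
m.put("b", 2)
```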
29
What is the difference between a stack and a queue?
Stack: LIFO (last in, first out). Queue: FIFO (first in, first out). Both can be implemented with a list or collections.deque (deque is preferred for O(1) operations at both ends).
30
What is sampling bias?
Occurs when the sample is not representative of the population, leading to skewed results.
31
What is stratified sampling?
Dividing the population into subgroups (strata) and sampling proportionally from each to ensure representation
32
What is standard error?
The standard deviation of the sampling distribution of a statistic (usually the mean).
33
What is a confidence interval?
A range of values likely to contain the population parameter with a certain confidence level (e.g., 95%)
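A sketch of a 95% CI for a mean using the normal approximation (z = 1.96); the data are made up for illustration:

```python
import math
import statistics

data = [9.8, 10.1, 10.4, 9.9, 10.2, 10.0, 10.3, 9.7]
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))  # standard error
ci = (mean - 1.96 * se, mean + 1.96 * se)           # 95% CI for the mean
```

For small samples, a t-critical value would be more appropriate than 1.96.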
34
What is correlation vs causation?
correlation = variables move together; causation = one variable causes change in another
35
What is multicollinearity?
When independent variables in regression are highly correlated, making coefficient estimates unstable.
36
What is the null hypothesis in A/B testing?
There is no difference between group A and B; any observed difference is due to chance.
37
What is the t-test used for?
Compares means between two groups when population standard deviation is unknown.
38
What is p-hacking?
Manipulating data or testing repeatedly until statistical significance (p < 0.05) is achieved.
39
What is the difference between precision and recall?
Precision = quality of positives; recall = completeness of positives
40
What is F1 score?
harmonic mean of precision and recall
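A tiny sketch computing precision, recall, and F1 by hand from made-up labels:

```python
# Confusion-matrix cells from true vs predicted binary labels.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
precision = tp / (tp + fp)                        # quality of positives
recall = tp / (tp + fn)                           # completeness of positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```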
41
What are common causes of overfitting?
too many features, not enough data, lack of regularization, model memorizes noise
42
What methods are used to prevent overfitting?
cross-validation, regularization (L1/L2), dropout, pruning, early stopping and simplifying model
43
What is denormalization and why use it?
Combining tables to reduce joins and improve query speed, at the cost of redundancy
44
What are fact and dimension tables?
Fact tables store measurable data (e.g., sales) Dimension tables store descriptive attributes (e.g., product, time, region)
45
What is an ER diagram?
Entity-relationship diagram - a visual representation of entities (tables) and their relationships (keys, cardinality)
46
What are the types of database relationships?
one-to-one, one-to-many, and many-to-many
47
What is referential integrity?
Ensures foreign key values always reference valid primary keys - prevents orphan records
48
What is data redundancy?
Duplication of data across tables, leading to inconsistencies and wasted storage.
49
What is data granularity?
Level of detail represented in a dataset
50
What is an index in a database?
Data structure (often B-tree) that speeds up data retrieval at the cost of extra space and slower writes
51
What is data optimization in modeling?
Techniques to improve performance: indexing, partitioning, caching, query tuning, normalization/denormalization balance.
52
A/B test results show version B’s mean click-through rate (CTR) is higher than A’s, but not statistically significant. What does that mean?
The observed difference may be due to random variation — we fail to reject the null hypothesis.
53
You’re testing a new recommendation model. How would you detect overfitting?
Compare training vs validation performance — if training accuracy ≫ validation accuracy → overfitting. Fix with regularization, dropout, or more data.
54
You’re given a dataset with customer purchases and want to compare spending between two demographics. What test do you use?
Use a t-test to compare the means of two groups. Check assumptions (normality, equal variance); if the data are skewed, use a non-parametric test (Mann–Whitney U).
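A sketch of Welch's t-statistic computed by hand on made-up spending data (the group values are illustrative; in practice you'd get the p-value from scipy.stats):

```python
import math
import statistics

group_a = [120, 135, 110, 140, 125, 130]
group_b = [150, 160, 145, 155, 165, 140]
ma, mb = statistics.mean(group_a), statistics.mean(group_b)
va, vb = statistics.variance(group_a), statistics.variance(group_b)
# Welch's t does not assume equal variances.
t = (ma - mb) / math.sqrt(va / len(group_a) + vb / len(group_b))
```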
55
You find that customers who see more ads spend more. How do you determine if ads cause higher spending?
Correlation ≠ causation. Use A/B testing (random assignment) to isolate effect or use causal inference techniques (e.g., difference-in-differences, propensity scoring).
56
You need to summarize a dataset’s distribution before modeling. Which descriptive statistics would you use?
Mean, median, mode (central tendency); variance, std, IQR (spread); skewness and kurtosis (shape); missing values count.
57
Your A/B test p-value is 0.03 at α = 0.05. What’s your conclusion?
Reject the null hypothesis — there’s a statistically significant difference between A and B at the 5% level.
58
You get a p-value of 0.25 in your test. Should you assume there’s no difference?
No — failure to reject H₀ ≠ proof of no effect. It means evidence is insufficient; may need more data or power.
59
Your model’s test accuracy improved after adding new features, but validation accuracy dropped. Why?
Likely overfitting — model learned noise in training set. Remove redundant features or regularize.
60
How would you detect outliers in a dataset of trade volumes?
Use the IQR rule (flag values outside Q1 − 1.5·IQR to Q3 + 1.5·IQR) or z-scores (e.g., |z| > 3); visualize with a boxplot or scatterplot.
61
If two datasets are combined and the mean changes drastically, what might that indicate?
Possible sampling bias or Simpson’s paradox — combined groups differ systematically.
62
How would you handle missing values when joining multiple datasets?
Use outer joins to preserve all data, then handle nulls: fill with defaults (mean/median), impute logically (e.g., 0 for missing sales), or drop rows if the impact is minimal.
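A minimal pandas sketch of this pattern on two made-up tables:

```python
import pandas as pd

sales = pd.DataFrame({"id": [1, 2, 3], "sales": [100, 200, 300]})
costs = pd.DataFrame({"id": [2, 3, 4], "cost": [50, 60, 70]})
# Outer join keeps rows from both sides; non-matching rows get NaN.
merged = pd.merge(sales, costs, on="id", how="outer")
merged["sales"] = merged["sales"].fillna(0)               # logical default
merged["cost"] = merged["cost"].fillna(merged["cost"].median())
```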
63
You’re normalizing a data model. What problem are you trying to avoid?
Redundancy and update anomalies. Normalization ensures each piece of data is stored once and referenced via keys.
64
You notice queries on a normalized schema are slow. What’s one optimization strategy?
Denormalization — merge tables for faster reads (fewer joins).
65
How would you design a schema to track daily stock prices per company?
Use a fact table for prices and dimension tables for companies and dates. Fact: price_id, company_id, date_id, open, close, volume.
66
Your dataset contains categorical columns like “industry” and “region.” How would you model or encode them?
Use label encoding for ordinal data; one-hot encoding for nominal. E.g. pandas.get_dummies() or sklearn.OneHotEncoder().
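A short sketch of one-hot encoding a nominal column with pandas (values are made up):

```python
import pandas as pd

df = pd.DataFrame({"region": ["US", "EU", "US", "APAC"]})
# get_dummies replaces the column with one binary column per category.
encoded = pd.get_dummies(df, columns=["region"])
```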
67
You’re tasked to “combine datasets with different taxonomies” — what does that mean?
The data uses different classification systems (e.g., industry codes differ). You must align them via a mapping table or crosswalk to ensure consistency.
68
You’re asked to validate that your random sampling function works correctly. How do you test it statistically?
Compare sampled and full-dataset distributions using the Kolmogorov–Smirnov test (continuous data) or the chi-square test (categorical), and check that summary stats (mean, std) are similar.
69
You run an A/B test, and version B has a higher mean conversion rate but a higher variance too. What’s your next step?
Run a t-test accounting for variance. If p-value < α, difference is significant. Otherwise, consider more samples to reduce uncertainty.
70
You’re tasked to “normalize” a dataset before machine learning. What does that mean in this context?
Scale numeric features to a consistent range or distribution. Min-max scaling: (x − min) / (max − min). Standardization: (x − μ) / σ. Purpose: prevent features with large magnitudes from dominating.
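Both formulas, computed by hand on made-up data:

```python
import statistics

x = [10.0, 20.0, 30.0, 40.0]
# Min-max scaling maps values into [0, 1].
lo, hi = min(x), max(x)
minmax = [(v - lo) / (hi - lo) for v in x]
# Standardization gives mean 0 and (population) std 1.
mu, sigma = statistics.mean(x), statistics.pstdev(x)
standardized = [(v - mu) / sigma for v in x]
```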
71
A data source sends you daily updates, but duplicates appear over time. How do you ensure you only keep the latest version of each record?
Sort or group by a unique ID and timestamp, then keep the latest
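A pandas sketch of this dedup pattern (ids, timestamps, and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "id":  [1, 1, 2],
    "ts":  ["2024-01-01", "2024-01-02", "2024-01-01"],
    "val": [10, 11, 20],
})
# Sort by timestamp, then keep only the last (newest) row per id.
latest = df.sort_values("ts").drop_duplicates(subset="id", keep="last")
```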
72
You have 1M rows of price data and want to compute the moving average efficiently. What data structure or technique helps?
Use a sliding window technique — maintain a rolling sum to avoid recomputation.
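A sketch of the rolling-sum technique: each step adds one element and subtracts the one that left the window, so the whole pass is O(n) instead of O(n·k):

```python
def moving_average(prices, window):
    out = []
    total = 0.0
    for i, p in enumerate(prices):
        total += p
        if i >= window:
            total -= prices[i - window]  # drop element leaving the window
        if i >= window - 1:
            out.append(total / window)   # window is full: emit an average
    return out

print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```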
73
Your model’s RMSE (root mean squared error) decreased after feature engineering, but R² stayed the same. What could that mean?
Scale or distribution of target might have changed; improvement may not be meaningful relative to total variance. Always inspect residual plots and validate on test data.
74
You’re analyzing clickstream data and notice timestamps are in different time zones. What’s the correct approach?
Convert all timestamps to a common timezone (usually UTC) before aggregation or modeling to maintain consistency.
75
Your dataset has a column with “Yes/No” responses stored inconsistently ('Y', 'y', 'Yes', 'yes', '1'). How would you clean it?
Normalize values to consistent lowercase, map all valid positives to 1, negatives to 0.
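A minimal cleaning sketch (the mapping covers the variants listed in the question plus a few assumed negatives):

```python
# Lowercase and strip before mapping, so 'Y', 'y ', 'Yes' all match.
mapping = {"y": 1, "yes": 1, "1": 1, "n": 0, "no": 0, "0": 0}

def clean(value):
    return mapping.get(str(value).strip().lower())

raw = ["Y", "y", "Yes", "yes", "1", "No"]
print([clean(v) for v in raw])  # [1, 1, 1, 1, 1, 0]
```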
76
What is a taxonomy?
a structured classification system that organizes concepts or entities into hierarchical categories based on shared characteristics
77
What is the structure of a taxonomy?
parent nodes are broader concepts, child nodes are more specific subtypes
78
Who did Bloomberg give the first 23 terminals to?
Merrill Lynch
79
What is ontology?
it adds relationships and rules, not just hierarchy
80
What is an example of an ontology?
an equity is traded on an exchange
81
What is schema?
defines structure of data fields and their types
82
How does Bloomberg use taxonomies and ontologies?
they use ontologies built from taxonomies, especially things like financial instruments, entities, and news categorization
83
What is the consistency principle for taxonomies?
same logic at every level, not mixing types under the same parent
84
What is the mutual exclusivity principle for taxonomies?
categories don't overlap, subtypes are specific to one parent only
85
What is the single inheritance principle for taxonomies?
each child has one parent
86
What is the exhaustiveness/completeness principle for taxonomies?
all relevant categories are represented
87
What is the proper granularity principle for taxonomies?
levels are at consistent detail - not skipping categories, not going too deep too soon
88
What is the clear labels principle for taxonomies?
names are unambiguous and standardized
89
What is the no cycles principle for taxonomies?
should form a tree or DAG, not loops
90
What should a clean taxonomy improve?
search, data linking, analytics, and knowledge graph accuracy
91
What is data annotation/labeling?
the process of adding metadata or labels to raw data to make it usable for ML or data analysis
92
What would the data annotation for text be?
sentiment = "positive"
93
What would the data annotation for image be?
object = "cat"
94
What would the data annotation for audio be?
transcription = "hello world"
95
What would the data annotation for financial news be?
category = "mergers & acquisitions"
96
What are the 3 ways labeling can be done?
manual labeling, automated labeling, and semi-automated (human-in-the-loop) labeling
97
What format should dates be within tables?
ISO format (YYYY-MM-DD)
98
What format should country be within tables?
ISO country codes
99
What are the three types of missingness within data?
MCAR, MAR, and MNAR
100
What is MCAR?
missing completely at random; ex: random glitch, no pattern
101
What is MAR?
missing at random and depends only on observed data not the missing value itself; ex: income missing depends on job type
102
What is MNAR?
missing not at random; ex: people with higher income don't report it
103
What are some approaches to handling missing data?
drop missing rows if the amount is small, impute (mean/median/mode), or use domain-driven handling (e.g., NaN when income=unemployed)
104
How would you handle MCAR data?
safest case: dropping rows with missing values won't bias the analysis
105
How would you handle MAR data?
multiple imputation is most appropriate
106
How would you handle MNAR data?
standard imputation is biased here; model the missingness mechanism itself (e.g., selection models) or run sensitivity analyses
107
What is random sampling?
every record has an equal chance of being picked
108
What is stratified sampling?
split by key groups and we sample equally from them
109
What is systematic sampling?
taking every kth observation
110
What is cluster sampling?
randomly choosing groups
111
What kind of sampling should we use for imbalanced data/rare results?
stratified sampling to ensure all classes are represented
112
What are the data management principles?
data governance, data quality, data security and privacy, data architecture, data lifecycle management, metadata management, data integration & interoperability, MDM, data ethics & compliance, data strategy & value
113
What is the data management principles mnemonic?
GQ-SPA-LIMES
114
What does GQ-SPA-LIMES stand for?
governance, quality, security, privacy, architecture, lifecycle, integration, metadata, ethics, strategy
115
What is the definition of data governance?
a framework of policies, processes, roles, and standards that ensures data is managed consistently and responsibly across the organization
116
What is the goal of data governance?
define ownership and accountability for data assets, ensure compliance with legal, regulatory, and ethical standards, enable data-driven decision-making through trusted data
117
What is the definition of data quality?
process that ensures data is accurate, complete, consistent, valid, timely, and unique
118
What is the definition for data security & privacy?
practices and technologies that protect data from unauthorized access, corruption, or misuse, while ensuring compliance with privacy regulations
119
What is the goal of data quality?
improve trust in analytics and operational systems, reduce errors caused by incorrect or incomplete data
120
What is the goal of data security & privacy?
maintain confidentiality, integrity and availability of data, ensure compliance with GDPR, CCPA, HIPAA, protect sensitive and personally identifiable information
121
What is the definition for data architecture?
the conceptual and logical design of how data flows through systems and how it is stored and used
122
What is the goal of data architecture?
ensure scalability, interoperability, efficiency, enable cross-platform data integration, provide the foundation for analytics, ML, and reporting systems
123
What is the definition for data lifecycle management?
managing data from its creation, use, storage, and archival to deletion, following organizational and legal retention rules
124
What is the goal of data lifecycle management?
control data growth and storage costs, ensure timely deletion or archival for compliance, and maintain version control and history tracking
125
What is the definition for metadata management?
the management of information describing other data - e.g., where it comes from, how it's used, what it means
126
What is the goal of metadata management?
make data discoverable and understandable, enable data lineage tracking and impact analysis, support governance and documentation
127
What is the definition for data integration and interoperability?
the process of combining data from multiple systems and formats into a unified, consistent, and usable form
128
What is the goal of data integration and interoperability?
ensure seamless data exchange between applications and platforms, enable a single view of data across the organization
129
What is the definition for master + reference data management?
creating and maintaining a single, consistent version of key business entities such as customers, suppliers, products, and employees
130
What is the goal of MDM?
eliminate duplication and ensure consistent identifiers, improve operational accuracy and reporting
131
What is the definition for data ethics and compliance?
ensuring data is used in a way that is fair, transparent, and aligned with ethical and societal norms
132
What is the definition for data strategy + value?
aligning data initiatives with organizational goals to create measurable business impact
133
What is the goal for data strategy + value?
define how data drives business outcomes, measure the ROI of data initiatives, develop a roadmap for data maturity
134
Why does data governance matter?
organizations face inconsistent data definitions, duplication, and compliance risks
135
What are examples/tools of data governance?
collibra, alation, informatica data governance, microsoft purview
136
Why does data quality matter?
poor-quality data leads to biased models, wrong business insights, and operational inefficiencies
137
What is profiling in data quality?
assessing current data quality
138
What is cleansing in data quality?
correct errors, fill in missing values
139
What is validation in data quality?
check ranges, formats, and referential integrity
140
What is monitoring in data quality?
continuously track quality KPIs (e.g., accuracy rate)
141
Why does data security and privacy matter?
breaches erode trust, lead to fines, and cause reputational damage
142
What are key activities of data security + privacy?
role-based access control, data encryption, regular security audits and penetration testing, implement data anonymization and tokenization
143
What are examples/tools of data security + privacy?
AWS KMS, HashiCorp Vault, Okta, OneTrust, Databricks Unity Catalog
144
Why does data architecture matter?
a poor architecture leads to data silos, high latency, and technical debt
145
What are key activities within data architecture?
define data models, establish storage layers: raw -> cleaned -> curated, implement data pipelines and integration workflows, standardize schema and documentation
146
What are examples/tools of data architecture?
snowflake, bigquery, aws redshift, apache kafka, airflow
147
Why does data lifecycle management matter?
a mismanaged data lifecycle can result in non-compliance and data bloat
148
What are key activities of data lifecycle management?
define retention policies, classify data by sensitivity and usage, automate archival and purge processes, and monitor lifecycle transitions
149
What are examples/tools of data lifecycle management?
AWS S3 Lifecycle Policies, Google Cloud Storage Object Lifecycle, Azure Blob Lifecycle Management
150
Why does metadata management matter?
metadata is the foundation for data catalogs, discovery, and trust in analytics
151
What are some key activities in metadata management?
build data catalog or business glossary, track lineage (source -> transformations -> usage), and maintain semantic metadata (meaning and relationships)
152
What are some examples/tools of metadata management?
apache atlas, datahub, amundsen, alation, collibra
153
Why does data integration + interoperability matter?
reduces silos and allows for cross-domain insights (e.g. joining sales, customer, and product data)
154
What are key activities of data integration + interoperability?
design and manage ETL/ELT pipelines, implement API-based integrations, use data virtualization or federation for real-time interoperability
155
Why does MDM matter?
disparate systems often have inconsistent identifiers or definitions, leading to analytic errors
156
What are some key activities of MDM?
identify master data entities, create golden records, define matching and merging rules, and synchronize master data across systems
157
What are the goals for data ethics and compliance?
prevent harm from data misuse or biased AI models, promote fairness, accountability, and transparency, and uphold users' rights to consent and control
158
Why does data ethics + compliance matter?
ethical breaches damage brand trust and can lead to regulatory consequences
159
What are key activities for data ethics + compliance?
bias testing and fairness auditing, transparent data collection and consent processes, establish ethical review boards for AI/ML projects
160
Why does data strategy + value matter?
without strategy, even strong data systems won't produce meaningful business outcomes
161
What are key activities for data strategy + value?
define KPIs and business objectives, prioritize high-impact data projects, foster a data-driven culture across teams
162
What are examples/tools of data strategy + value?
OKR frameworks, balanced scorecards, Snowflake and Tableau dashboards for impact tracking
163
What is the mean?
the average value
164
What is the median?
middle value - robust to outliers
165
What is the mode?
the most frequent value
166
What is the variance/standard deviation?
it measures the spread of data
167
What are percentiles?
distribution cutoffs
168
What is the A in A/B testing?
the control
169
What is the B in A/B testing?
the test
170
What is the MAE (Mean Absolute Error)?
the average magnitude of errors
171
What is the MSE/RMSE?
penalize large errors
172
What is MAPE?
% error
173
What is the standard error?
sampling uncertainty
174
What is the bias-variance tradeoff?
balance between complexity and generalization
175
What is data normalization?
the process of organizing data into related tables to reduce redundancy and improve integrity
176
What is the goal of optimization in data modeling?
speed + storage efficiency + query performance
177
What do indexes allow us to do?
speed up lookups, like indexing on ticker_symbol for faster search
178
What does partitioning allow us to do?
split large datasets (e.g., partition trades by year or region)
179
What does denormalization allow us to do?
store redundant data for read-heavy systems (e.g. store a company name directly in transaction table)
180
What does caching allow us to do?
precompute frequent queries (e.g. cache top 100 traded stocks)
181
What does columnar storage allow us to do?
for analytics workloads (e.g. use parquet instead of csv for large queries)
182
What is supervised learning?
model learns from labeled data (ex: predict stock price given historical data)
183
What is unsupervised learning?
model finds patterns in unlabeled data (ex: group customers by spending habits)
184
What is semi-supervised learning?
mix of labeled and unlabeled data
185
What is reinforcement learning?
agent learns by interacting with environment to maximize reward
186
What is feature engineering?
selecting and transforming variables to improve model performance
187
What is feature selection?
keeping only the most predictive features (e.g., drop low-variance or correlated columns using a heatmap)
188
What is dimensionality reduction?
reduce # of features while retaining variance
189
What can overfitting tell us about variance?
high variance
190
What does underfitting tell us about bias?
high bias
191
What are the core concepts for linear regression?
supervised, predict numeric output
192
What are the core concepts for logistic regression?
supervised classification, predict binary outcomes and outputs probabilities
193
What are the core concepts for decision trees?
supervised, can predict binary or numeric outputs, prone to overfitting
194
What are the core concepts for a random forest model?
ensemble, used for classification/regression, averages many trees and reduces variance
195
What are the core concepts for gradient boosting models?
XGBoost, LightGBM, ensemble, used with structured data, and have strong predictive power
196
What are the core concepts for SVM (Support Vector Machine) models?
supervised, used for classification, are effective in high dimensions
197
What are the core concepts for KNN (K-Nearest Neighbors) models?
supervised, used for classification, simple and non-parametric
198
What are the core concepts for Naive Bayes models?
supervised, used for text classification, and based on bayes' theorem
199
What are the core concepts for K-Means models?
unsupervised, used for clustering, and groups by centroid distance
200
What are the core concepts for hierarchical clustering models?
unsupervised, used for group discovery, and dendrogram structure
201
What are the core concepts for PCA (principal component analysis) models?
unsupervised, used for dimensionality reduction, and finds uncorrelated components
202
What are the core concepts for Neural Networks models?
supervised, used for complex patterns, and have layers of weights/activations
203
What are the core concepts for CNN (Convolutional Neural Network) models?
deep learning, used for image data, detects spatial features
204
What are the core concepts for RNN/LSTM?
deep learning, used for sequential data, used for time-series or NLP
205
What are the core concepts for transformer models?
deep learning, used for text/nlp, BERT, GPT; use attention mechanisms
206
What are evaluation metrics used for classification?
accuracy, precision, recall, f1, roc-auc
207
What are evaluation metrics for regression?
MAE, MSE, RMSE, R^2
208
What are evaluation metrics for ranking?
NDCG, MAP, MRR
209
What are evaluation metrics for clustering?
silhouette score, davies-bouldin index
210
What do evaluation metrics for classification score?
quality of class predictions
211
What do evaluation metrics for regression score?
deviation from true values
212
What do evaluation metrics for clustering score?
separation and cohesion of clusters
213
What is gradient descent?
iteratively adjusts parameters to minimize loss
214
What is learning rate?
step size for updates (too high -> unstable, too low -> slow)
215
What is regularization?
penalizes complexity (L1 = Lasso, L2 = Ridge)
216
What is Batch vs Mini-batch?
tradeoff between speed and stability in training
217
What is cross-validation?
split into k folds to validate model robustness
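A dependency-free sketch of the fold split itself (in practice you'd reach for sklearn's KFold; this just shows the index bookkeeping):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k (train, test) folds."""
    folds = []
    # Distribute any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(10, 5)
```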
218
What is Natural Language Processing (NLP) ?
teaching models to understand text
219
What is computer vision (CV)?
understanding images
220
What is generative AI?
creating new data based on learning patterns
221
What is Agentic AI?
AI systems that can autonomously act using tools or memory
222
What are Knowledge Graphs?
graph-based data representation for relationships
223
What is embedding?
vector representation of entities (ex: "Apple" -> [0.23, -0.14, 0.89, …])
224
What is Transfer Learning?
using pre-trained models for new tasks
225
What is data leakage?
using future data during training
226
What is imbalanced data?
uneven class ratios
227
What are high correlation features?
redundant info
228
What are scaling issues?
features with different ranges (use MinMaxScaler or StandardScaler)
229
What is model drift?
data distribution changes over time
230
What is a confusion matrix?
table showing TP, FP, FN, TN
231
What is the precision-recall tradeoff?
high recall catches more positives, may lower precision
232
What is a ROC Curve?
Plots TPR vs FPR; AUC measures overall quality
233
What is feature importance?
which variables influence model most
234
What are different ways of hyperparameter tuning?
GridSearchCV, RandomSearch, Bayesian Optimization
235
What is ensemble learning?
combines models (bagging, boosting, stacking)
236
What is bagging?
bootstrap aggregation; training multiple models on random subsets of data, average results
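A sketch of the bootstrap-and-average step, where each "model" is just the mean of a resample (data values and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

estimates = []
for _ in range(200):
    # bootstrap: sample n points with replacement, fit a trivial "model"
    sample = rng.choice(data, size=len(data), replace=True)
    estimates.append(sample.mean())

bagged = float(np.mean(estimates))  # averaging the models reduces variance
```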
237
What is boosting?
sequentially train models to correct predecessor errors (XGBoost, AdaBoost, CatBoost)
238
What is stacking?
combining different models' predictions via a meta-model (e.g., blending an SVM and a tree)
239
What is blending?
similar to stacking with holdout validation instead of k-fold
240
What is meta-learning?
"learning to learn" - models that adapt across tasks
241
What is self-supervised learning?
use data's own structure for labeling (e.g., masked word prediction in BERT)
242
What is contrastive learning?
learn representations by comparing similar vs dissimilar pairs
243
What is one-hot encoding?
represent categorical features as binary vectors
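In pandas this is one call (the column name and category values below are made up):

```python
import pandas as pd

df = pd.DataFrame({"major": ["CS", "Math", "CS"]})
encoded = pd.get_dummies(df["major"])  # one 0/1 column per category
```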
244
What is target encoding?
replace category with mean target value
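A sketch with a pandas groupby (toy data; in practice the means must come from the training split only, otherwise this leaks the target):

```python
import pandas as pd

df = pd.DataFrame({"major": ["CS", "Math", "CS", "Math"],
                   "GPA":   [3.0,  4.0,   3.5,  3.5]})
means = df.groupby("major")["GPA"].mean()    # per-category target mean
df["major_encoded"] = df["major"].map(means)  # category -> its mean GPA
```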
245
What are polynomial features?
add interaction terms
246
What is feature scaling?
normalize feature ranges (StandardScaler -> mean 0, std 1)
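Standardization done by hand on a toy column (note pandas .std() uses the sample convention, while sklearn's StandardScaler uses the population convention, so results can differ slightly):

```python
import pandas as pd

df = pd.DataFrame({"GPA": [2.0, 3.0, 4.0]})
# subtract the mean, divide by the standard deviation -> mean 0, std 1
scaled = (df["GPA"] - df["GPA"].mean()) / df["GPA"].std()
```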
247
What is imbalanced learning?
handle skewed classes (SMOTE, undersampling, weighted loss)
248
What is dimensionality reduction?
reduce features while preserving relationships
249
What is feature hashing?
convert high-cardinality text to fixed length vectors
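A sketch of the hashing trick; the bucket count is arbitrary, and md5 is used only for a deterministic hash (real implementations use faster non-cryptographic hashes):

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    vec = [0] * n_buckets
    for tok in tokens:
        # map each token to one of n_buckets slots and count it there
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % n_buckets] += 1
    return vec

vec = hash_features(["bloomberg", "ai", "rocks", "ai"])  # fixed length 8
```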
250
What is regularization?
prevents overfitting by penalizing large weights
251
What is dropout?
randomly deactivate neurons during training (common in neural nets)
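An inverted-dropout sketch in numpy (p and the seed are arbitrary); survivors are rescaled by 1/(1-p) so the expected activation is unchanged at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5):
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1 - p
    return x * mask / (1 - p)        # rescale survivors (inverted dropout)

out = dropout(np.ones(1000))  # values are 0.0 (dropped) or 2.0 (kept, rescaled)
```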
252
What is batch normalization?
normalize activations per mini-batch (speeds up deep network training)
253
What is early stopping?
stop training when validation loss stops improving (avoids overfit)
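The patience logic in a few lines (the validation-loss curve here is invented):

```python
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]  # made-up per-epoch losses
patience, best, bad_epochs, stop_epoch = 2, float("inf"), 0, None

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, bad_epochs = loss, 0  # improvement: reset the counter
    else:
        bad_epochs += 1             # no improvement this epoch
        if bad_epochs >= patience:
            stop_epoch = epoch      # patience exhausted: stop training
            break
```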
254
What is momentum?
speeds convergence by smoothing gradient updates
255
What is Adam Optimizer?
adaptive learning rates combining momentum (first moment) with RMSProp-style second-moment scaling
256
What is Learning Rate Scheduler?
adjusts learning rate dynamically
257
What is tokenization?
split text into tokens ("Bloomberg AI rocks" -> ["bloomberg", "ai", "rocks"])
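A minimal lowercase word tokenizer (real NLP pipelines use subword tokenizers such as BPE or WordPiece):

```python
import re

def tokenize(text):
    # lowercase, then pull out runs of letters/digits
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Bloomberg AI rocks")  # -> ["bloomberg", "ai", "rocks"]
```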
258
What is embedding?
vector representation of words
259
What are examples of embedding ?
Word2Vec, GloVe, BERT embeddings
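Embeddings are typically compared with cosine similarity; a sketch on made-up 2-d vectors:

```python
import numpy as np

def cosine(a, b):
    # dot product of the vectors divided by the product of their lengths
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine([1.0, 0.0], [1.0, 0.0])        # identical direction -> 1.0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # unrelated direction -> 0.0
```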
260
What is sequence-to-sequence?
maps input -> output sequences
261
What is attention mechanism?
focus on relevant parts of input
262
What is named entity recognition?
identify entities (Person, Org, Location)
263
What is sentiment analysis?
predict polarity (positive/negative)
264
What is Retrieval-Augmented Generation (RAG) ?
combines retrieval and generation for factual accuracy (ex: chatbots that search documents)
265
What are the six dimensions of data quality?
completeness, accuracy, consistency, uniqueness, timeliness, validity
266
What makes a table not optimized?
redundant data (the same information stored repeatedly); update anomalies (changing one record requires updating many others); insertion anomalies (can't add new data because other data is missing); deletion anomalies (deleting a record unintentionally removes important info); and poor query performance (wide tables with repeated text strings slow down queries)
267
What is 1NF?
Each cell holds a single value (no lists or nested data)
268
What is 2NF?
Every non-key column depends on the entire primary key
269
What is 3NF?
No transitive dependencies (non-key attributes shouldn’t depend on other non-key attributes)
270
In pandas, how do you get the very first rows?
df.head()
271
In pandas, how do you get shape?
df.shape
272
In pandas, how do you get data types?
df.dtypes
273
In pandas, how do you get summary statistics?
df.describe()
274
In pandas, how do you sort values?
df.sort_values("GPA", ascending=False)
275
In pandas, how do you get unique values?
df["major"].unique()
276
In pandas, how do you get missing % per column?
df.isna().mean() * 100
277
In pandas, how do you get standard deviation (spread)?
df["GPA"].std()
278
In pandas, how do you get quantiles?
df["GPA"].quantile([0.25, 0.5, 0.75])
279
In pandas, how do you visualize numerical data in a histogram?
df["GPA"].hist(bins=10)
280
In pandas, how do you create a boxplot?
df.boxplot(column="GPA")
281
In pandas, how do you visualize categorical data?
df["major"].value_counts().plot(kind="bar") # Bar chart
282
How do you do IQR method for outlier detection?
Q1 = df["GPA"].quantile(0.25)
Q3 = df["GPA"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["GPA"] < Q1 - 1.5*IQR) | (df["GPA"] > Q3 + 1.5*IQR)]