Bloomberg 1st Technical Flashcards

(282 cards)

1
Q

What is a hash map (dictionary) and why is it useful?

A

A hash map stores key-value pairs for fast lookup, insertion, and deletion with an average O(1) time.
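A minimal sketch of these O(1) operations using Python's built-in dict (tickers and prices are made up for illustration):

```python
# dict is Python's built-in hash map.
prices = {}
prices["AAPL"] = 189.5       # average O(1) insertion
prices["MSFT"] = 410.2
print(prices.get("AAPL"))    # average O(1) lookup
print(prices.get("GOOG"))    # missing key: get() returns None
del prices["MSFT"]           # average O(1) deletion
```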

2
Q

What is the difference between an array and a linked list?

A

Arrays store elements contiguously, giving O(1) index access but costly middle insertions; linked lists store nodes connected by pointers, giving O(1) insertion/deletion at a known node but O(n) access. Arrays are typically fixed-size; linked lists grow dynamically.

3
Q

What is the difference between a for loop and a while loop?

A

for loops iterate a fixed number of times or over a sequence; while loops continue until a condition is false

4
Q

What’s the time complexity for searching for an element in an unsorted array?

A

O(n), since every element may need to be checked

5
Q

What’s the difference between mutable and immutable data types in Python?

A

mutable types (e.g., lists, dicts) can be changed in place; immutable types (e.g., strings, tuples) cannot

6
Q

What is sampling?

A

selecting a subset of data from a population to make inferences about the whole population

7
Q

What is the central limit theorem (CLT)?

A

The CLT states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of population distribution

8
Q

What is hypothesis testing?

A

a method to test a claim about a population using sample data. Involves a null hypothesis (H_0), alternate hypothesis (H_1), and p-value.

9
Q

What is an A/B test?

A

A controlled experiment comparing two versions (A and B) to determine which performs better on a specific metric.

10
Q

What are Type I and Type II errors?

A

Type I (false positive): reject true H_0
Type II (false negative): fail to reject false H_0

11
Q

What is a p-value?

A

The probability of observing data as extreme as, or more extreme than, your sample assuming H_0 is true.

12
Q

What is the difference between descriptive and inferential statistics?

A

Descriptive summarizes data (mean, median, std). Inferential draws conclusions from the data (hypothesis tests, confidence intervals).

13
Q

What are overfitting and underfitting in ML?

A

Overfitting: model fits training data too closely, leading to poor generalization.
Underfitting: model too simple to capture patterns.

14
Q

What is bias-variance tradeoff?

A

Increasing model complexity reduces bias but increases variance; goal is to find balance for best generalization.

15
Q

What are measures of central tendency and variability?

A

Central tendency: mean, median, mode
Variability: range, variance, standard deviation

16
Q

What is data normalization?

A

Process of organizing data to reduce redundancy and improve integrity, often by splitting data into related tables with foreign keys.

17
Q

What are the normal forms in database normalization?

A

1NF - atomic columns
2NF - no partial dependency
3NF - no transitive dependency

18
Q

What is a primary key vs. a foreign key?

A

A primary key uniquely identifies a row. Foreign key references a primary key in another table to link related data.

19
Q

What is a taxonomy in data modeling?

A

A hierarchical classification system (e.g., product categories -> subcategories -> items)

20
Q

What does “combining datasets” mean?

A

Merging or joining data from multiple sources to create a unified dataset. Common operations: inner, outer, left, right joins.

21
Q

What is data optimization?

A

Process of improving data storage, retrieval, and query performance using indexing, denormalization, caching or partitioning.

22
Q

How can you iterate over both index and value in Python?

A

Use enumerate(), e.g., for i, val in enumerate(my_list):
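A minimal runnable sketch of enumerate() on a made-up list:

```python
# enumerate() yields (index, value) pairs.
my_list = ["a", "b", "c"]
pairs = [(i, val) for i, val in enumerate(my_list)]
print(pairs)  # [(0, 'a'), (1, 'b'), (2, 'c')]
```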

23
Q

How can you count occurrences efficiently in Python?

A

collections.Counter() - returns a dict-like object mapping elements to counts
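A short sketch using Counter on a made-up trade log:

```python
from collections import Counter

# Count ticker occurrences in one pass.
trades = ["AAPL", "MSFT", "AAPL", "GOOG", "AAPL"]
counts = Counter(trades)
print(counts["AAPL"])         # frequency of one element
print(counts.most_common(1))  # most frequent element with its count
```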

24
Q

What is the difference between pass, continue, and break in loops?

A

pass: does nothing, placeholder
continue: skip to next iteration
break: exit loop entirely

25
What's the difference between == and is in Python?
== checks value equality, is checks identity (whether they're the same object in memory)
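A minimal sketch of the distinction:

```python
# Two lists with equal values are equal but not identical.
a = [1, 2, 3]
b = [1, 2, 3]
c = a
print(a == b)  # True: same values
print(a is b)  # False: two distinct objects in memory
print(a is c)  # True: c references the same object as a
```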
26
What is Big O notation?
Describes the upper bound of an algorithm's growth rate as input size increases; measures efficiency and scalability.
27
What's the space vs. time complexity tradeoff?
Sometimes we use extra memory (space) to reduce computation time (e.g., caching, precomputation)
28
What is hash collision and how is it handled?
Two keys hashing to the same index; resolved via chaining (linked lists) or open addressing (probing)
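A toy sketch of chaining, assuming a fixed bucket count (class name and sizes are illustrative, not a real library API):

```python
# Colliding keys land in the same bucket's list (separate chaining).
class ChainedMap:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: update
                bucket[i] = (key, value)
                return
        bucket.append((key, value))    # colliding keys share the bucket

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return None

m = ChainedMap()
m.put("a", 1)
m.put("b", 2)
```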
29
What is the difference between a stack and a queue?
Stack: LIFO (last in, first out). Queue: FIFO (first in, first out). Both can be implemented with a list or collections.deque (deque is preferred for O(1) operations at both ends).
30
What is sampling bias?
Occurs when the sample is not representative of the population, leading to skewed results.
31
What is stratified sampling?
Dividing the population into subgroups (strata) and sampling proportionally from each to ensure representation
32
What is standard error?
The standard deviation of the sampling distribution of a statistic (usually the mean).
33
What is a confidence interval?
A range of values likely to contain the population parameter with a certain confidence level (e.g., 95%)
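A sketch of a 95% CI for a mean using the normal approximation (z = 1.96); the data are made up for illustration:

```python
import math
import statistics

data = [9.8, 10.1, 10.4, 9.9, 10.2, 10.0, 10.3, 9.7]
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))  # standard error
ci = (mean - 1.96 * se, mean + 1.96 * se)           # 95% CI for the mean
```

For small samples, a t-critical value would be more appropriate than 1.96.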
34
What is correlation vs causation?
correlation = variables move together; causation = one variable causes change in another
35
What is multicollinearity?
When independent variables in regression are highly correlated, making coefficient estimates unstable.
36
What is the null hypothesis in A/B testing?
There is no difference between group A and B; any observed difference is due to chance.
37
What is the t-test used for?
Compares means between two groups when population standard deviation is unknown.
38
What is p-hacking?
Manipulating data or testing repeatedly until statistical significance (p < 0.05) is achieved.
39
What is the difference between precision and recall?
Precision = quality of positives; recall = completeness of positives
40
What is F1 score?
harmonic mean of precision and recall
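A tiny sketch computing precision, recall, and F1 by hand from made-up labels:

```python
# Confusion-matrix cells from true vs predicted binary labels.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
precision = tp / (tp + fp)                        # quality of positives
recall = tp / (tp + fn)                           # completeness of positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```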
41
What are common causes of overfitting?
too many features, not enough data, lack of regularization, model memorizes noise
42
What methods are used to prevent overfitting?
cross-validation, regularization (L1/L2), dropout, pruning, early stopping and simplifying model
43
What is denormalization and why use it?
Combining tables to reduce joins and improve query speed, at the cost of redundancy
44
What are fact and dimension tables?
Fact tables store measurable data (e.g., sales) Dimension tables store descriptive attributes (e.g., product, time, region)
45
What is an ER diagram?
Entity-relationship diagram - a visual representation of entities (tables) and their relationships (keys, cardinality)
46
What are the types of database relationships?
one-to-one, one-to-many, and many-to-many
47
What is referential integrity?
Ensures foreign key values always reference valid primary keys - prevents orphan records
48
What is data redundancy?
Duplication of data across tables, leading to inconsistencies and wasted storage.
49
What is data granularity?
Level of detail represented in a dataset
50
What is an index in a database?
Data structure (often B-tree) that speeds up data retrieval at the cost of extra space and slower writes
51
What is data optimization in modeling?
Techniques to improve performance: indexing, partitioning, caching, query tuning, normalization/denormalization balance.
52
A/B test results show version B’s mean click-through rate (CTR) is higher than A’s, but not statistically significant. What does that mean?
The observed difference may be due to random variation — we fail to reject the null hypothesis.
53
You’re testing a new recommendation model. How would you detect overfitting?
Compare training vs validation performance — if training accuracy ≫ validation accuracy → overfitting. Fix with regularization, dropout, or more data.
54
You’re given a dataset with customer purchases and want to compare spending between two demographics. What test do you use?
Use a t-test to compare the means of two groups. Check assumptions (normality, equal variance); if the data are skewed, use a non-parametric test (Mann–Whitney U).
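A sketch of Welch's t-statistic computed by hand on made-up spending data (the group values are illustrative; in practice you'd get the p-value from scipy.stats):

```python
import math
import statistics

group_a = [120, 135, 110, 140, 125, 130]
group_b = [150, 160, 145, 155, 165, 140]
ma, mb = statistics.mean(group_a), statistics.mean(group_b)
va, vb = statistics.variance(group_a), statistics.variance(group_b)
# Welch's t does not assume equal variances.
t = (ma - mb) / math.sqrt(va / len(group_a) + vb / len(group_b))
```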
55
You find that customers who see more ads spend more. How do you determine if ads cause higher spending?
Correlation ≠ causation. Use A/B testing (random assignment) to isolate effect or use causal inference techniques (e.g., difference-in-differences, propensity scoring).
56
You need to summarize a dataset’s distribution before modeling. Which descriptive statistics would you use?
Mean, median, mode (central tendency); variance, std, IQR (spread); skewness and kurtosis (shape); missing values count.
57
Your A/B test p-value is 0.03 at α = 0.05. What’s your conclusion?
Reject the null hypothesis — there’s a statistically significant difference between A and B at the 5% level.
58
You get a p-value of 0.25 in your test. Should you assume there’s no difference?
No — failure to reject H₀ ≠ proof of no effect. It means evidence is insufficient; may need more data or power.
59
Your model’s test accuracy improved after adding new features, but validation accuracy dropped. Why?
Likely overfitting — model learned noise in training set. Remove redundant features or regularize.
60
How would you detect outliers in a dataset of trade volumes?
Use the IQR rule (flag values outside Q1 − 1.5·IQR to Q3 + 1.5·IQR) or z-scores (e.g., |z| > 3); visualize with a boxplot or scatterplot.
61
If two datasets are combined and the mean changes drastically, what might that indicate?
Possible sampling bias or Simpson’s paradox — combined groups differ systematically.
62
How would you handle missing values when joining multiple datasets?
Use outer joins to preserve all data, then handle nulls: fill with defaults (mean/median), impute logically (e.g., 0 for missing sales), or drop rows if the impact is minimal.
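A minimal pandas sketch of this pattern on two made-up tables:

```python
import pandas as pd

sales = pd.DataFrame({"id": [1, 2, 3], "sales": [100, 200, 300]})
costs = pd.DataFrame({"id": [2, 3, 4], "cost": [50, 60, 70]})
# Outer join keeps rows from both sides; non-matching rows get NaN.
merged = pd.merge(sales, costs, on="id", how="outer")
merged["sales"] = merged["sales"].fillna(0)               # logical default
merged["cost"] = merged["cost"].fillna(merged["cost"].median())
```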
63
You’re normalizing a data model. What problem are you trying to avoid?
Redundancy and update anomalies. Normalization ensures each piece of data is stored once and referenced via keys.
64
You notice queries on a normalized schema are slow. What’s one optimization strategy?
Denormalization — merge tables for faster reads (fewer joins).
65
How would you design a schema to track daily stock prices per company?
Use a fact table for prices and dimension tables for companies and dates. Fact: price_id, company_id, date_id, open, close, volume.
66
Your dataset contains categorical columns like “industry” and “region.” How would you model or encode them?
Use label encoding for ordinal data; one-hot encoding for nominal. E.g. pandas.get_dummies() or sklearn.OneHotEncoder().
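A short sketch of one-hot encoding a nominal column with pandas (values are made up):

```python
import pandas as pd

df = pd.DataFrame({"region": ["US", "EU", "US", "APAC"]})
# get_dummies replaces the column with one binary column per category.
encoded = pd.get_dummies(df, columns=["region"])
```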
67
You’re tasked to “combine datasets with different taxonomies” — what does that mean?
The data uses different classification systems (e.g., industry codes differ). You must align them via a mapping table or crosswalk to ensure consistency.
68
You’re asked to validate that your random sampling function works correctly. How do you test it statistically?
Compare sampled and full-dataset distributions using the Kolmogorov–Smirnov test (continuous data) or the chi-square test (categorical), and check that summary stats (mean, std) are similar.
69
You run an A/B test, and version B has a higher mean conversion rate but a higher variance too. What’s your next step?
Run a t-test accounting for variance. If p-value < α, difference is significant. Otherwise, consider more samples to reduce uncertainty.
70
You’re tasked to “normalize” a dataset before machine learning. What does that mean in this context?
Scale numeric features to a consistent range or distribution. Min-max scaling: (x − min) / (max − min). Standardization: (x − μ) / σ. Purpose: prevent features with large magnitudes from dominating.
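Both formulas, computed by hand on made-up data:

```python
import statistics

x = [10.0, 20.0, 30.0, 40.0]
# Min-max scaling maps values into [0, 1].
lo, hi = min(x), max(x)
minmax = [(v - lo) / (hi - lo) for v in x]
# Standardization gives mean 0 and (population) std 1.
mu, sigma = statistics.mean(x), statistics.pstdev(x)
standardized = [(v - mu) / sigma for v in x]
```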
71
A data source sends you daily updates, but duplicates appear over time. How do you ensure you only keep the latest version of each record?
Sort or group by a unique ID and timestamp, then keep the latest
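A pandas sketch of this dedup pattern (ids, timestamps, and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "id":  [1, 1, 2],
    "ts":  ["2024-01-01", "2024-01-02", "2024-01-01"],
    "val": [10, 11, 20],
})
# Sort by timestamp, then keep only the last (newest) row per id.
latest = df.sort_values("ts").drop_duplicates(subset="id", keep="last")
```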
72
You have 1M rows of price data and want to compute the moving average efficiently. What data structure or technique helps?
Use a sliding window technique — maintain a rolling sum to avoid recomputation.
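A sketch of the rolling-sum technique: each step adds one element and subtracts the one that left the window, so the whole pass is O(n) instead of O(n·k):

```python
def moving_average(prices, window):
    out = []
    total = 0.0
    for i, p in enumerate(prices):
        total += p
        if i >= window:
            total -= prices[i - window]  # drop element leaving the window
        if i >= window - 1:
            out.append(total / window)   # window is full: emit an average
    return out

print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```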
73
Your model’s RMSE (root mean squared error) decreased after feature engineering, but R² stayed the same. What could that mean?
Scale or distribution of target might have changed; improvement may not be meaningful relative to total variance. Always inspect residual plots and validate on test data.
74
You’re analyzing clickstream data and notice timestamps are in different time zones. What’s the correct approach?
Convert all timestamps to a common timezone (usually UTC) before aggregation or modeling to maintain consistency.
75
Your dataset has a column with “Yes/No” responses stored inconsistently ('Y', 'y', 'Yes', 'yes', '1'). How would you clean it?
Normalize values to consistent lowercase, map all valid positives to 1, negatives to 0.
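A minimal cleaning sketch (the mapping covers the variants listed in the question plus a few assumed negatives):

```python
# Lowercase and strip before mapping, so 'Y', 'y ', 'Yes' all match.
mapping = {"y": 1, "yes": 1, "1": 1, "n": 0, "no": 0, "0": 0}

def clean(value):
    return mapping.get(str(value).strip().lower())

raw = ["Y", "y", "Yes", "yes", "1", "No"]
print([clean(v) for v in raw])  # [1, 1, 1, 1, 1, 0]
```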
76
What is a taxonomy?
a structured classification system that organizes concepts or entities into hierarchical categories based on shared characteristics
77
What is the structure of a taxonomy?
parent nodes are broader concepts, child nodes are more specific subtypes
78
Who did Bloomberg give the first 23 terminals to?
Merrill Lynch
79
What is ontology?
it adds relationships and rules, not just hierarchy
80
What is an example of an ontology?
an equity is traded on an exchange
81
What is schema?
defines structure of data fields and their types
82
How does Bloomberg use taxonomies and ontologies?
they use ontologies built from taxonomies, especially things like financial instruments, entities, and news categorization
83
What is the consistency principle for taxonomies?
same logic at every level, not mixing types under the same parent
84
What is the mutual exclusivity principle for taxonomies?
categories don't overlap, subtypes are specific to one parent only
85
What is the single inheritance principle for taxonomies?
each child has one parent
86
What is the exhaustiveness/completeness principle for taxonomies?
all relevant categories are represented
87
What is the proper granularity principle for taxonomies?
levels are at consistent detail - not skipping categories, not going too deep too soon
88
What is the clear labels principle for taxonomies?
names are unambiguous and standardized
89
What is the no cycles principle for taxonomies?
should form a tree or DAG, not loops
90
What should a clean taxonomy improve?
search, data linking, analytics, and knowledge graph accuracy
91
What is data annotation/labeling?
the process of adding metadata or labels to raw data to make it usable for ML or data analysis
92
What would the data annotation for text be?
sentiment = "positive"
93
What would the data annotation for image be?
object = "cat"
94
What would the data annotation for audio be?
transcription = "hello world"
95
What would the data annotation for financial news be?
category = "mergers & acquisitions"
96
What are the 3 ways labeling can be done?
manual labeling, automated labeling, and semi-automated (human-in-the-loop) labeling
97
What format should dates be within tables?
ISO format (YYYY-MM-DD)
98
What format should country be within tables?
ISO country codes
99
What are the three types of missingness within data?
MCAR, MAR, and MNAR
100
What is MCAR?
missing completely at random; ex: random glitch, no pattern
101
What is MAR?
missing at random and depends only on observed data not the missing value itself; ex: income missing depends on job type
102
What is MNAR?
missing not at random; ex: people with higher income don't report it
103
What are some approaches to handling missing data?
drop missing rows if the amount is small, impute (mean/median/mode), or use domain-driven handling (e.g., NaN when income=unemployed)
104
How would you handle MCAR data?
safest case: dropping rows with missing values won't bias the analysis
105
How would you handle MAR data?
multiple imputation is most appropriate
106
How would you handle MNAR data?
standard imputation is biased here; model the missingness mechanism itself (e.g., selection models) or run sensitivity analyses
107
What is random sampling?
every record has an equal chance of being picked
108
What is stratified sampling?
split by key groups and we sample equally from them
109
What is systematic sampling?
taking every kth observation
110
What is cluster sampling?
randomly choosing groups
111
What kind of sampling should we use for imbalanced data/rare results?
stratified sampling to ensure all classes are represented
112
What are the data management principles?
data governance, data quality, data security and privacy, data architecture, data lifecycle management, metadata management, data integration & interoperability, MDM, data ethics & compliance, data strategy & value
113
What is the data management principles mnemonic?
GQ-SPA-LIMES
114
What does GQ-SPA-LIMES stand for?
governance, quality, security, privacy, architecture, lifecycle, integration, metadata, ethics, strategy
115
What is the definition of data governance?
a framework of policies, processes, roles, and standards that ensures data is managed consistently and responsibly across the organization
116
What is the goal of data governance?
define ownership and accountability for data assets, ensure compliance with legal, regulatory, and ethical standards, enable data-driven decision-making through trusted data
117
What is the definition of data quality?
process that ensures data is accurate, complete, consistent, valid, timely, and unique
118
What is the definition for data security & privacy?
practices and technologies that protect data from unauthorized access, corruption, or misuse, while ensuring compliance with privacy regulations
119
What is the goal of data quality?
improve trust in analytics and operational systems, reduce errors caused by incorrect or incomplete data
120
What is the goal of data security & privacy?
maintain confidentiality, integrity and availability of data, ensure compliance with GDPR, CCPA, HIPAA, protect sensitive and personally identifiable information
121
What is the definition for data architecture?
the conceptual and logical design of how data flows through systems and how it is stored and used
122
What is the goal of data architecture?
ensure scalability, interoperability, efficiency, enable cross-platform data integration, provide the foundation for analytics, ML, and reporting systems
123
What is the definition for data lifecycle management?
managing data from its creation, use, storage, and archival to deletion, following organizational and legal retention rules
124
What is the goal of data lifecycle management?
control data growth and storage costs, ensure timely deletion or archival for compliance, and maintain version control and history tracking
125
What is the definition for metadata management?
the management of information describing other data - e.g., where it comes from, how it's used, what it means
126
What is the goal of metadata management?
make data discoverable and understandable, enable data lineage tracking and impact analysis, support governance and documentation
127
What is the definition for data integration and interoperability?
the process of combining data from multiple systems and formats into a unified, consistent, and usable form
128
What is the goal of data integration and interoperability?
ensure seamless data exchange between applications and platforms, enable a single view of data across the organization
129
What is the definition for master + reference data management?
creating and maintaining a single, consistent version of key business entities such as customers, suppliers, products, and employees
130
What is the goal of MDM?
eliminate duplication and ensure consistent identifiers, improve operational accuracy and reporting
131
What is the definition for data ethics and compliance?
ensuring data is used in a way that is fair, transparent, and aligned with ethical and societal norms
132
What is the definition for data strategy + value?
aligning data initiatives with organizational goals to create measurable business impact
133
What is the goal for data strategy + value?
define how data drives business outcomes, measure the ROI of data initiatives, develop a roadmap for data maturity
134
Why does data governance matter?
organizations face inconsistent data definitions, duplication, and compliance risks
135
What are examples/tools of data governance?
collibra, alation, informatica data governance, microsoft purview
136
Why does data quality matter?
poor-quality data leads to biased models, wrong business insights, and operational inefficiencies
137
What is profiling in data quality?
assessing current data quality
138
What is cleansing in data quality?
correct errors, fill in missing values
139
What is validation in data quality?
check ranges, formats, and referential integrity
140
What is monitoring in data quality?
continuously track quality KPIs (e.g., accuracy rate)
141
Why does data security and privacy matter?
breaches erode trust, lead to fines, and cause reputational damage
142
What are key activities of data security + privacy?
role-based access control, data encryption, regular security audits and penetration testing, implement data anonymization and tokenization
143
What are examples/tools of data security + privacy?
AWS KMS, HashiCorp Vault, Okta, OneTrust, Databricks Unity Catalog
144
Why does data architecture matter?
a poor architecture leads to data silos, high latency, and technical debt
145
What are key activities within data architecture?
define data models, establish storage layers: raw -> cleaned -> curated, implement data pipelines and integration workflows, standardize schema and documentation
146
What are examples/tools of data architecture?
snowflake, bigquery, aws redshift, apache kafka, airflow
147
Why does data lifecycle management matter?
a mismanaged data lifecycle can result in non-compliance and data bloat
148
What are key activities of data lifecycle management?
define retention policies, classify data by sensitivity and usage, automate archival and purge processes, and monitor lifecycle transitions
149
What are examples/tools of data lifecycle management?
AWS S3 Lifecycle Policies, Google Cloud Storage Object Lifecycle, Azure Blob Lifecycle Management
150
Why does metadata management matter?
metadata is the foundation for data catalogs, discovery, and trust in analytics
151
What are some key activities in metadata management?
build data catalog or business glossary, track lineage (source -> transformations -> usage), and maintain semantic metadata (meaning and relationships)
152
What are some examples/tools of metadata management?
apache atlas, datahub, amundsen, alation, collibra
153
Why does data integration + interoperability matter?
reduces silos and allows for cross-domain insights (e.g. joining sales, customer, and product data)
154
What are key activities of data integration + interoperability?
design and manage ETL/ELT pipelines, implement API-based integrations, use data virtualization or federation for real-time interoperability
155
Why does MDM matter?
disparate systems often have inconsistent identifiers or definitions, leading to analytic errors
156
What are some key activities of MDM?
identify master data entities, create golden records, define matching and merging rules, and synchronize master data across systems
157
What are the goals for data ethics and compliance?
prevent harm from data misuse or biased AI models, promote fairness, accountability, and transparency, and uphold users' rights to consent and control
158
Why does data ethics + compliance matter?
ethical breaches damage brand trust and can lead to regulatory consequences
159
What are key activities for data ethics + compliance?
bias testing and fairness auditing, transparent data collection and consent processes, establish ethical review boards for AI/ML projects
160
Why does data strategy + value matter?
without strategy, even strong data systems won't produce meaningful business outcomes
161
What are key activities for data strategy + value?
define KPIs and business objectives, prioritize high-impact data projects, foster a data-driven culture across teams
162
What are examples/tools of data strategy + value?
OKR frameworks, balanced scorecards, Snowflake and Tableau dashboards for impact tracking
163
What is the mean?
the average value
164
What is the median?
middle value - robust to outliers
165
What is the mode?
the most frequent value
166
What is the variance/standard deviation?
it measures the spread of data
167
What are percentiles?
distribution cutoffs
168
What is the A in A/B testing?
the control
169
What is the B in A/B testing?
the test
170
What is the MAE (Mean Absolute Error)?
the average magnitude of errors
171
What is the MSE/RMSE?
penalize large errors
172
What is MAPE?
% error
173
What is the standard error?
sampling uncertainty
174
What is the bias-variance tradeoff?
balance between complexity and generalization
175
What is data normalization?
the process of organizing data into related tables to reduce redundancy and improve integrity
176
What is the goal of optimization in data modeling?
speed + storage efficiency + query performance
177
What do indexes allow us to do?
speed up lookups, like indexing on ticker_symbol for faster search
178
What does partitioning allow us to do?
split large datasets (e.g., partition trades by year or region)
179
What does denormalization allow us to do?
store redundant data for read-heavy systems (e.g. store a company name directly in transaction table)
180
What does caching allow us to do?
precompute frequent queries (e.g. cache top 100 traded stocks)
181
What does columnar storage allow us to do?
for analytics workloads (e.g. use parquet instead of csv for large queries)
182
What is supervised learning?
model learns from labeled data (ex: predict stock price given historical data)
183
What is unsupervised learning?
model finds patterns in unlabeled data (ex: group customers by spending habits)
184
What is semi-supervised learning?
mix of labeled and unlabeled data
185
What is reinforcement learning?
agent learns by interacting with environment to maximize reward
186
What is feature engineering?
selecting and transforming variables to improve model performance
187
What is feature selection?
keeping only the most predictive features (e.g., drop low-variance or correlated columns using a heatmap)
188
What is dimensionality reduction?
reduce # of features while retaining variance
189
What can overfitting tell us about variance?
high variance
190
What does underfitting tell us about bias?
high bias
191
What are the core concepts for linear regression?
supervised, predict numeric output
192
What are the core concepts for logistic regression?
supervised classification, predict binary outcomes and outputs probabilities
193
What are the core concepts for decision trees?
supervised, can predict binary or numeric outputs, prone to overfitting
194
What are the core concepts for a random forest model?
ensemble, used for classification/regression, averages many trees and reduces variance
195
What are the core concepts for gradient boosting models?
XGBoost, LightGBM, ensemble, used with structured data, and have strong predictive power
196
What are the core concepts for SVM (Support Vector Machine) models?
supervised, used for classification, are effective in high dimensions
197
What are the core concepts for KNN (K-Nearest Neighbors) models?
supervised, used for classification, simple and non-parametric
198
What are the core concepts for Naive Bayes models?
supervised, used for text classification, and based on bayes' theorem
199
What are the core concepts for K-Means models?
unsupervised, used for clustering, and groups by centroid distance
200
What are the core concepts for hierarchical clustering models?
unsupervised, used for group discovery, and dendrogram structure
201
What are the core concepts for PCA (principal component analysis) models?
unsupervised, used for dimensionality reduction, and finds uncorrelated components
202
What are the core concepts for Neural Networks models?
supervised, used for complex patterns, and have layers of weights/activations
203
What are the core concepts for CNN (Convolutional Neural Network) models?
deep learning, used for image data, detects spatial features
204
What are the core concepts for RNN/LSTM?
deep learning, used for sequential data, used for time-series or NLP
205
What are the core concepts for transformer models?
deep learning, used for text/nlp, BERT, GPT; use attention mechanisms
206
What are evaluation metrics used for classification?
accuracy, precision, recall, f1, roc-auc
207
What are evaluation metrics for regression?
MAE, MSE, RMSE, R^2
208
What are evaluation metrics for ranking?
NDCG, MAP, MRR
209
What are evaluation metrics for clustering?
silhouette score, davies-bouldin index
210
What do evaluation metrics for classification score?
quality of class predictions
211
What do evaluation metrics for regression score?
deviation from true values
212
What do evaluation metrics for clustering score?
separation and cohesion of clusters
213
What is gradient descent?
iteratively adjusts parameters to minimize loss
214
What is learning rate?
step size for updates (too high -> unstable, too low -> slow)
215
What is regularization?
penalizes complexity (L1 = Lasso, L2 = Ridge)
216
What is Batch vs Mini-batch?
tradeoff between speed and stability in training
217
What is cross-validation?
split into k folds to validate model robustness
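A dependency-free sketch of the fold split itself (in practice you'd reach for sklearn's KFold; this just shows the index bookkeeping):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k (train, test) folds."""
    folds = []
    # Distribute any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(10, 5)
```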
218
What is Natural Language Processing (NLP) ?
teaching models to understand text
219
What is computer vision (CV)?
understanding images
220
What is generative AI?
creating new data based on learning patterns
221
What is Agentic AI?
AI systems that can autonomously act using tools or memory
222
What are Knowledge Graphs?
graph-based data representation for relationships
223
What is embedding?
vector representation of entities (ex: "Apple" -> [0.23, -0.14, 0.89, …])
224
What is Transfer Learning?
using pre-trained models for new tasks
225
What is data leakage?
using future data during training
226
What is imbalanced data?
uneven class ratios
227
What are high correlation features?
redundant info
228
What are scaling issues?
features with different ranges (use MinMaxScaler or StandardScaler)
229
What is model drift?
data distribution changes over time
230
What is a confusion matrix?
table showing TP, FP, FN, TN
231
What is the precision-recall tradeoff?
high recall catches more positives, may lower precision
232
What is a ROC Curve?
Plots TPR vs FPR; AUC measures overall quality
233
What is feature importance?
which variables influence model most
234
What are different ways of hyperparameter tuning?
GridSearchCV, RandomSearch, Bayesian Optimization
235
What is ensemble learning?
combines models (bagging, boosting, stacking)
236
What is bagging?
bootstrap aggregation; training multiple models on random subsets of data, average results
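A sketch of the bootstrap-and-average step, where each "model" is just the mean of a resample (data values and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

estimates = []
for _ in range(200):
    # bootstrap: sample n points with replacement, fit a trivial "model"
    sample = rng.choice(data, size=len(data), replace=True)
    estimates.append(sample.mean())

bagged = float(np.mean(estimates))  # averaging the models reduces variance
```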
237
What is boosting?
sequentially train models to correct predecessor errors (XGBoost, AdaBoost, CatBoost)
238
What is stacking?
combining different models' predictions via a meta-model (e.g., blending an SVM and a tree)
239
What is blending?
similar to stacking with holdout validation instead of k-fold
240
What is meta-learning?
"learning to learn" - models that adapt across tasks
241
What is self-supervised learning?
use data's own structure for labeling (e.g., masked word prediction in BERT)
242
What is contrastive learning?
learn representations by comparing similar vs dissimilar pairs
243
What is one-hot encoding?
represent categorical features as binary vectors
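In pandas this is one call (the column name and category values below are made up):

```python
import pandas as pd

df = pd.DataFrame({"major": ["CS", "Math", "CS"]})
encoded = pd.get_dummies(df["major"])  # one 0/1 column per category
```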
244
What is target encoding?
replace category with mean target value
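A sketch with a pandas groupby (toy data; in practice the means must come from the training split only, otherwise this leaks the target):

```python
import pandas as pd

df = pd.DataFrame({"major": ["CS", "Math", "CS", "Math"],
                   "GPA":   [3.0,  4.0,   3.5,  3.5]})
means = df.groupby("major")["GPA"].mean()    # per-category target mean
df["major_encoded"] = df["major"].map(means)  # category -> its mean GPA
```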
245
What are polynomial features?
add interaction terms
246
What is feature scaling?
normalize feature ranges (StandardScaler -> mean 0, std 1)
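Standardization done by hand on a toy column (note pandas .std() uses the sample convention, while sklearn's StandardScaler uses the population convention, so results can differ slightly):

```python
import pandas as pd

df = pd.DataFrame({"GPA": [2.0, 3.0, 4.0]})
# subtract the mean, divide by the standard deviation -> mean 0, std 1
scaled = (df["GPA"] - df["GPA"].mean()) / df["GPA"].std()
```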
247
What is imbalanced learning?
handle skewed classes (SMOTE, undersampling, weighted loss)
248
What is dimensionality reduction?
reduce features while preserving relationships
249
What is feature hashing?
convert high-cardinality text to fixed length vectors
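A sketch of the hashing trick; the bucket count is arbitrary, and md5 is used only for a deterministic hash (real implementations use faster non-cryptographic hashes):

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    vec = [0] * n_buckets
    for tok in tokens:
        # map each token to one of n_buckets slots and count it there
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % n_buckets] += 1
    return vec

vec = hash_features(["bloomberg", "ai", "rocks", "ai"])  # fixed length 8
```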
250
What is regularization?
prevents overfitting by penalizing large weights
251
What is dropout?
randomly deactivate neurons during training (common in neural nets)
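An inverted-dropout sketch in numpy (p and the seed are arbitrary); survivors are rescaled by 1/(1-p) so the expected activation is unchanged at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5):
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1 - p
    return x * mask / (1 - p)        # rescale survivors (inverted dropout)

out = dropout(np.ones(1000))  # values are 0.0 (dropped) or 2.0 (kept, rescaled)
```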
252
What is batch normalization?
normalize activations per mini-batch (speeds up deep network training)
253
What is early stopping?
stop training when validation loss stops improving (avoids overfit)
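The patience logic in a few lines (the validation-loss curve here is invented):

```python
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]  # made-up per-epoch losses
patience, best, bad_epochs, stop_epoch = 2, float("inf"), 0, None

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, bad_epochs = loss, 0  # improvement: reset the counter
    else:
        bad_epochs += 1             # no improvement this epoch
        if bad_epochs >= patience:
            stop_epoch = epoch      # patience exhausted: stop training
            break
```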
254
What is momentum?
speeds convergence by smoothing gradient updates
255
What is Adam Optimizer?
adaptive learning rates combining momentum (first moment) with RMSProp-style second-moment scaling
256
What is Learning Rate Scheduler?
adjusts learning rate dynamically
257
What is tokenization?
split text into tokens ("Bloomberg AI rocks" -> ["bloomberg", "ai", "rocks"])
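A minimal lowercase word tokenizer (real NLP pipelines use subword tokenizers such as BPE or WordPiece):

```python
import re

def tokenize(text):
    # lowercase, then pull out runs of letters/digits
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Bloomberg AI rocks")  # -> ["bloomberg", "ai", "rocks"]
```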
258
What is embedding?
vector representation of words
259
What are examples of embedding ?
Word2Vec, GloVe, BERT embeddings
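Embeddings are typically compared with cosine similarity; a sketch on made-up 2-d vectors:

```python
import numpy as np

def cosine(a, b):
    # dot product of the vectors divided by the product of their lengths
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine([1.0, 0.0], [1.0, 0.0])        # identical direction -> 1.0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # unrelated direction -> 0.0
```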
260
What is sequence-to-sequence?
maps input -> output sequences
261
What is attention mechanism?
focus on relevant parts of input
262
What is named entity recognition?
identify entities (Person, Org, Location)
263
What is sentiment analysis?
predict polarity (positive/negative)
264
What is Retrieval-Augmented Generation (RAG) ?
combines retrieval and generation for factual accuracy (ex: chatbots that search documents)
265
What are the six dimensions of data quality?
completeness, accuracy, consistency, uniqueness, timeliness, validity
266
What makes a table not optimized?
redundant data (the same information stored repeatedly); update anomalies (changing one record requires updating many others); insertion anomalies (can't add new data because other data is missing); deletion anomalies (deleting a record unintentionally removes important info); and poor query performance (wide tables with repeated text strings slow down queries)
267
What is 1NF?
Each cell holds a single value (no lists or nested data)
268
What is 2NF?
Every non-key column depends on the entire primary key
269
What is 3NF?
No transitive dependencies (non-key attributes shouldn’t depend on other non-key attributes)
270
In pandas, how do you get the very first rows?
df.head()
271
In pandas, how do you get shape?
df.shape
272
In pandas, how do you get data types?
df.dtypes
273
In pandas, how do you get summary statistics?
df.describe()
274
In pandas, how do you sort values?
df.sort_values("GPA", ascending=False)
275
In pandas, how do you get unique values?
df["major"].unique()
276
In pandas, how do you get missing % per column?
df.isna().mean() * 100
277
In pandas, how do you get standard deviation (spread)?
df["GPA"].std()
278
In pandas, how do you get quantiles?
df["GPA"].quantile([0.25, 0.5, 0.75])
279
In pandas, how do you visualize numerical data in a histogram?
df["GPA"].hist(bins=10)
280
In pandas, how do you create a boxplot?
df.boxplot(column="GPA")
281
In pandas, how do you visualize categorical data?
df["major"].value_counts().plot(kind="bar") # Bar chart
282
How do you do IQR method for outlier detection?
Q1 = df["GPA"].quantile(0.25)
Q3 = df["GPA"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["GPA"] < Q1 - 1.5*IQR) | (df["GPA"] > Q3 + 1.5*IQR)]