Data wrangling consists of…
Data preparation, data munging, data transformation
What is the curse of dimensionality, and how can it be addressed?
Wrangling Stages
a. Raw: data ingestion and discovery — “unboxing”
- Data unboxing
What data do I have? What do I want to do with it?
Basic tools
- UNIX command line
- Sublime Text editor
Trifacta: free visual data wrangling tool
- Codifies some good practices you can also follow by hand
Python’s Pandas library
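As a rough illustration of the "unboxing" step with Pandas, the sketch below loads a small table and answers the "What data do I have?" question: how many rows and columns, which types, and how many missing values per column. The sample data, the column names, and the helper name unbox are assumptions made up for this example; in practice the source would be a real file path.

```python
import io

import pandas as pd

# Hypothetical raw data standing in for a real file on disk.
RAW_CSV = """id,name,signup_date,amount
1,Ana,2021-01-05,10.5
2,Bob,,7
3,,2021-02-11,
"""


def unbox(source):
    """Load a CSV and return the frame plus a small discovery summary."""
    df = pd.read_csv(source)
    summary = {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        # Column types as plain strings, e.g. "int64" or "float64".
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Count of missing values per column.
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
    }
    return df, summary


df, summary = unbox(io.StringIO(RAW_CSV))
print(summary)
```

The same questions can be answered interactively with df.head(), df.info() and df.describe(); collecting them in one dictionary just makes the result easy to log or compare between files.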
b. Refined: curating data for reuse
- What: data warehousing, canonical models
- Who: data curators, IT engineers, actuaries …
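One way to picture a canonical model is as a single agreed set of column names and types that every source is mapped onto before reuse. The sketch below assumes two hypothetical sources ("crm" and "billing") with invented column names; the mapping table and the helper to_canonical are illustrative only, not a prescribed design.

```python
import pandas as pd

# Hypothetical canonical schema agreed on by the data curators.
CANONICAL_COLUMNS = ["customer_id", "full_name", "signup_date"]

# Hypothetical per-source mappings from local column names to canonical ones.
SOURCE_MAPPINGS = {
    "crm": {"CustID": "customer_id", "Name": "full_name", "Created": "signup_date"},
    "billing": {"customer": "customer_id", "fullname": "full_name", "since": "signup_date"},
}


def to_canonical(df, source):
    """Rename a source frame's columns to the canonical model and fix types."""
    out = df.rename(columns=SOURCE_MAPPINGS[source])
    missing = [col for col in CANONICAL_COLUMNS if col not in out.columns]
    if missing:
        raise ValueError(f"{source}: missing canonical columns {missing}")
    out = out[CANONICAL_COLUMNS].copy()
    # Store dates as real timestamps rather than strings.
    out["signup_date"] = pd.to_datetime(out["signup_date"])
    return out


crm = pd.DataFrame({"CustID": [1], "Name": ["Ana"], "Created": ["2021-01-05"]})
billing = pd.DataFrame({"customer": [2], "fullname": ["Bob"], "since": ["2020-11-30"]})

curated = pd.concat(
    [to_canonical(crm, "crm"), to_canonical(billing, "billing")],
    ignore_index=True,
)
print(curated)
```

Keeping the mappings in data rather than in code means a new source can be added by extending the dictionary, which is one reason curated layers are often driven by such tables.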
c. Production: ensuring feeds and workflows
- What: recurrent, automated use cases
- Who: often involves software engineers and IT/ops folks
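For recurrent, automated feeds, a common safeguard is a validation step that runs on every delivery before the data moves downstream. The following is a minimal sketch under assumed rules: the required columns, the specific checks, and the helper validate_feed are all hypothetical and would depend on the actual use case.

```python
import pandas as pd

# Hypothetical contract for an incoming feed.
REQUIRED_COLUMNS = {"customer_id", "amount"}


def validate_feed(df):
    """Return a list of human-readable problems; an empty list means the feed passed."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        # Without the required columns the remaining checks cannot run.
        return [f"missing columns: {sorted(missing)}"]
    if df.empty:
        return ["feed is empty"]

    problems = []
    if df["customer_id"].isna().any():
        problems.append("null customer_id")
    if df["customer_id"].duplicated().any():
        problems.append("duplicate customer_id")
    if (df["amount"] < 0).any():
        problems.append("negative amount")
    return problems


good = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 0.0]})
bad = pd.DataFrame({"customer_id": [1, 1], "amount": [5.0, -2.0]})

print(validate_feed(good))
print(validate_feed(bad))
```

Returning a list of problems instead of raising on the first one lets a scheduled job report everything wrong with a delivery in a single alert.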
Data Wrangling issues