A Journey into ... Data Engineering?🤔
"Data is the new oil." - Clive Humby
It has been a while since my last published article, but my theme for this year is Committing to Competence, which means more publications are in the pipeline. Over the past few months I explored areas outside backend, such as Frontend, Cloud, and DevOps. The desire to dig deeper did not burn brightly, and I only gained surface knowledge (Frontend was the exception: VueJS, Nuxt and their ecosystem are a wonder). Then I stumbled upon a video by the Seattle Data Guy on how he became a data engineer, and it lit a fire in me to explore a new field.
Given the enticing nature of the domain, I delved extensively to broaden my understanding. My journey commenced with these resources from my friends:
Samuel Abolo, a Software Engineer, curated a personalized roadmap to get into Data Engineering, and
Damilare Akin-Oladejo, a Machine Learning Engineer, recommended a Data Engineering Zoomcamp playlist by DataTalksClub.
With a Software Engineering background and a solid understanding of Python and databases, I find this journey promising, intriguing and exciting.
So, what is Data Engineering? After reading some articles and books, two definitions stood out:
- "Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information" by AltexSoft in Data Engineering and Its Main Concepts, and
- "Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning" by Joe Reis and Matt Housley, Fundamentals of Data Engineering.
These two definitions highlight Data Engineering using the following concepts:
- set of operations: development, implementation, and maintenance of systems and processes, and
- data control and governance: handling, transforming, and utilizing data to ensure coherence, accuracy, and accessibility, so that raw data is refined into usable information for various applications and downstream users.
Simply put, Data Engineering is using a set of operations or actions to receive, transform, refine and manipulate data to guarantee its quality and usability. This ensures efficient data utilization across various future needs and applications.
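To make "receive, transform, refine" concrete, here is a minimal sketch in plain Python. The record fields (`name`, `signup`, `spend`), the sample data, and the validation rules are all assumptions made up for illustration; a real pipeline would define its own schema and quality checks.

```python
from datetime import date
from typing import Optional

# Hypothetical raw records, as they might arrive from a source system.
RAW_RECORDS = [
    {"name": "  Ada Lovelace ", "signup": "2023-01-15", "spend": "120.50"},
    {"name": "grace hopper", "signup": "2023-02-10", "spend": "80"},
    {"name": "", "signup": "2023-03-01", "spend": "15.25"},           # missing name
    {"name": "Alan Turing", "signup": "2023-02-30", "spend": "42"},   # impossible date
    {"name": "Linus", "signup": "2023-03-09", "spend": "n/a"},        # not a number
]


def refine(record: dict) -> Optional[dict]:
    """Return a cleaned, typed copy of one record, or None if it is unusable."""
    name = record.get("name", "").strip().title()
    if not name:
        return None
    try:
        signup = date.fromisoformat(record["signup"])  # rejects invalid dates
        spend = float(record["spend"])                 # rejects non-numeric text
    except (KeyError, ValueError):
        return None
    return {"name": name, "signup": signup, "spend": spend}


def refine_all(records: list) -> list:
    """Refine every record and keep only the ones that pass validation."""
    refined = (refine(r) for r in records)
    return [r for r in refined if r is not None]


clean_records = refine_all(RAW_RECORDS)
```

Only the first two records survive here: names are normalized, dates become `date` objects, and amounts become floats. Dropping bad rows silently is just one possible choice; another would be to route them to a separate "rejected" collection for inspection.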
To manage data as a data engineer, one needs to be familiar with the Data Engineering lifecycle:
Data Engineering Lifecycle by Fundamentals of Data Engineering
The lifecycle depicts the movement of data from its origin or source, through ingestion and transformation, to the main goal: serving the refined data to users such as data analysts, data scientists, and machine learning engineers. Storage sits at the base of the diagram because it occurs throughout the lifecycle as data flows from beginning to end.
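The stages of the lifecycle can be sketched as a toy pipeline using only Python's standard library. This is an illustration under stated assumptions, not how the book or any production system implements it: the order events, the table names (`raw_orders`, `customer_totals`), and the function names are all hypothetical, and an in-memory SQLite database stands in for storage.

```python
import sqlite3


def generate() -> list:
    """Generation: a hypothetical source system emitting raw order events."""
    return [
        {"order_id": 1, "customer": "ada", "amount": "20.50"},
        {"order_id": 2, "customer": "grace", "amount": "5.00"},
        {"order_id": 3, "customer": "ada", "amount": "9.50"},
    ]


def ingest(conn: sqlite3.Connection, events: list) -> None:
    """Ingestion: land the raw events, unchanged, in a raw table (storage)."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)", events
    )


def transform(conn: sqlite3.Connection) -> None:
    """Transformation: cast text amounts to numbers and aggregate per customer."""
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT customer, SUM(CAST(amount AS REAL)) AS total
        FROM raw_orders
        GROUP BY customer
        """
    )


def serve(conn: sqlite3.Connection) -> dict:
    """Serving: hand the refined data to downstream users as a simple mapping."""
    cursor = conn.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    )
    return dict(cursor.fetchall())


def run_pipeline() -> dict:
    """Run every stage in order against one in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        ingest(conn, generate())
        transform(conn)
        return serve(conn)
    finally:
        conn.close()
```

Calling `run_pipeline()` returns the per-customer totals. Note how the database is touched by every stage after generation, which mirrors why storage sits underneath the whole lifecycle rather than being a single step.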
Data Engineering is also an intersection of various fields in technology such as:
- Security
- Data management
- DataOps
- Data Architecture
- Orchestration and
- Software Engineering
These are also called the "Undercurrents of Data Engineering", i.e. critical ideas that run across the entire Data Engineering lifecycle.
Data engineers are highly sought after. According to the 2021 Data Science Interview Report, Data Engineering-specific interviews increased by 40%, and Data Engineering has become the new cool kid on the data block. There is also a shortage of applicants for data engineering roles: a free guide to becoming a data engineer reports an average of 2.56 applications for Data Engineering, compared to 4.76 for Data Science.
Companies are looking for data engineers with various tools in their toolbelts to perform different functions. A few of the skills they look for when hiring are data structures and algorithms, programming languages such as Python and Java, cloud platforms such as GCP, AWS or Azure, strong knowledge of SQL and NoSQL databases (PostgreSQL, MongoDB etc.), and data warehouses such as Google BigQuery and its equivalents.
To gain more experience in the data engineering field, I:
- took an introductory course to learn about the concepts, terminologies, and basic skills required,
- watched an opinionated, curated YouTube playlist that explains topics in depth, such as what to do, what not to do, skills to have, and a dedicated roadmap, and
- am currently reading Fundamentals of Data Engineering. This book introduces data engineering foundations, the Data and Data Engineering lifecycles, and concepts such as data generation, ingestion, orchestration, and governance. It also states what it does not do, which is teach data engineering using a particular tool, technology or platform.
In conclusion, data engineering is a wide and interesting field to explore. Data Engineering, Data Science and Data Analysis share similarities, but the three are not the same. I plan to write and build more after gaining a reasonable amount of knowledge. By the way, DataTalksClub is starting a new Zoomcamp cohort; you can find more information here. My next step is to take the DataTalksClub Zoomcamp and document my progress. Until next time, always remember: "Data is the new oil." - Clive Humby