Transforming SQL Management: Emulating Software Engineering Practices
Chapter 1: The Importance of SQL as Code
In my two years at a prominent Silicon Valley tech firm, I noticed a strong parallel between how Data Engineers handle SQL and how Software Engineers handle code. Businesses of any size can adopt this mindset to improve their data strategies. In this piece, I explore the advantages of treating SQL like code and offer specific recommendations for organizations.
SQL is a query language, so why should it be treated with the same care as code? Much like object-oriented code, SQL can be time-consuming to write, hard to debug, difficult to read, and prone to version control problems. SQL is also central to building data pipelines, and when those pipelines fail, they need to be easy to debug and repair. Centralizing your SQL code should therefore be part of your data strategy.
Code management tools make it clear who has modified or maintained a given SQL script and make it easy to see changes across related queries. This speeds up locating a failing commit, reverting a change, or shipping a fix. Once SQL code is committed, it is deployed immediately to the development environment, where development pipelines can be run quickly to find and resolve failures. Regular test releases give confidence that the code will meet production standards. After successful testing, a production release rolls out all SQL changes made since the last deployment, usually with minimal issues.
What can smaller organizations do?
Take cues from your software engineering culture, which likely has a more robust framework. Begin to understand and incorporate tools used by software teams, such as Git and IDEs. Transition your scripts from local storage to centralized locations, and aim to eliminate views, materialized views, and stored procedures.
Section 1.1: Implementing Tooling for SQL Management
Large organizations manage most of their code in extensive, centralized repositories. When an existing SQL script needs a modification or a new script is written, the author creates a change list, which is similar to a pull request. Another engineer must approve it after testing. Once the reviewer gives the green light, the author can commit the changes to the repository. While change control is standard, a notable aspect in larger firms is the emphasis on code formatting. Well-formatted code significantly reduces the time needed to comprehend, debug, and alter another author's work. Many companies with established engineering cultures have mechanisms that automatically reject changes that do not adhere to coding standards.
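To make the idea of automatic rejection concrete, here is a minimal sketch of a pre-submit style check. The rules (uppercase keywords, no trailing whitespace, a line-length limit, no SELECT *) and the names KEYWORDS, MAX_LINE_LENGTH, and check_sql_style are assumptions chosen for illustration, not any particular company's standard. A real setup would more likely rely on an established open-source SQL linter.

```python
import re

# Hypothetical house rules, chosen only for illustration.
KEYWORDS = ("select", "from", "where", "join", "group by", "order by")
MAX_LINE_LENGTH = 100


def check_sql_style(sql: str) -> list[str]:
    """Return a list of style violations; an empty list means the script passes."""
    problems = []
    for number, line in enumerate(sql.splitlines(), start=1):
        if line != line.rstrip():
            problems.append(f"line {number}: trailing whitespace")
        if len(line) > MAX_LINE_LENGTH:
            problems.append(f"line {number}: longer than {MAX_LINE_LENGTH} characters")
        # Blank out single-quoted string literals so their contents are not flagged.
        code = re.sub(r"'[^']*'", "''", line)
        for keyword in KEYWORDS:
            pattern = r"\b" + keyword.replace(" ", r"\s+") + r"\b"
            for match in re.finditer(pattern, code, flags=re.IGNORECASE):
                if match.group(0) != match.group(0).upper():
                    problems.append(
                        f"line {number}: keyword '{match.group(0)}' should be uppercase"
                    )
        if re.search(r"\bSELECT\s+\*", code, flags=re.IGNORECASE):
            problems.append(f"line {number}: avoid SELECT *")
    return problems
```

A review tool or commit hook could call check_sql_style on each changed file and block the submission whenever the returned list is not empty. This sketch does not understand SQL comments or quoted identifiers, so it would need refinement before real use.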
What can smaller organizations do?
Select one code repository and use it consistently, ideally the one your engineering team already uses. Centralize all SQL code in this repository, including code for Data Engineering, Analytics, and Business Intelligence. Start adopting code formatting standards, using open-source tools to improve readability and maintainability. Set aside time to run existing code through a formatting tool, and enforce the standards for all future submissions. Before long, Data Engineers will likely embrace these standards, making SQL easier to read, write, and understand.
Subsection 1.1.1: The Role of Version Control and Environments
Without version control, frequent changes can lead to regressions that are hard to fix. When a submitted change breaks a pipeline, identifying the specific change to roll back can be tricky. The same principle underlies continuous integration in software engineering. In a well-structured environment, undesired changes can be handled without disrupting operations. SQL changes take effect in the development environment as soon as they are submitted, allowing for early failure detection. On rare occasions, changes that succeed in development may still fail in production due to various factors, which is why a pre-production testing environment can be beneficial.
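As a rough illustration of early failure detection, the sketch below runs a set of SQL scripts against a throwaway database and reports which ones fail. It uses an in-memory SQLite database as a stand-in for a development environment, and the function name run_in_dev and the example scripts are hypothetical. A real development environment would run on the same engine as production.

```python
import sqlite3


def run_in_dev(scripts: dict[str, str]) -> dict[str, str]:
    """Run each named SQL script against a throwaway in-memory database.

    Returns a mapping of script name to error message for every script that failed.
    """
    failures = {}
    connection = sqlite3.connect(":memory:")  # stand-in for a development environment
    try:
        for name, sql in scripts.items():
            try:
                connection.executescript(sql)
            except sqlite3.Error as error:
                # Record the failure and keep going so every script gets checked.
                failures[name] = str(error)
    finally:
        connection.close()
    return failures


if __name__ == "__main__":
    example_scripts = {
        "create_orders": "CREATE TABLE orders (id INTEGER, amount REAL);",
        "daily_total": "SELECT SUM(amount) FROM orders;",
        "broken": "SELECT amount FROM order_items;",  # references a missing table
    }
    for script, message in run_in_dev(example_scripts).items():
        print(f"{script} failed: {message}")
```

Running a check like this on every submission means a script that references a missing table is caught minutes after it is committed rather than during a production run.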
What can smaller organizations do?
At a minimum, establish a development environment. Testing code within your organization's data infrastructure will help minimize failures. Open-source tools like dbt provide a manageable layer of abstraction, allowing each table to exist as both a development table and a production table. Regular release cycles can then promote all submitted code to production.
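The idea of one table definition serving both environments can be sketched in a few lines. This is not dbt's API. It is a simplified illustration in which the schema names, the SCHEMAS mapping, the render function, and the DAILY_REVENUE query are all hypothetical.

```python
from string import Template

# Hypothetical schema names; a real project would read these from configuration.
SCHEMAS = {"dev": "analytics_dev", "prod": "analytics"}


def render(sql_template: str, target: str) -> str:
    """Fill in the schema placeholder for the chosen environment."""
    if target not in SCHEMAS:
        raise ValueError(f"unknown target: {target}")
    return Template(sql_template).substitute(schema=SCHEMAS[target])


# One definition, written once and stored in the repository.
DAILY_REVENUE = """
CREATE TABLE ${schema}.daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM ${schema}.orders
GROUP BY order_date
"""

if __name__ == "__main__":
    print(render(DAILY_REVENUE, "dev"))   # builds the development table
    print(render(DAILY_REVENUE, "prod"))  # builds the production table
```

Because the query text is identical in both environments, whatever passes in development is exactly what a release promotes to production.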
Chapter 2: Embracing Open Code Access
Large companies with diverse engineering requirements use vast code repositories to house nearly all their code. It can be challenging to keep track of everyone who depends on a product. For instance, without a centralized codebase, a Software Engineer working on a production application might not be aware of the downstream effects of their changes. Centralization makes it easy to search for scripts, queries, and applications that rely on an application's outputs, so the engineer can reach out to the other teams and coordinate changes. A lack of transparency in code can lead to fragmented development. While some repositories for sensitive projects may not be openly accessible, these instances are rare and usually reserved for extreme cases. Companies that operate on a large scale typically design their code architecture with an emphasis on trust, and yours should too.
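To show how a centralized repository makes downstream dependencies searchable, here is a small sketch that lists every SQL file mentioning a given table. The function name find_references and the assumption that SQL lives in .sql files are illustrative. Large organizations typically have dedicated code search tools for this.

```python
import re
from pathlib import Path


def find_references(repo_root: str, table_name: str) -> list[str]:
    """List .sql files under repo_root that mention table_name as a whole word."""
    pattern = re.compile(r"\b" + re.escape(table_name) + r"\b", flags=re.IGNORECASE)
    matches = []
    for path in sorted(Path(repo_root).rglob("*.sql")):
        if pattern.search(path.read_text(encoding="utf-8")):
            # Report paths relative to the repository root for readability.
            matches.append(str(path.relative_to(repo_root)))
    return matches
```

Before changing a table, an engineer could call find_references on the repository root with the table's name and contact the owners of every file returned. The whole-word match keeps a search for orders from also returning orders_archive, but it is still a plain text search and will also match mentions inside comments.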
What can smaller organizations do?
Foster a culture of trust and communication regarding your codebase and repositories. From an engineering standpoint, your projects don't need to be as secretive as you might think. Avoid hiring engineers whom you would not trust with access to the entire codebase. Encourage collaboration between Software Engineering and Data Engineering, and develop processes to address potential downstream consequences before code is deployed to production.