Data Science: A Realistic Perspective on Challenges and Insights
Chapter 1: Understanding the Reality of Data Science
As a data scientist with experience in software engineering and various side projects, I've often found myself in the role of a mediator between engineers and data professionals. It's impossible to ignore the existing tensions between these two groups. In this article, I'll share my mixed feelings about data science and the profession itself. By the end, you can expect to gain:
- 5 Key Insights on Data Science
- 15 Practical Tips for Building a Cohesive Data Science Team
- A sprinkle of my dry humor.
Realization 1: The Importance of Hypothesis in Data Science
The portrayal of data science by tech giants and sensational online content can be misleading. It's essential to recognize that data science encompasses more than just model tuning or selecting algorithms. Recall the scientific method from school; that's the essence of what "science" signifies in data science.
Data scientists analyze datasets alongside business challenges, crafting experimental designs to meet objectives. It's a common pitfall to apply various models without a foundational hypothesis. This approach is not only inelegant but also inefficient. When we eventually identify a model with acceptable performance metrics, we often scramble to understand why it succeeded.
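To make this concrete, here is a minimal sketch of what a hypothesis-first experiment record could look like; the `Experiment` dataclass and its field names are my own illustration, not a prescribed standard:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Experiment:
    """A hypothesis-first record, written down BEFORE any model is trained."""
    hypothesis: str      # the mechanism we believe drives the target
    success_metric: str  # how the model will be judged
    target: float        # what counts as success, agreed on up front
    result: Optional[float] = None
    created: date = field(default_factory=date.today)

exp = Experiment(
    hypothesis="Churn spikes follow failed payments, so payment-failure "
               "features should lift recall on the churn class.",
    success_metric="recall on churn class",
    target=0.80,
)
# Modeling starts only after this exists; if the target is reached, we can
# explain WHY the model works instead of reverse-engineering a lucky one.
```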
Consider Bitcoin's sudden surge past $50,000 in February 2024. Was it mere coincidence that the move lined up with established support and resistance lines? Or was there an underlying market psychology we overlooked because we never hypothesized the motivations behind the price movements?
Lesson 1: Without a hypothesis, we risk mistaking correlation for causation.
Lesson 2: Lacking a hypothesis may lead to retrofitting our modeling logic, resulting in potential bias.
Lesson 3: Always seek documentation from your data science peers regarding the rationale behind any models ready for deployment.
Realization 2: The Necessity of Clean Code
I have a love-hate relationship with Jupyter Notebooks. While they are fantastic for experimentation, they can also become chaotic and cluttered. Unfortunately, in fast-moving projects these disorganized notebooks sometimes get shipped as the deliverable itself, which is concerning.
Who is responsible for tidying up this chaos? "Hey, engineer, we have a training notebook; you can use that for production." But how should the model be utilized? What steps are needed to replicate the training process? How are we managing version control?
Data scientists often take for granted that engineers will sort these issues out, which can complicate matters for everyone. Imagine adjusting a log-transform function from base 10 to natural logarithm in one of your training notebooks. Good luck expecting others to recognize that change in production data pipelines. Is it the engineers' fault if something goes wrong, or is it the data scientist's responsibility for not documenting the intended use of deliverables?
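As a hedged illustration of the fix (the module and function names here are hypothetical), the transform belongs in one shared, versioned module that both the notebook and the production pipeline import, so a base change becomes a reviewed code change rather than a silent divergence:

```python
# features.py -- single source of truth, imported by BOTH the training
# notebook and the production data pipeline. If the transform ever moves
# from log10 to natural log, it changes in exactly one reviewable place.
import numpy as np

def log_transform(x: np.ndarray) -> np.ndarray:
    """Natural-log transform; note any base change in the changelog."""
    return np.log1p(x)  # log1p handles zero values safely
```

Both sides then call `log_transform` instead of re-implementing it inline, and version control tells everyone exactly when the definition changed.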
Now, think about revisiting that notebook two months later for model performance evaluation. Wouldn't it be simpler if we took the time to organize our work?
Lesson 4: Unclear historical code is problematic, especially when created by others. Always deliver your work as if you might need to inherit it later.
Lesson 5: Show empathy; not everyone possesses the same skills as data scientists. If you modify the training pipeline to enhance performance, document and share your findings with your team.
Lesson 6: Keep a comprehensive inventory of documentation and resources from your data scientists. Ensure you review this before proceeding to production.
Realization 3: Embrace DataOps, MLOps, and Data Engineering
Your models may perform excellently, but if the training process only runs on your local machine, you're creating a bottleneck. Even the Romans recognized that merely drawing water from wells would not suffice; they invented aqueducts and built an empire!
Just as developers learn DevOps principles, data scientists should familiarize themselves with MLOps and data engineering. Rather than dismissing these concepts as mere plumbing, see them as essential tools.
Understanding data engineering helps clarify the requirements for both upstream and downstream tasks, while MLOps and DataOps facilitate automated benchmarking and monitoring of models and datasets. By acquiring this knowledge, you can enhance your value in any tech team and position yourself as a versatile tech lead.
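As a small, hedged example of what "automated benchmarking" can mean in practice, here is a CI-style gate; the `baseline.json` artifact and the function name are assumptions of mine, not a standard tool:

```python
import json
from sklearn.metrics import f1_score

def benchmark_gate(model, X_holdout, y_holdout, baseline_path="baseline.json"):
    """Fail the pipeline if the candidate regresses on a fixed holdout set."""
    with open(baseline_path) as f:
        baseline_f1 = json.load(f)["f1"]

    candidate_f1 = f1_score(y_holdout, model.predict(X_holdout))
    if candidate_f1 < baseline_f1:
        raise RuntimeError(
            f"Candidate F1 {candidate_f1:.3f} below baseline {baseline_f1:.3f}"
        )
    return candidate_f1
```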
Lesson 7: Continuously automate your processes; otherwise, you risk becoming a bottleneck.
Lesson 8: Familiarize yourself with adjacent fields; cross-training is common sense in every industry.
Realization 4: Open Collaboration is Key
Perfectionism can be a common trait among tech professionals. While we strive to improve our code, it's crucial to acknowledge that a model that doesn't receive real-world traffic is ultimately useless.
What are the benefits of keeping development hidden? If business stakeholders inquire about model statuses only to find that the model exists but isn't live, the engineering team may unfairly bear the blame for the delay in deployment.
As data scientists, we have an array of tools at our disposal, and hiding models behind closed doors is ultimately counterproductive. We can borrow from project management by defining what success looks like for each experiment up front: say, a 10% increase in F1 score or a 5% reduction in inference latency. Once you define your goals, evaluate your results against them. If you meet your target, push your model to production and take credit for your work!
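A minimal sketch of such an automated release decision, reusing the example targets above (the function and the metric dictionaries are illustrative, not a real library API):

```python
def should_release(baseline: dict, candidate: dict) -> bool:
    """Release if the candidate meets the goals agreed on BEFORE the experiment:
    a 10% relative F1 gain, or a 5% latency cut without losing F1."""
    f1_gain = (candidate["f1"] - baseline["f1"]) / baseline["f1"]
    latency_cut = (baseline["latency_ms"] - candidate["latency_ms"]) / baseline["latency_ms"]
    return f1_gain >= 0.10 or (latency_cut >= 0.05 and candidate["f1"] >= baseline["f1"])

baseline = {"f1": 0.80, "latency_ms": 120.0}
candidate = {"f1": 0.89, "latency_ms": 118.0}
print(should_release(baseline, candidate))  # True: ~11% relative F1 gain
```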
We can also consider release management strategies. Acknowledging that models can always be refined and that data and concept drift are common challenges, we could utilize canary deployments or A/B testing to mitigate risk.
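As a rough sketch of the canary idea (the 5% share and the routing function are illustrative assumptions), deterministic hashing keeps each user on the same variant so results remain comparable:

```python
import hashlib

def pick_model(user_id: str, canary_share: float = 0.05) -> str:
    """Route a small, stable slice of traffic to the candidate model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < canary_share * 100 else "production"

print(pick_model("user-42"))  # the same user always lands in the same bucket
```

If the candidate misbehaves, only a small slice of traffic is affected, and rolling back is a one-line change to the share.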
In essence, a model that never sees live traffic is akin to an unfulfilled side project—both are devoid of meaning.
Lesson 9: Concealed development wastes resources and breeds friction between technical teams.
Lesson 10: Prevent your team from falling into an endless cycle of experimentation that fails to produce actionable results.
Lesson 11: Establish clear goals for each experiment and automate release decisions.
Lesson 12: If you're uncertain about your model's performance, consider alternative release strategies rather than keeping it hidden.
A Quick Note for Readers
If you've made it this far, you or your team might be interested in the fundamentals of data science. Check out my other blogs where I break down various machine learning algorithms!
Realization 5: The Limits of Automation in Data Science
While large language models (LLMs) are remarkable and models are evolving to be multi-modal, it's crucial to recognize that data science isn't fully automatable—at least not yet.
AutoML and LLMs can simultaneously assess multiple factors during feature engineering, but the metrics still require review from data product owners before any model can be deployed.
For example, if our target is to achieve a 10% increase in F1 score, is that goal sustainable long-term? What happens when we reach a 97% F1 score? Should we still pursue marginal gains? Or what if a model achieves a 96.9% F1 score but is far more interpretable?
Without data scientists, we risk relying on engineers to generate innovative ML solutions. If those engineers lack data and ML literacy, managed and LLM-driven solutions may dominate the data science landscape.
These solutions can be categorized as follows:
- Managed solutions facilitate rapid prototyping, such as uploading documents to ChatGPT without needing to navigate complex frameworks.
- AutoML aids in decision-making by trying various approaches and presenting the results for selection.
- LLMs respond to user prompts or training data, providing solutions based on similar problems they've encountered.
I appreciate these advancements, and I believe many business challenges can be addressed with a combination of these tools. Yet it's essential to discern when you've exhausted what managed solutions can offer and need to refine the embedding function yourself or address niche domains that LLMs may not cover adequately.
Lesson 13: Empower your technical team by combining managed solutions, AutoML, and LLMs, potentially reducing the need for a dedicated data science function.
Lesson 14: Full-stack data and ML-savvy engineers are rare gems. If an engineer claims they can fully automate data science, they are either exceptionally skilled or you might need to reassess your expectations.
Lesson 15: Combining managed solutions, AutoML, and LLMs makes insights accessible. For thought leaders and service providers, delivering high-value insights is essential to remain relevant.
If you've engaged with this content, I hope my reflections and insights resonate with you. This piece differs from my usual technical posts, focusing instead on broader themes. If you found value in this style or have differing opinions, feel free to reach out or comment. Let's foster a constructive dialogue!
Until next time, this is Louis.
Note: The views expressed here are solely my own and do not represent any other individuals or organizations.