20 January 2026

Build a Career in Data Science

by Arpon Sarker

Introduction

It is just under two weeks before I start my first role as a Junior Data Scientist. I am incredibly grateful for this rare opportunity, and I was a little disheartened at not being considered for graduate programs (although I was late with applications). This book - Build a Career in Data Science - is particularly helpful, and every page is relevant to this stage of life where I start out in the data science profession. The book covers acquiring the skills and knowledge, applying and going through the interview process, starting data science as your first job, and then how to progress or how to leave.

What is Data Science? It is the practice of using data to understand and solve real-world problems. It combines mathematics/statistics, databases/programming (R/Python for analysis, SQL for databases, and Git for version control), and business understanding.

Types of DS Jobs

There are 3 different types of data science jobs:

Other roles include BI analyst (similar to an analyst but uses less statistics and programming), data engineer (keeping data maintained in databases and ensuring that people get the data they need - no reports, analyses, or models), and research scientist.

Data Science Companies

TIP: make friends with people who have domain knowledge so you can get to know your data.

Building a Portfolio

Your repository could include an analysis, a model, an explanation of a statistical technique, or a tutorial.

  1. dataset -> question -> analysis -> repo/blog

  2. question -> dataset -> analysis -> repo/blog

Other projects:

Use meetups to find people, and try to ask for informational interviews to learn more about the role, industry, or company.

First Months on the Job

Make sure you ask better questions:

Make sure to build relationships:

Making an Effective Analysis

A good analysis:

For machine learning engineers, analyses share how well models perform and show the value in building a new model or how models change over time.

Analysis Process:

  1. The Request

Convert the business question into a data science question, then take the data science answer and return it as a business answer. The foundational knowledge includes: ‘who is requesting the analysis?’, ‘what is the motive?’, ‘what is the request?’, ‘what decision is being made?’, and ‘do we have the required data?’

  2. The Analysis Plan

Before looking at the data, write down everything I plan to do with it. Make it actionable, and structure the analysis plan so that code can be reused across different sections.

Template:

Ask the manager/stakeholder for approval.

  3. Doing the Analysis
    • Importing and cleaning the data does not look productive to nontechnical people; get to data exploration quickly; spend as little time as possible on anything that won’t be needed and vice versa; talk to stakeholders if data is weird
    • For data exploration and modeling, use general summarisation and transformation for simple analysis work
    • Visualise data or create summary tables (use Git and save old code, since you can go back and forth between visualisations and summarisations); a minimal sketch follows this list
    • Create a model as needed and isolate the code from general analysis code
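
For the summarise-and-visualise steps, a minimal sketch in Python/pandas (the file and column names here are hypothetical, not from the book):

```python
# Minimal explore/summarise sketch; "data.csv", "segment" and "revenue" are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Summary table: average revenue and row count per segment
summary = df.groupby("segment")["revenue"].agg(["mean", "count"])
print(summary)

# Simple visualisation of the same summary
summary["mean"].plot(kind="bar", title="Average revenue by segment")
plt.tight_layout()
plt.savefig("revenue_by_segment.png")
```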

Repeat these steps for each point on the analysis plan.

Continuously maintain a level of polish so that progress can be shared with the stakeholder at any time.

  4. Wrapping it Up
    • Use a narrative for the final presentation
    • Mothballing your work:
    • double-check that I can rerun the whole analysis (should take 1 click; a minimal sketch follows this list)
    • comment code
    • README file
    • store code securely (GitHub)
    • ensure data is stored safely (S3)
    • output stored in a shared location
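
As a sketch, the "1 click" rerun can be a small driver script that executes each stage of the analysis in order (the stage filenames below are hypothetical):

```python
# run_all.py - hypothetical one-click rerun of the whole analysis
import subprocess

steps = [
    "01_import_and_clean.py",
    "02_explore.py",
    "03_model.py",
    "04_report.py",
]

for step in steps:
    print(f"Running {step} ...")
    subprocess.run(["python", step], check=True)  # stop immediately if a stage fails
```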

TIP: Use the company colour palette as the theme
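
In Python/matplotlib, for example, the company palette can be set once as the default colour cycle; the hex codes below are placeholders, not a real brand palette:

```python
# Set a (placeholder) company palette as the default colour cycle for all plots.
import matplotlib.pyplot as plt
from cycler import cycler

company_palette = ["#0B3954", "#FF6663", "#E0FF4F"]  # placeholder brand colours
plt.rcParams["axes.prop_cycle"] = cycler(color=company_palette)

# Every subsequent plot now uses the company colours by default
plt.plot([1, 2, 3], [3, 1, 2])
plt.title("Example plot in company colours")
plt.savefig("example.png")
```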

Deploying a Model into Production

“Deploying to production” means putting code on a system that allows it to run continuously. Production machine learning models work in near real time to make predictions OR classify something based on provided data. These models should be able to handle weird special cases without crashing the environment, and should be maintainable (retrained on newer data, with performance that can be monitored).

  1. Building ML Model

Use the same steps as the analysis. The model will be converted to a format that other programs can use (APIs). It should be deployed into a test environment first. Consider what data will be needed in real time when the model is run, and understand how model performance translates into business margins.
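
One common way to get the model into a format other programs can use is to serialise the fitted model; a minimal sketch with scikit-learn and joblib (the dataset and feature names are hypothetical):

```python
# Train once and save the fitted model; the API later loads this file instead of retraining.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("training_data.csv")            # hypothetical training data
features = ["tenure_months", "monthly_spend"]    # hypothetical feature columns
X, y = df[features], df["churned"]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

joblib.dump(model, "churn_model.joblib")         # artifact the REST API will load
```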

  2. How to Deploy in Production?

Use REST APIs, which are like tiny websites that return data instead of HTML (over HTTP). Using Python, you can set this up with Flask (R uses Plumber). To pass information to the model, you can either put it inside the URL (e.g. by adding a unique ID) OR include it in the body of the request.
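
A minimal sketch of such an endpoint in Flask, assuming the serialised model and hypothetical feature names from the sketch above:

```python
# app.py - tiny REST API that returns a prediction as JSON instead of HTML
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("churn_model.joblib")  # model saved earlier

@app.route("/predict", methods=["POST"])
def predict():
    # Features arrive in the body of the request as JSON
    payload = request.get_json()
    features = [[payload["tenure_months"], payload["monthly_spend"]]]
    prediction = model.predict(features)[0]
    return jsonify({"churn_prediction": int(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A client would then POST a JSON body such as `{"tenure_months": 12, "monthly_spend": 40.0}` to `/predict` and get the prediction back as JSON.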

Make sure to create documentation on the API’s design, such as the endpoint URLs, what needs to be included in the request, the format and content of the response, why it was created, the requirements the API needs to run, and how to install it elsewhere. This can be formatted as an OpenAPI document.

Make sure to conduct unit testing by testing each endpoint and individually testing functions under different conditions.
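
A sketch of an endpoint unit test with pytest and Flask's built-in test client, written against the hypothetical app above:

```python
# test_app.py - run with `pytest`
from app import app  # the hypothetical Flask app sketched earlier

def test_predict_returns_prediction():
    client = app.test_client()
    response = client.post(
        "/predict",
        json={"tenure_months": 12, "monthly_spend": 40.0},
    )
    assert response.status_code == 200
    assert "churn_prediction" in response.get_json()
```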

Now we need to move the code to a server where it can run continuously (which could be on the cloud). There are 2 ways to do this:

Having to move code manually is error-prone; the solution is CI/CD tools.

Continuous Integration (CI) is having code recompiled (not necessary for Python) and unit tests run automatically every time it is committed to a repository. Continuous Deployment (CD) is taking the output of the CI tool and automatically deploying it to the production system.

Check repo for changes -> Run build process (unit tests) -> move resulting code to VM

Make sure to conduct a load test: run at least twice as many API requests as I expect to see and check that it still works.
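
A rough load-test sketch using only the requests library and a thread pool (the endpoint and numbers are assumptions; dedicated tools such as Locust do this more rigorously):

```python
# Fire many concurrent requests at the API and report how many succeed.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:5000/predict"               # assumed endpoint from the sketch above
PAYLOAD = {"tenure_months": 12, "monthly_spend": 40.0}
N_REQUESTS = 1000                                    # at least twice the expected traffic

def call_api(_):
    try:
        return requests.post(URL, json=PAYLOAD, timeout=5).status_code == 200
    except requests.RequestException:
        return False

with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(call_api, range(N_REQUESTS)))

print(f"{sum(results)}/{N_REQUESTS} requests succeeded")
```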

  3. Keeping the System Running

The DevOps team makes sure the API is always running. This is done through monitoring (recording internal issues such as errors), telemetry (recording events such as the requests and responses made), and alerting (automatic emails or Slack messages).
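
Inside the API itself, the monitoring and telemetry pieces can start out as plain structured logging; a sketch of the hypothetical endpoint above with logs added (alerting would then watch this log or the error rate):

```python
# Variant of the /predict endpoint that logs telemetry (events) and monitoring (errors).
import logging

import joblib
from flask import Flask, jsonify, request

logging.basicConfig(filename="api.log", level=logging.INFO)
logger = logging.getLogger("churn_api")

app = Flask(__name__)
model = joblib.load("churn_model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    try:
        features = [[payload["tenure_months"], payload["monthly_spend"]]]
        prediction = int(model.predict(features)[0])
        logger.info("request=%s response=%s", payload, prediction)      # telemetry: events
        return jsonify({"churn_prediction": prediction})
    except Exception:
        logger.exception("prediction failed for payload=%s", payload)   # monitoring: errors
        return jsonify({"error": "prediction failed"}), 500
```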

Make sure to retrain the model, since the data it was trained on will become old. I could either repeat the steps I’ve followed manually OR automate retraining (which I still have to keep an eye on). Both methods require a standard schedule, e.g. retraining every $n$ weeks or months. This can be automated in AWS SageMaker.
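
A bare-bones sketch of the scheduling idea, outside of any managed service (the `retrain` module is hypothetical; in practice a cron job or a SageMaker schedule would trigger it):

```python
# Trigger retraining on a fixed schedule (every n weeks).
import time
from datetime import timedelta

from retrain import retrain   # hypothetical module that refits and re-saves the model

RETRAIN_EVERY = timedelta(weeks=4).total_seconds()

while True:
    retrain()                  # refit on the latest data and overwrite the model artifact
    time.sleep(RETRAIN_EVERY)  # in practice a scheduler (cron, SageMaker) does this instead
```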

Working with Stakeholders

  1. Business
    • if the facts of the data science are called into question, help them understand what you did and how you did it
  2. Engineering:
    • take extra care to communicate process so engineers will be less surprised (can’t handle the ambiguities of DS)
  3. Corporate Leadership:
    • can gain leverage for the DS team
  4. My Manager:
    • I can be vulnerable and open up about work troubles

Make sure to understand their goals (“what’s important to you?”), communicate constantly, be consistent (standardise your work), and build a relationship.

Why Data Science Projects Fail

  1. Data isn’t what I wanted
  2. Data doesn’t have a signal (no relationship in the data to make a prediction - if simple models can’t find a signal, complex ones won’t either; a quick check is sketched after this list)
  3. Customer didn’t end up wanting it
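
A quick way to check point 2 is to compare a simple model against a no-information baseline; a sketch with scikit-learn (dataset and columns are hypothetical):

```python
# Signal check: does a simple model beat a no-information baseline at all?
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("training_data.csv")                 # hypothetical dataset
X = df[["tenure_months", "monthly_spend"]]
y = df["churned"]

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
simple = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# If the simple model barely beats the baseline, the data may have no usable signal.
print(f"baseline accuracy: {baseline:.3f}, simple model accuracy: {simple:.3f}")
```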

Make sure to manage risk through multiple projects and early stopping points.

When a project fails, make sure to document lessons learned, consider pivoting, allow myself to fail, and see data scientists as treasure hunters, not architects.

Joining the Data Science Community

Grow your portfolio through blog posts and projects, attend conferences (I will most likely require company support - sick leave and budget), give talks (look for calls for proposals, or CFPs), and contribute to open source (open source sprints are like hackathons for open source).

Leaving your Job Gracefully

Leave when there is nothing more to learn. Having a job on standby before you leave is the safer option, but it means you don’t get a break between jobs and have to start immediately. Being able to do this requires a data science network. Make sure to be selective about where you’re applying, and don’t give notice by email.

Moving up the Ladder

There are three main pathways:

tags: career