Build a Career in Data Science
by Arpon Sarker
Introduction
It is just under two weeks before I start my first role as a Junior Data Scientist. I am incredibly grateful for this rare opportunity, though I was a little disheartened at not being considered for graduate programs (admittedly, I was late with my applications). This book - Build a Career in Data Science - is particularly helpful, and every page is relevant to this stage of my life as I start out in the data science profession. It covers acquiring the skills and knowledge, applying and going through the interview process, starting out in data science as your first job, and how to progress in it or how to leave.
What is Data Science? It is the practice of using data to try to solve and understand real-world problems. It combines mathematics/statistics, databases/programming (R/Python for analysis, SQL for databases, and Git for version control) and business understanding.
Types of DS Jobs
There are 3 different types of data science jobs:
- Analytics (Analyst): takes data, formats and arranges it and delivers it to others; takes a lot of data preparation but less interpretation; creates dashboards and reports
- ML (ML Engineer): develops ML models and puts them into production where they run continuously; less visualisations; all work output is for machine consumption (e.g. APIs)
- Decision Science (Decision Scientist): turns the company’s raw data into information to help make decisions; plenty of programming - code only needs to be run once for an analysis, so it is allowed to be inefficient and difficult to maintain; creates analyses that produce recommendations; A/B testing
Other roles include BI analyst (similar to an analyst but uses less statistics and programming), data engineer (maintains data in databases and ensures that people get the data they need - no reports, analyses or models), and research scientist.
Data Science Companies
- Massive Tech Company
- Established Retailer
- Early-stage Startup
- Late Stage Successful Tech Startup
- Giant Government Contractor
TIP: make friends with people with domain knowledge to know your data.
Building a Portfolio
Your repository could include an analysis, a model, an explanation of a statistical technique, or a tutorial. Projects typically follow one of two paths:
- dataset -> question -> analysis -> repo/blog
- question -> dataset -> analysis -> repo/blog
Other projects:
- Kaggle datasets
- datasets in news (fivethirtyeight)
- APIs
- Government open data
- Web Scraping (check robots.txt file to see what is allowed, make sure to build time in between requests)
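Both scraping tips can be sketched with just the standard library. This is a hedged illustration, not a full scraper: the site, paths, and robots.txt rules are made up, and the robots.txt lines are inlined rather than downloaded.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical target site -- swap in the site you actually want to scrape.
BASE_URL = "https://example.com"

# Parse the site's robots.txt to see which paths may be crawled.
robots = RobotFileParser()
robots.parse([
    # Inlined for illustration; normally you would call
    # robots.set_url(BASE_URL + "/robots.txt") followed by robots.read().
    "User-agent: *",
    "Disallow: /private/",
])

pages = ["/data/page1.html", "/private/secret.html"]
for path in pages:
    if robots.can_fetch("*", BASE_URL + path):
        print("OK to fetch:", path)
        # ...download the page here...
        time.sleep(2)  # build time in between requests, per the tip above
    else:
        print("Disallowed by robots.txt:", path)
```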
The Search
Use meetups to find people and try and ask for informational interviews to learn more about the role, industry or company.
First Months on the Job
- Fight instinct to get as much work done as possible
- Ask questions to complete tasks the right way
- Learn rhythm of peers OR Build my own processes
- Meet with my manager to discuss priorities (should I provide analyses to stakeholders or build high-performing models?)
- Models are judged on usefulness, level of insight and maintainability, not just high performance
- Regular meetings with direct supervisors to see if I’m meeting expectations
- Should I drop my work to help a colleague or focus on deliverables? Is it okay to ignore Slack messages to finish project?
- Figure out who determines result of performance reviews
- peers or manager?
- come up with a matrix of areas I’m evaluated on
- review after first 3 months
- Start reading reports employees have written and see how complex the reports are
- Find where data lives and get access to it
- What is the table and data system?
- Read documentation on the table and summary statistics
- Talk to an expert on the table - data scientist OR collector of data
- Learn how data got to me
- Write down “gotchas” in data and make a map of where everything lives
- You don’t have to prove yourself so quickly
Make sure you ask better questions:
- Learn from observations about the question culture
- Show that I’ve been proactive
- Find experts and be thoughtful about their time
- Make a list
Make sure to build relationships:
- Ask manager for a list of people I should get to know
- introduce myself to skip-level boss
- befriend office managers
- Find a mentor or a sponsor (someone who gives people opportunities) and make sure to update them on how I’ve followed their advice
Making an Effective Analysis
A good analysis:
- answers the question
- made quickly (approx. 1 month)
- can be shared (PPT not R/Python)
- self-contained
- can be revisited
For machine learning engineers, analyses share how well models perform and show the value in building a new model or how models change over time.
Analysis Process:
- The Request
Convert the business question into a data science question, then convert the data science answer back into a business answer. Foundational questions include: ‘who’s requesting the analysis?’, ‘what is the motive?’, ‘what is the request?’, ‘what decision is being made?’, ‘do we have the required data?’
- The Analysis Plan
Before looking at the data, write down everything I plan to do with the data. Make it actionable. Make sure to structure the analysis plan to be able to reuse code in different sections.
Template:
- TOP: Title, who I am (if shared), objective
- SECTIONS: general topic in the analysis
- self-contained
- FIRST LEVEL OF SECTION LISTS: each question that was posed
- SECOND LEVEL OF SECTION LISTS: actual tasks that can be checked off
Ask for approval by the manager/stakeholder.
- Doing the Analysis
- Importing and cleaning the data does not look productive to nontechnical people; get to data exploration quickly; spend as little time as possible on anything that won’t be needed and vice versa; talk to stakeholders if data is weird
- For data exploration and modeling, use general summarisation and transformation for simple analysis work
- Visualise data or create summary tables (use Git and save old code since you can go back and forth between visualisations or summarisations)
- Create a model as needed and isolate the code from general analysis code
Repeat these points for each point on the analysis plan.
Continuously maintain level of polish to be able to share progress with stakeholder.
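For the summarisation step, a quick summary table in pandas might look like the sketch below. The data is invented for illustration; a real analysis would pull from the company's tables.

```python
import pandas as pd

# Toy data standing in for whatever the stakeholder's table contains.
df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "revenue": [120.0, 95.0, 80.0, 110.0, 60.0],
})

# A quick summary table: one row per region with count, mean, and total.
summary = (
    df.groupby("region")["revenue"]
      .agg(["count", "mean", "sum"])
      .reset_index()
)
print(summary)
```

Keeping small, reusable snippets like this in their own section of the analysis plan makes it easy to rerun them as the questions change.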
- Wrapping it Up
- Use a narrative for the final presentation
- Mothballing your work:
- double-check if I can rerun whole analysis (should take 1 click)
- comment code
- README file
- store code securely (Github)
- ensure data stored safely (S3)
- output stored in shared location
TIP: Use the company colour palette as the theme
Deploying a Model into Production
“Deploying to production” means putting code on a system that allows it to run continuously. Production machine learning models work in near real time to make predictions or classify something based on provided data. These models should handle weird edge cases without crashing the environment and should be maintainable (retrainable on newer data, with performance that can be monitored).
- Building ML Model
Use the same steps as for an analysis. The model will be converted to a format that other programs can use (e.g. an API). It should be deployed into a test environment first. Consider what data will be needed in real time when the model runs, and understand that model performance translates directly into business margins.
- How to Deploy in Production?
Use REST APIs, which are tiny websites that return data instead of HTML (over HTTP). In Python you can set this up with Flask (in R, with Plumber). To pass information to the model, you can put it in the URL (e.g. a unique ID) or include it in the body of the request.
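A minimal Flask sketch of such an API, assuming a toy stand-in model and a hypothetical `/predict` endpoint that reads one feature from the request body:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stand-in for a trained model; in a real service this would be loaded
# from disk (e.g. a serialised scikit-learn model).
def predict(features):
    return 2.0 * features.get("x", 0.0) + 1.0

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json(silent=True)
    # Guard against malformed input so one bad request can't crash the service.
    if payload is None or "x" not in payload:
        return jsonify({"error": "missing field 'x'"}), 400
    return jsonify({"prediction": predict(payload)})

if __name__ == "__main__":
    app.run(port=5000)
```

Other programs would then POST JSON like `{"x": 2.0}` to `/predict` and receive the prediction back as JSON.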
Make sure to create documentation on the API’s design, such as the endpoint URLs, what needs to be included in the request, the format and content of the response, why it was created, and the requirements the API needs to run and how to install it elsewhere. This can be formatted as an OpenAPI document.
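As a hedged sketch, a stripped-down OpenAPI 3 document for a hypothetical single-endpoint model API might look like this (the title, endpoint and fields are all assumptions):

```yaml
openapi: "3.0.3"
info:
  title: Churn model API          # hypothetical model
  version: "1.0.0"
  description: Why the API exists and who maintains it.
paths:
  /predict:
    post:
      summary: Score one record.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                x: {type: number}
      responses:
        "200":
          description: Model prediction.
          content:
            application/json:
              schema:
                type: object
                properties:
                  prediction: {type: number}
```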
Make sure to conduct unit testing: test each endpoint, and individually test functions under different conditions.
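A sketch of what "different conditions" means in practice, using a hypothetical preprocessing helper and pytest-style test functions (the helper and its behaviour are assumptions, not from the book):

```python
def clean_amount(raw):
    """Parse a raw amount string into a float; None for unusable input."""
    if raw is None:
        return None
    try:
        return float(str(raw).replace("$", "").replace(",", ""))
    except ValueError:
        return None

# One test per condition: the happy path, messy-but-valid input,
# garbage input, and missing input.
def test_plain_number():
    assert clean_amount("42.5") == 42.5

def test_currency_formatting():
    assert clean_amount("$1,200") == 1200.0

def test_garbage_input():
    assert clean_amount("n/a") is None

def test_missing_value():
    assert clean_amount(None) is None
```

Running `pytest` in the project directory would discover and execute these automatically.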
Now we need to move the code to a server to continuously run (which could be on the Cloud). There are 2 ways to do this:
- Virtual Machines:
- Install R/Python, install libraries, copy code to it and then run
- Docker:
- Do the same as virtual machines
- Why Docker? VMs take a lot of space (they have to contain everything a regular computer has) and are annoying to set up, difficult to document, and hard to replicate. Docker specifies exactly how the machine is set up, and that shared specification across distinct machines means setups are reproducible and resources can be shared.
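As a hedged sketch, a minimal Dockerfile for a small Python model API could look like this (the file names `requirements.txt` and `app.py` are assumptions about the project layout):

```dockerfile
# Minimal image for a Python model API; file names are assumptions.
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the API code and start the server.
COPY app.py .
CMD ["python", "app.py"]
```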
Having to move code manually is error-prone; the solution is CI/CD tools.
Continuous Integration (CI) means automatically rebuilding the code (compilation isn’t necessary for Python) every time it is committed to a repository, including running the unit tests. Continuous Deployment (CD) takes the output of the CI tool and automatically deploys it to the production system.
Check repo for changes -> Run build process (unit tests) -> move resulting code to VM
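That pipeline could be sketched as a GitHub Actions workflow (one CI/CD option among many; the job names and the deploy command are placeholders, not a real setup):

```yaml
# Hypothetical workflow: on every push, run the unit tests;
# on main only, hand off to a deploy step.
name: ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest
  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy step goes here (e.g. push the image to the server)"
```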
Make sure to conduct a load test: run at least twice as many API requests as I expect to see and check that the system still works.
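A toy sketch of that load test using a thread pool. Here `call_api` is a stand-in; a real test would replace it with actual HTTP requests to the deployed endpoint, and the expected peak volume is a made-up number.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one API call; a real load test would POST to the
# deployed endpoint (e.g. with the requests library) instead.
def call_api(i):
    time.sleep(0.01)            # simulate network + model latency
    return 200                  # simulate an HTTP 200 response

EXPECTED_PEAK = 50              # requests expected at peak (assumed)
N_REQUESTS = 2 * EXPECTED_PEAK  # test at twice the expected volume

start = time.time()
with ThreadPoolExecutor(max_workers=20) as pool:
    statuses = list(pool.map(call_api, range(N_REQUESTS)))

ok = statuses.count(200)
print(f"{ok}/{N_REQUESTS} succeeded in {time.time() - start:.2f}s")
```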
- Keeping System Running
The DevOps team makes sure the API is always running. This is done through monitoring (recording internal issues such as errors), telemetry (recording events such as requests and responses), and alerting (automatic emails or Slack messages).
Make sure to retrain the model, since the data it was trained on will become stale. I could either repeat the steps I followed originally or automate retraining (which I still have to keep an eye on manually). Both methods require a standard schedule, e.g. every n weeks or months. This can be automated in AWS SageMaker.
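The "keep an eye on it" part can be a very small check. A sketch with made-up numbers: compare the model's recent accuracy to its accuracy at deployment and flag a retrain when it drifts past a tolerance.

```python
# Baseline and tolerance are illustrative assumptions, not real figures.
BASELINE_ACCURACY = 0.90   # accuracy measured when the model was deployed
DRIFT_TOLERANCE = 0.05     # degradation accepted before retraining

def needs_retraining(recent_accuracy):
    """True when recent performance has drifted below the tolerated floor."""
    return recent_accuracy < BASELINE_ACCURACY - DRIFT_TOLERANCE

print(needs_retraining(0.88))  # still within tolerance
print(needs_retraining(0.80))  # degraded: schedule a retrain
```

Run on the schedule above, this turns "retrain every n weeks" into "retrain every n weeks, or sooner if the check fires".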
Working with Stakeholders
- Business
- if facts of data science called into question, help them understand what and how you did it
- Engineering:
- take extra care to communicate your process so engineers are less surprised (they are often less comfortable with the ambiguity inherent in DS work)
- Corporate Leadership:
- can gain leverage for the DS team
- My Manager:
- I can be vulnerable and open up about work troubles
Make sure to understand their goals (“what’s important to you?”), communicate constantly, be consistent (standardise your work), and build a relationship.
Why Data Science Projects Fail
- Data isn’t what I wanted
- Data doesn’t have a signal (no relationship in the data to make a prediction - if simple models can’t find a signal, complex ones won’t either)
- Customer didn’t end up wanting it
Make sure to manage risk through multiple projects and early stopping points.
When a project fails, make sure to document lessons learned, consider pivoting, allow myself to fail, and look at data scientists as treasure hunters, not architects.
Joining the Data Science Community
Grow your portfolio through blog posts and projects, attend conferences (I will most likely require company support - time off and budget), give talks (look for calls for proposals (CFPs)), and contribute to open source (open source sprints are hackathons for open source projects).
Leaving your Job Gracefully
Leave when there is nothing more to learn. Having a job on standby before you leave is safest, but it means no break between jobs - you will have to start immediately. Pulling this off requires a data science network. Be selective about where you apply, and don’t give notice by email.
Moving up the Ladder
There are three main pathways:
- management (no DS work)
- principal data scientist (technical lead)
- independent consulting (risky but allows freedom)