Here is a list of handpicked technical interview questions with progressive difficulty, for screening data scientist / data analyst candidates. They cover all fundamental skills of a data scientist. Each question is annotated with the difficulty level, its purpose and the expectations for the answer.
I have used these questions in past interviews with considerable success. Feel free to use my collection to build your own custom list of interview questions, by following this guide.
The list here does not include any programming-related questions, as there is usually a separate coding test to evaluate the candidate’s coding skills (typically Python / R coding and SQL-related questions). I find that most companies don’t require data scientists to be well-versed in software engineering fundamentals. They just need them to have enough programming skill to start working with their data.
Most organizations hire a data scientist either
- to analyze the data produced in the organization,
- to develop some kind of predictive model to meet business needs, e.g. increase sign-ups, reduce churn, etc.,
- to assist with the development of new product features that employ data science, e.g. create an algorithm for a recommendation engine.
A data scientist usually works closely with the business domain and solves business problems. It is therefore advisable to include one or two domain-specific questions in the interview, to test the candidate’s fact-finding, critical thinking and communication skills.
A car insurance company is hiring a data scientist. They may set the following question:
Suppose every car is fitted with a speed tracker so that its driving state can be tracked. What kinds of business questions can be answered with the data?
A job portal is developing a recommendation engine:
How will you design a job recommendation algorithm for our website?
As you can see, these questions are open-ended. The candidates will very likely not know the correct answer, having little domain-specific knowledge. The correctness of the answer is not important here. Do observe whether the candidate can ask the right questions, get the relevant facts from the interviewer, suggest some ideas to solve the problem and argue in favor of what he thinks is the best approach.
Some companies prefer to ask Fermi questions (questions that seek a rough estimate of something that is difficult to measure directly) for this purpose. I am not a fan of Fermi questions because you might end up with a candidate who is quick-witted but not systematic in his problem-solving approach.
How many Big Macs does McDonald’s sell every year in the US?
Machine learning is one of the most sought-after skills right now. A data scientist’s career usually involves using machine learning to solve problems.
- (entry-level) How many unsupervised/supervised learning algorithms do you know? When do you use each algorithm?
- How will you compare the results of various machine learning algorithms?
- There are many machine learning algorithms out there. Some are more suitable than others for certain learning tasks. More senior candidates will know more.
- In the first question, expect the candidate to discuss when he chooses to use a certain algorithm, depending on the input data size, the variable to predict, etc.
- In the second question, expect the candidate to explain the techniques for comparing the performance of trained models.
- (intermediate) What are the problems related to overfitting and underfitting? How will you deal with these?
- What is bias-variance trade-off?
- Why would one split the data into training/validation/test set in machine learning?
- If a candidate does not know these concepts, he will run into the problems of over-fitting and under-fitting, or create a model that has poor generalization power over unknown data.
- Expect the candidate to have a firm grasp of these concepts, and be able to discuss various relevant techniques such as cross-validation, bagging, etc.
- What is multicollinearity? How do you deal with it?
- A data scientist should also know how to recognize parameter instability in a model and resolve any data issues that may have caused it.
- Take the discussion into feature extraction and dimensionality reduction to see how familiar the candidate is with both techniques, which are important in ML.
- (advanced) Explain <specific feature, technique> in <specific machine learning algorithm>.
- If the job requires an understanding of a specific field of machine learning, such as deep learning, it is worthwhile to ask the candidate further about his/her experience in that field.
- e.g. What is the purpose of the ReLU layer in a convolutional neural network? How do you choose a learning rate in gradient descent?
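For interviewers who want a concrete reference point for the model-comparison and train/validation questions above, here is a minimal sketch of k-fold cross-validation in plain Python. The synthetic data, the two toy models (`fit_mean`, `fit_line`) and the helper `k_fold_mse` are all made up for illustration; in practice a candidate would more likely reach for an existing library.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares line y = a + b*x for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def k_fold_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # same seed -> same folds for every model
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mean((model(xs[i]) - ys[i]) ** 2 for i in held_out))
    return mean(scores)


# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

baseline_mse = k_fold_mse(fit_mean, xs, ys)
line_mse = k_fold_mse(fit_line, xs, ys)
print(f"baseline CV MSE: {baseline_mse:.2f}, line CV MSE: {line_mse:.2f}")
```

Both models are scored on the same folds, so the comparison is like for like. A strong candidate should be able to explain why the held-out error, rather than the training error, is the number to compare.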
Statistics and Probabilities
Statistics is another essential skill for data scientists. It is used to make simple inferences about the underlying process from the data. It is also used in A/B testing and machine learning, to test the statistical significance of a result.
- (entry-level) What is the best strategy for playing Russian roulette?
- Many probability questions are set in data scientist interviews at the top tech companies. The purpose of these questions is to see whether the candidate knows basic probability calculations and can reason quickly. However, I do not favor most of them. I find them too difficult to solve in an interview setting, given the time limitation.
- This question does the job, is not too difficult and is interesting enough to serve as an ice-breaker.
- Expect the candidate to work out the probabilities for a two-player game, with no re-spinning of the chambers between turns.
- You are about to get on a plane to Seattle. You want to know whether you have to bring an umbrella or not. You call three random friends in Seattle and ask each one of them if it’s raining. The probability that a friend is telling the truth is 2/3 and the probability that they are playing a prank on you by lying is 1/3. If all 3 of them tell you that it is raining, what is the probability that it is actually raining in Seattle? (Assume that if it rains in Seattle, it rains everywhere in Seattle.)
- An alternative probability question, from Facebook.
- Expect the candidate to give a quick inference from the information.
- A more experienced candidate will apply Bayes’ theorem to get an answer that depends on the prior probability of rain in Seattle. He/she may need a whiteboard to write out the answer.
- (intermediate) How will you test a new website feature to determine whether the change increases the conversion rate?
- This question is about the application of statistical hypothesis testing in A/B testing.
- The actual setup of the A/B test is not important (the engineers will take care of that). Expect the candidate to discuss suitable test statistics and the steps of hypothesis testing.
- (advanced) How will you test a new website feature to determine whether the change reduces the churn rate?
- An alternative form of the above question with a slight twist: a user is considered to have churned if his account remains inactive for a given time period.
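Two of the questions above reduce to short calculations, sketched here in Python as a reference for the interviewer. The first function applies Bayes’ theorem to the Seattle question, assuming the friends answer independently. The second is a pooled two-proportion z-test, which is one common choice of test statistic for a conversion-rate A/B test, not the only one. The function names and the visitor counts in the example are hypothetical.

```python
import math


def posterior_rain(prior, p_truth=2 / 3, n_yes=3):
    """P(rain | n_yes independent friends all say it is raining)."""
    like_rain = p_truth ** n_yes          # it rains and every friend tells the truth
    like_dry = (1 - p_truth) ** n_yes     # it is dry and every friend lies
    return prior * like_rain / (prior * like_rain + (1 - prior) * like_dry)


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided pooled z-test of H0: rate_b <= rate_a against H1: rate_b > rate_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal probability
    return z, p_value


# The answer to the Seattle question moves with the prior probability of rain.
for prior in (0.1, 0.25, 0.5):
    print(f"prior={prior:.2f} -> posterior={posterior_rain(prior):.3f}")

# Hypothetical A/B counts: 200 of 2000 control visitors convert, 250 of 2000 variant visitors.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z={z:.2f}, one-sided p-value={p_value:.4f}")
```

With a prior of 1/2 the posterior works out to 8/9, which is the quick answer many candidates give. The same z-test can be pointed at the churn question by counting churned users and swapping the roles of the two groups, so that the alternative hypothesis is a lower churn rate for the variant.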
- (entry-level) What publicly available datasets have you used in the past to improve your model?
- This question is to see whether the candidate takes the initiative to find and play with public data, either to solve the problem at hand or to learn new skills.
- Expect the candidate to mention a few well-known public repositories, or some datasets he/she has played with while learning.
- How will you treat missing information in a dataset?
- It is very common to run into data issues in data modeling. Data scientists actually spend a great deal of their time cleaning their data.
- Expect the candidate to know the various practices of treating missing data.
- (intermediate) When do you reduce/not reduce the dimensionality of a dataset? How?
- Another question about preparing the data for modeling.
- Expect the candidate to know how to remove noisy, faulty data columns to improve model performance.
- If the job is in computer vision or image recognition, expect the candidate to discuss Principal Component Analysis.
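As a reference for the missing-data question above, here is a minimal sketch of two common treatments in plain Python: dropping incomplete rows and imputing the column median. The toy rows and the helper names are hypothetical, and the sketch assumes every column has at least one observed value.

```python
from statistics import median


def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing values."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_median(rows):
    """Replace each missing value with the median of the observed values in its column."""
    n_cols = len(rows[0])
    medians = [
        median(row[j] for row in rows if row[j] is not None)
        for j in range(n_cols)
    ]
    return [
        [medians[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy data: columns are (age, income); None marks a missing value.
rows = [
    [25, 40000],
    [32, None],
    [None, 52000],
    [41, 61000],
]

print(drop_incomplete(rows))   # half of the rows are lost
print(impute_median(rows))     # all rows kept, gaps filled with column medians
```

A good answer goes beyond the mechanics: the candidate should discuss how much data deletion throws away, and when a simple fill such as the median can distort the relationships between columns.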
Data scientists often need to work with the existing data infrastructure of the organization, such as Hadoop / Spark or NoSQL databases like Cassandra and Neo4j. If this is a requirement, do include one or two questions related to the database technology. In many organizations, however, the responsibility of implementing (distributed) algorithms on the data infrastructure is taken over by a specialized backend engineering role, the data engineer.
Others: Visualization, Natural Language Processing, Advanced Maths
These are more specialized fields in data science. I will not cover them extensively here. Research-based data scientist positions require knowledge of advanced mathematics such as Linear Algebra (matrices and vectors) and Multi-variable Calculus (e.g. used in the research of custom machine learning models). If the job description lists these as requirements, do add some questions to assess the candidate’s knowledge in these areas.
- Name and describe any data visualizations you know apart from the standard chart types you can find in Excel.
- Describe the various steps in an NLP pipeline.