Databricks Certified Machine Learning Associate
The Databricks Certified Machine Learning Associate certification exam assesses an individual’s ability to use Databricks to perform basic machine learning tasks. This includes an ability to understand and use Databricks Machine Learning and its capabilities like AutoML, Feature Store, and select capabilities of MLflow. It also assesses the ability to make correct decisions in machine learning workflows and implement those workflows using Spark ML. Finally, an ability to understand advanced characteristics of scaling machine learning models is assessed. Individuals who pass this certification exam can be expected to complete basic machine learning tasks using Databricks and its associated tools.
This exam covers:
Databricks Machine Learning – 29%
ML Workflows – 29%
Spark ML – 33%
Scaling ML Models – 9%
Assessment Details
Type: Proctored certification
Total number of questions: 45
Time limit: 90 minutes
Registration fee: $
Question types: Multiple choice
Test aids: None allowed
Languages: English
Delivery method: Online proctored
Prerequisites: None, but related training highly recommended
Recommended experience: 6+ months of hands-on experience performing the machine learning tasks outlined in the exam guide
Validity period: 2 years
Recertification:
Recertification is required every two years to maintain your certified status. To recertify, you must take the current version of the exam. Please review the “Getting Ready for the Exam” section below to prepare for your recertification exam.
Unscored content: Exams may include unscored items to gather statistical information for future use. These items are not identified on the form and do not impact your score. Additional time is factored into the exam to account for this content.
Getting Ready for the Exam
Review the Machine Learning Associate Exam Guide to understand what will be on the exam
Take the related training
Register for the exam
Review the technical requirements and run a system check
Review the exam guide again to identify any gaps
Study to fill in the gaps
Take your exam!
All machine learning code within this exam will be in Python. For workflows or code not specific to machine learning tasks, data manipulation code may be provided in SQL.
Exam outline
Section 1: Databricks Machine Learning
Databricks ML
* Identify when a standard cluster is preferred over a single-node cluster and vice versa
* Connect a repo from an external Git provider to Databricks repos.
* Commit changes from a Databricks Repo to an external Git provider.
* Create a new branch and commit changes to an external Git provider.
* Pull changes from an external Git provider back to a Databricks workspace.
* Orchestrate multi-task ML workflows using Databricks jobs.
Databricks Runtime for Machine Learning
* Create a cluster with the Databricks Runtime for Machine Learning.
* Install a Python library to be available to all notebooks that run on a cluster.
AutoML
* Identify the steps of the machine learning workflow completed by AutoML.
* Identify how to locate the source code for the best model produced by AutoML.
* Identify which evaluation metrics AutoML can use for regression problems.
* Identify the key attributes of the data set using the AutoML data exploration notebook.
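For orientation, here is a minimal sketch of the AutoML Python API; the DataFrame train_df and the target column price are assumptions for illustration:

    from databricks import automl

    # Launch an AutoML regression experiment on a hypothetical DataFrame.
    summary = automl.regress(
        dataset=train_df,
        target_col="price",
        primary_metric="rmse",   # regression metrics include rmse, mse, r2, mae
        timeout_minutes=30,
    )

    # The best trial links back to the generated notebook with its source code.
    print(summary.best_trial.notebook_url)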
Feature Store
* Describe the benefits of using Feature Store to store and access features for machine learning pipelines.
* Create a feature store table.
* Write data to a feature store table.
* Train a model with features from a feature store table.
* Score a model using features from a feature store table.
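A compact sketch of the Feature Store client workflow covering the tasks above; the table, column, and model names are hypothetical, and score_batch assumes the model was logged with fs.log_model:

    from databricks.feature_store import FeatureStoreClient, FeatureLookup

    fs = FeatureStoreClient()

    # Create a feature table keyed on customer_id (names are illustrative).
    fs.create_table(
        name="ml.features.customer_features",
        primary_keys=["customer_id"],
        df=features_df,
        description="Aggregated customer features",
    )

    # Append or upsert new feature values.
    fs.write_table(name="ml.features.customer_features",
                   df=new_features_df, mode="merge")

    # Build a training set by joining labels to stored features.
    training_set = fs.create_training_set(
        df=labels_df,
        feature_lookups=[FeatureLookup(table_name="ml.features.customer_features",
                                       lookup_key="customer_id")],
        label="churn",
    )
    train_df = training_set.load_df()   # train a model on this DataFrame

    # Score a batch of records using the same stored features.
    predictions = fs.score_batch("models:/churn_model/1", batch_df)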
Managed MLflow
* Identify the best run using the MLflow Client API.
* Manually log metrics, artifacts, and models in an MLflow Run.
* Create a nested Run for deeper Tracking organization.
* Locate the time a run was executed in the MLflow UI.
* Locate the code that was executed with a run in the MLflow UI.
* Register a model using the MLflow Client API.
* Transition a model’s stage using the Model Registry UI page.
* Transition a model’s stage using the MLflow Client API.
* Request to transition a model's stage using the Model Registry UI page.
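A hedged sketch tying the MLflow tasks above together; the metric name, artifact file, and model names are illustrative, and the model object is assumed to be already trained:

    import mlflow
    from mlflow.tracking import MlflowClient

    # Manually log a metric, an artifact, and a model, with a nested child run.
    with mlflow.start_run(run_name="parent") as run:
        mlflow.log_metric("rmse", 0.84)
        mlflow.log_artifact("feature_importance.png")   # hypothetical local file
        mlflow.sklearn.log_model(model, "model")        # 'model' assumed trained
        with mlflow.start_run(run_name="child", nested=True):
            mlflow.log_metric("rmse", 0.79)

    # Find the best run in the experiment and register its model.
    client = MlflowClient()
    best = client.search_runs(
        experiment_ids=[run.info.experiment_id],
        order_by=["metrics.rmse ASC"],
        max_results=1,
    )[0]
    version = mlflow.register_model(f"runs:/{best.info.run_id}/model", "demo_model")

    # Transition the registered version's stage programmatically.
    client.transition_model_version_stage(
        name="demo_model", version=version.version, stage="Staging"
    )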
Section 2: ML Workflows
Exploratory Data Analysis
* Compute summary statistics on a Spark DataFrame using .summary()
* Compute summary statistics on a Spark DataFrame using dbutils data summaries.
* Remove outliers from a Spark DataFrame that are above or below a designated threshold.
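A short sketch of these EDA tasks, assuming a numeric column named price; note that dbutils is available only inside a Databricks notebook:

    from pyspark.sql.functions import col

    # Summary statistics for all numeric columns.
    spark_df.summary().show()          # count, mean, stddev, min, quartiles, max
    dbutils.data.summarize(spark_df)   # interactive data profile in a notebook

    # Keep only rows where "price" falls within a designated threshold.
    filtered_df = spark_df.filter((col("price") >= 0) & (col("price") <= 10_000))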
Feature Engineering
* Identify why it is important to add indicator variables for missing values that have been imputed or replaced.
* Describe when replacing missing values with the mode value is an appropriate way to handle missing values.
* Compare and contrast imputing missing values with the mean value or median value.
* Impute missing values with the mean or median value.
* Describe the process of one-hot encoding categorical features.
* Describe why one-hot encoding categorical features can be inefficient for tree-based models.
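A minimal sketch of these patterns in Spark ML; the column names income and state are assumptions:

    from pyspark.sql.functions import col, when
    from pyspark.ml.feature import Imputer, StringIndexer, OneHotEncoder

    # Indicator variable flagging rows whose "income" was missing before imputation.
    df = raw_df.withColumn("income_missing",
                           when(col("income").isNull(), 1).otherwise(0))

    # Median imputation is more robust to extreme outliers than the mean.
    imputer = Imputer(inputCols=["income"], outputCols=["income_imputed"],
                      strategy="median")
    df = imputer.fit(df).transform(df)

    # One-hot encoding: index the string column first, then encode the index.
    indexer = StringIndexer(inputCol="state", outputCol="state_idx")
    encoder = OneHotEncoder(inputCols=["state_idx"], outputCols=["state_ohe"])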
Training
* Perform random search as a method for tuning hyperparameters.
* Describe the basics of Bayesian methods for tuning hyperparameters.
* Describe why parallelizing sequential/iterative models can be difficult.
* Understand the balance between compute resources and parallelization.
* Parallelize the tuning of hyperparameters using Hyperopt and SparkTrials.
* Identify the usage of SparkTrials as the tool that enables parallelization for tuning single-node models.
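A sketch of parallelized tuning of a single-node scikit-learn model with Hyperopt and SparkTrials; X and y are assumed to be in scope:

    from hyperopt import fmin, tpe, hp, SparkTrials
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    def objective(params):
        model = RandomForestRegressor(max_depth=int(params["max_depth"]))
        # Hyperopt minimizes the returned loss, so negate the R^2 score.
        return -cross_val_score(model, X, y, cv=3).mean()

    best = fmin(
        fn=objective,
        space={"max_depth": hp.quniform("max_depth", 2, 10, 1)},
        algo=tpe.suggest,             # Bayesian (TPE) search over the space
        max_evals=16,
        trials=SparkTrials(parallelism=4),  # 4 trials run concurrently on the cluster
    )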
Evaluation and Selection
* Describe cross-validation and the benefits and downsides of using cross-validation over a train-validation split.
* Perform cross-validation as a part of model fitting.
* Identify the number of models being trained in conjunction with a grid-search and cross-validation process.
* Describe Recall and F1 as evaluation metrics.
* Identify the need to exponentiate the RMSE when the log of the label variable is used.
* Identify that the RMSE has not been exponentiated when the log of the label variable is used.
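A sketch of grid search with cross-validation in Spark ML; with a 2 x 2 grid and 3 folds, 12 models are trained before the final refit. The label log_price is assumed to be the log-transformed target, so predictions on the log scale must be exponentiated back before reporting error in the original units:

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import LinearRegression

    lr = LinearRegression(featuresCol="features", labelCol="log_price")
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1])
            .addGrid(lr.elasticNetParam, [0.0, 0.5])
            .build())

    # 4 parameter combinations x 3 folds = 12 models, plus a final refit.
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=RegressionEvaluator(labelCol="log_price", metricName="rmse"),
        numFolds=3,
    )
    cv_model = cv.fit(train_df)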
Section 3: Spark ML
Distributed ML Concepts
* Describe some of the difficulties associated with distributing machine learning models.
* Identify Spark ML as a key library for distributing traditional machine learning work.
* Identify scikit-learn as a single-node solution relative to Spark ML.
Spark ML Modeling APIs
* Split data using Spark ML.
* Identify key gotchas when splitting distributed data using Spark ML.
* Train / evaluate a machine learning model using Spark ML.
* Describe Spark ML estimator and Spark ML transformer.
* Develop a Pipeline using Spark ML.
* Identify key gotchas when developing a Spark ML Pipeline.
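A minimal end-to-end sketch of these APIs; column names are assumptions, and the comments flag the usual gotchas:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.evaluation import RegressionEvaluator

    # Gotcha: randomSplit is only reproducible with a fixed seed and stable
    # partitioning of the underlying distributed data.
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

    assembler = VectorAssembler(inputCols=["bedrooms", "bathrooms"],
                                outputCol="features")       # transformer
    rf = RandomForestRegressor(featuresCol="features",
                               labelCol="price")            # estimator

    # Gotcha: fit the Pipeline on training data only, then transform the test
    # set, so no test information leaks into the fitted stages.
    pipeline_model = Pipeline(stages=[assembler, rf]).fit(train_df)
    preds = pipeline_model.transform(test_df)
    rmse = RegressionEvaluator(labelCol="price", metricName="rmse").evaluate(preds)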
Hyperopt
* Identify Hyperopt as a solution for parallelizing the tuning of single-node models.
* Identify Hyperopt as a solution for Bayesian hyperparameter inference for distributed models.
* Parallelize the tuning of hyperparameters for Spark ML models using Hyperopt and Trials.
* Identify the relationship between the number of trials and model accuracy.
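Because a Spark ML model is already trained in a distributed fashion, the usual pattern is the default serial Trials rather than SparkTrials, as in this hedged sketch (train_df and val_df are assumed to exist):

    from hyperopt import fmin, tpe, hp, Trials
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    def objective(params):
        lr = LinearRegression(featuresCol="features", labelCol="price",
                              regParam=params["regParam"])
        model = lr.fit(train_df)          # training itself runs on the cluster
        preds = model.transform(val_df)
        return RegressionEvaluator(labelCol="price",
                                   metricName="rmse").evaluate(preds)

    best = fmin(fn=objective,
                space={"regParam": hp.loguniform("regParam", -5, 0)},
                algo=tpe.suggest,
                max_evals=8,
                trials=Trials())   # plain Trials: each trial's Spark job is distributed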
Pandas API on Spark
* Describe key differences between Spark DataFrames and Pandas on Spark DataFrames.
* Identify that the use of an InternalFrame makes the Pandas API on Spark not quite as fast as native Spark.
* Identify Pandas API on Spark as a solution for scaling data pipelines without much refactoring.
* Convert data between a PySpark DataFrame and a Pandas on Spark DataFrame.
* Identify how to import and use the Pandas on Spark APIs.
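A small sketch of the Pandas API on Spark; the parquet path and column names are hypothetical:

    import pyspark.pandas as ps

    # Read data with pandas-style syntax while executing on Spark.
    psdf = ps.read_parquet("/path/to/data")
    psdf["total"] = psdf["price"] * psdf["quantity"]

    # Convert between representations.
    spark_df = psdf.to_spark()       # Pandas-on-Spark -> PySpark DataFrame
    psdf2 = spark_df.pandas_api()    # PySpark DataFrame -> Pandas-on-Spark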
Pandas UDFs/Function APIs
* Identify Apache Arrow as the key to Pandas <-> Spark conversions.
* Describe why iterator UDFs are preferred for large data.
* Apply a model in parallel using a Pandas UDF.
* Identify that pandas code can be used inside of a UDF function.
* Train / apply group-specific models using the Pandas Function API.
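A hedged sketch of both patterns; load_model and the column names are hypothetical stand-ins:

    from typing import Iterator
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Iterator-style pandas UDF: expensive setup (e.g., loading a model) happens
    # once per executor rather than once per batch, which helps on large data.
    @pandas_udf("double")
    def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        model = load_model()               # hypothetical loader, runs once
        for batch in batches:
            yield pd.Series(model.predict(batch.to_frame()))

    scored = spark_df.withColumn("prediction", predict_udf("feature"))

    # Pandas Function API: train one model per group with applyInPandas.
    def fit_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # Fit a group-specific model on pdf and return per-group results (sketch).
        return pd.DataFrame({"device_id": [pdf["device_id"].iloc[0]],
                             "n_rows": [len(pdf)]})

    results = spark_df.groupBy("device_id").applyInPandas(
        fit_per_group, schema="device_id long, n_rows long")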
Section 4: Scaling ML Models
Model Distribution
* Describe how Spark scales linear regression.
* Describe how Spark scales decision trees.
Ensembling Distribution
* Describe the basic concepts of ensemble learning.
* Compare and contrast bagging, boosting, and stacking.
Sample Question and Answers
QUESTION 1
A machine learning engineer has created a Feature Table new_table using Feature Store Client fs.
When creating the table, they specified a metadata description with key information about the
Feature Table. They now want to retrieve that metadata programmatically.
Which of the following lines of code will return the metadata description?
A. There is no way to return the metadata description programmatically.
B. fs.create_training_set("new_table")
C. fs.get_table("new_table").description
D. fs.get_table("new_table").load_df()
E. fs.get_table("new_table")
Answer: C
Explanation:
To retrieve the metadata description of a feature table created using the Feature Store Client
(referred to here as fs), call get_table on the fs client with the table name as an argument, then
access the description attribute of the returned object. The code snippet
fs.get_table("new_table").description correctly achieves this by fetching the table object for
"new_table" and accessing its description attribute, where the metadata is stored. The other
options do not retrieve the metadata description.
Reference:
Databricks Feature Store documentation (Accessing Feature Table Metadata).
QUESTION 2
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that
contains only the rows from spark_df where the value in column price is greater than 0.
Which of the following code blocks will accomplish this task?
A. spark_df[spark_df["price"] > 0]
B. spark_df.filter(col("price") > 0)
C. SELECT * FROM spark_df WHERE price > 0
D. spark_df.loc[spark_df["price"] > 0,:]
E. spark_df.loc[:,spark_df["price"] > 0]
Answer: B
Explanation:
To filter rows in a Spark DataFrame based on a condition, you use the filter method along with a
column condition. The correct syntax in PySpark is spark_df.filter(col("price") > 0), which filters
the DataFrame to include only those rows where the value in the "price" column is greater than 0.
The col function is used to specify column-based operations. The other options either do not use
correct Spark DataFrame syntax or are intended for different data manipulation frameworks such
as pandas.
Reference:
PySpark DataFrame API documentation (Filtering DataFrames).
QUESTION 3
A health organization is developing a classification model to determine whether or not a patient
currently has a specific type of infection. The organization’s leaders want to maximize the number of
positive cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?
A. RMSE
B. Precision
C. Area under the residual operating curve
D. Accuracy
E. Recall
Answer: E
Explanation:
When the goal is to maximize the identification of positive cases in a classification task, the metric of
interest is Recall. Recall, also known as sensitivity, measures the proportion of actual positives that
are correctly identified by the model (i.e., the true positive rate). It is crucial for scenarios where
missing a positive case (false negative) has serious implications, such as in medical diagnostics.
The other metrics like Precision, RMSE, and Accuracy serve different aspects of performance
measurement and are not specifically focused on maximizing the detection of positive cases alone.
Reference:
Classification Metrics in Machine Learning (Understanding Recall).
QUESTION 4
In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?
A. When the features are of the categorical type
B. When the features are of the boolean type
C. When the features contain a lot of extreme outliers
D. When the features contain no outliers
E. When the features contain no missing values
Answer: C
Explanation:
Imputing missing values with the median is often preferred over the mean in scenarios where the
data contains a lot of extreme outliers. The median is a more robust measure of central tendency in
such cases, as it is not as heavily influenced by outliers as the mean. Using the median ensures that
the imputed values are more representative of the typical data point, thus preserving the integrity of
the dataset’s distribution. The other options are not specifically relevant to the question of handling
outliers in numerical data.
Reference:
Data Imputation Techniques (Dealing with Outliers).
QUESTION 5
A data scientist has replaced missing values in their feature set with each respective feature
variable's median value. A colleague suggests that the data scientist is throwing away valuable
information by doing this.
Which of the following approaches can they take to include as much information as possible in the feature set?
A. Impute the missing values using each respective feature variable’s mean value instead of the median value
B. Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them
C. Remove all feature variables that originally contained missing values from the feature set
D. Create a binary feature variable for each feature that contained missing values indicating whether each row’s value has been imputed
E. Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
Answer: D
Explanation:
By creating a binary feature variable for each feature with missing values to indicate whether a value
has been imputed, the data scientist can preserve information about the original state of the data.
This approach maintains the integrity of the dataset by marking which values are original and which
are synthetic (imputed). Here are the steps to implement this approach:
Identify Missing Values: Determine which features contain missing values.
Impute Missing Values: Continue with median imputation or choose another method (mean, mode,
regression, etc.) to fill missing values.
Create Indicator Variables: For each feature that had missing values, add a new binary feature. This
feature should be ‘1’ if the original value was missing and imputed, and ‘0’ otherwise.
Data Integration: Integrate these new binary features into the existing dataset. This maintains a
record of where data imputation occurred, allowing models to potentially weight these observations differently.
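A minimal pandas sketch of the indicator-plus-imputation pattern described in these steps, using a toy income column:

    import pandas as pd

    df = pd.DataFrame({"income": [52_000, None, 61_000, None]})   # toy data

    # Flag rows that were missing, then impute with the median.
    df["income_missing"] = df["income"].isna().astype(int)
    df["income"] = df["income"].fillna(df["income"].median())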