EmrAddStepsOperatorTransformer

class ditto.transformers.emr.EmrAddStepsOperatorTransformer(target_dag, defaults)

   Bases: ditto.api.OperatorTransformer
   Transforms the operator EmrAddStepsOperator.

   Parameters:
      - target_dag (DAG) – the target DAG to which the transformed operators must be added
      - defaults (TransformerDefaults) – the default configuration for this transformer
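   A minimal construction sketch follows; it assumes TransformerDefaults accepts a
   default_operator argument (check ditto.api for its actual fields), and the DAG
   name is illustrative:

      from datetime import datetime

      from airflow import DAG
      from ditto.api import TransformerDefaults
      from ditto.transformers.emr import EmrAddStepsOperatorTransformer

      # Target DAG that the transformed (HDInsight/Livy) operators are added to.
      target_dag = DAG(dag_id="hdinsight_dag", start_date=datetime(2024, 1, 1))

      # Assumption: a TransformerDefaults with no default operator; in practice
      # ditto constructs its transformers itself while transforming a DAG.
      defaults = TransformerDefaults(default_operator=None)

      transformer = EmrAddStepsOperatorTransformer(target_dag, defaults)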
   DEFAULT_PROXY_USER = 'admin'
      The default Livy proxy user.
   DEFAULT_SPARK_CONF = {'spark.shuffle.compress': 'false'}
      The default Spark configuration to use.
   HADOOP_DISTCP_DEFAULT_MAPPERS = 100
      Number of mappers to use when the step is a DistCp job.
   static get_target_step_task_id(add_step_task_id, add_step_step_id)
      Generates a task_id for the transformed output operators for each input EMR step.
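      For example, a hypothetical call (the exact format of the generated task_id
      is an implementation detail; the point is one deterministic id per EMR step):

         task_id = EmrAddStepsOperatorTransformer.get_target_step_task_id(
             add_step_task_id="add_steps",  # task_id of the source EmrAddStepsOperator
             add_step_step_id=0,            # index of the EMR step within that operator (assumed)
         )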
   transform(src_operator, parent_fragment, upstream_fragments)
      This transformer assumes and relies on the fact that an upstream transformation
      of an EmrCreateJobFlowOperator has already taken place, since it needs to find
      the output of that transformation to get the cluster_name and azure_conn_id
      from that operator (which should have been an AzureHDInsightCreateClusterOperator).

      It then goes through the EMR steps of this EmrAddStepsOperator and creates a
      LivyBatchOperator or an AzureHDInsightSshOperator for each corresponding step,
      based on grokking the step's params and figuring out whether it is a Spark job
      being run or an arbitrary Hadoop command like distcp, hdfs, or the like. A
      sketch of both step shapes follows this method's entry.

      Note: This transformer creates multiple operators from a single source operator.

      Note: The Spark configuration for the Livy Spark job is derived from
      step['HadoopJarStep']['Properties'] of the EMR step, or can even be specified
      at the cluster level itself when transforming the job flow.
      Return type: DAGFragment
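   The two step shapes this method distinguishes can be pictured with representative
   EMR step definitions. These dicts follow the shape EmrAddStepsOperator accepts (the
   EMR AddJobFlowSteps API), not ditto code; names, jars and paths are illustrative,
   and the comments restate the mapping described above:

      # Per the description above, the first step would become a LivyBatchOperator
      # (a Spark job submitted via spark-submit) and the second an
      # AzureHDInsightSshOperator (an arbitrary Hadoop command, here a DistCp).
      emr_steps = [
          {
              "Name": "spark_etl",
              "HadoopJarStep": {
                  "Jar": "command-runner.jar",
                  "Args": ["spark-submit", "--class", "com.example.Etl",
                           "s3://example-bucket/etl.jar"],
                  # The Livy batch's Spark conf is derived from these properties:
                  "Properties": [{"Key": "spark.shuffle.compress", "Value": "false"}],
              },
          },
          {
              "Name": "copy_data",
              "HadoopJarStep": {
                  "Jar": "command-runner.jar",
                  "Args": ["hadoop", "distcp", "s3://example-src/", "s3://example-dst/"],
              },
          },
      ]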