EmrAddStepsOperatorTransformer

class ditto.transformers.emr.EmrAddStepsOperatorTransformer(target_dag, defaults)[source]

Bases: ditto.api.OperatorTransformer

Transforms the operator EmrAddStepsOperator

Parameters
  • target_dag (DAG) – the target to which the transformed operators must be added

  • defaults (TransformerDefaults) – the default configuration for this transformer

DEFAULT_PROXY_USER = 'admin'

default livy proxy user

DEFAULT_SPARK_CONF = {'spark.shuffle.compress': 'false'}

default spark configuration to use

HADOOP_DISTCP_DEFAULT_MAPPERS = 100

number of mappers to use in case it's a distcp
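For reference, a minimal construction sketch is shown below. The DAG arguments and the TransformerDefaults fields are illustrative assumptions, not required values:

    from airflow import DAG
    from ditto.api import TransformerDefaults
    from ditto.transformers.emr import EmrAddStepsOperatorTransformer

    # the target DAG that will receive the transformed operators
    target_dag = DAG(dag_id="hdinsight_dag", schedule_interval=None)

    # default configuration for this transformer; the constructor
    # arguments of TransformerDefaults are an assumption here
    defaults = TransformerDefaults(default_operator=None)

    transformer = EmrAddStepsOperatorTransformer(target_dag, defaults)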

static get_target_step_task_id(add_step_task_id, add_step_step_id)[source]

generates a task_id for the transformed output operator corresponding to each input EMR step
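A hypothetical call, for illustration; the exact naming scheme of the generated task_id is an assumption:

    # combines the source task_id with the step id so that each EMR step
    # maps to a unique task in the target DAG; the output format shown in
    # the comment below is an assumption
    task_id = EmrAddStepsOperatorTransformer.get_target_step_task_id(
        add_step_task_id="add_steps",
        add_step_step_id="step_0",
    )
    # e.g. something like "add_steps_step_0" (illustrative)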

transform(src_operator, parent_fragment, upstream_fragments)[source]

This transformer assumes and relies on the fact that an upstream transformation of an EmrCreateJobFlowOperator has already taken place, since it needs the output of that transformation to obtain the cluster_name and azure_conn_id from the resulting operator (which should be an AzureHDInsightCreateClusterOperator).

It then goes through the EMR steps of this EmrAddStepsOperator and creates a LivyBatchOperator or an AzureHDInsightSshOperator for each corresponding step, based on grokking the step's params and figuring out whether it's a Spark job or an arbitrary Hadoop command like distcp, hdfs, or the like (a rough sketch of that decision follows).
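The kind of inspection described above might look roughly like the following. This is a hedged sketch of the decision, not ditto's actual grokking logic, and looks_like_spark_step is a hypothetical helper:

    def looks_like_spark_step(step: dict) -> bool:
        # EMR steps typically invoke spark-submit via command-runner.jar
        # when they are Spark jobs; anything else (distcp, hdfs, ...) would
        # be treated as a plain Hadoop command and mapped to an
        # AzureHDInsightSshOperator rather than a LivyBatchOperator
        args = step.get("HadoopJarStep", {}).get("Args", [])
        return any("spark-submit" in str(arg) for arg in args)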

Note

This transformer creates multiple operators from a single source operator

Note

The Spark configuration for the Livy spark job is derived from step['HadoopJarStep']['Properties'] of the EMR step, or can even be specified at the cluster level itself when transforming the job flow.
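As an illustration of where that configuration lives, here is a typical EMR step dict in the standard EMR API shape; the names, jar, and S3 path are placeholders, and the Properties entries are what would feed the Livy batch's Spark conf:

    emr_step = {
        "Name": "process_data",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--class", "com.example.Main",
                     "s3://mybucket/app.jar"],
            # per the note above, these properties become the Spark conf
            # of the generated Livy batch job
            "Properties": [
                {"Key": "spark.shuffle.compress", "Value": "false"},
            ],
        },
    }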

Return type

DAGFragment
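Finally, a sketch of invoking transform() directly; in practice ditto's converter drives this call. The variable names are placeholders, and upstream_fragments must already contain the transformed EmrCreateJobFlowOperator output as noted above:

    # src_operator: the source EmrAddStepsOperator from the EMR DAG
    # parent_fragment: the DAGFragment produced for this operator's parent
    # upstream_fragments: previously transformed fragments, including the
    #   one holding the AzureHDInsightCreateClusterOperator
    fragment = transformer.transform(
        src_operator=add_steps_op,
        parent_fragment=parent_fragment,
        upstream_fragments=upstream_fragments,
    )
    # fragment is a DAGFragment containing one LivyBatchOperator or
    # AzureHDInsightSshOperator per EMR step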