synapse.ml.causal package

Submodules

synapse.ml.causal.DoubleMLEstimator module

class synapse.ml.causal.DoubleMLEstimator.DoubleMLEstimator(java_obj=None, confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters
  • confidenceLevel (float) – confidence level, default value is 0.975

  • featuresCol (str) – The name of the features column

  • maxIter (int) – maximum number of iterations (>= 0)

  • outcomeCol (str) – outcome column

  • outcomeModel (object) – outcome model to run

  • parallelism (int) – the number of threads to use when running parallel algorithms

  • sampleSplitRatio (list) – Sample split ratio for cross-fitting. Default: [0.5, 0.5].

  • treatmentCol (str) – treatment column

  • treatmentModel (object) – treatment model to run

  • weightCol (str) – The name of the weight column
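
The parameters above map onto the usual double machine learning recipe: split the sample, fit a treatment model and an outcome model on one part, compute residuals on the other part, and regress outcome residuals on treatment residuals. The following is a minimal plain-Python sketch of that general idea, not SynapseML's implementation. The helper names (`fit_line`, `residuals`, `cross_fit_effect`), the one-feature linear models, the continuous treatment, and the synthetic data are all assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(model, xs, ys):
    """Observed value minus the fitted line's prediction, row by row."""
    intercept, slope = model
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def cross_fit_effect(rows, split_ratio=(0.5, 0.5)):
    """rows holds (feature, treatment, outcome) tuples (hypothetical layout)."""
    cut = int(len(rows) * split_ratio[0] / sum(split_ratio))
    folds = (rows[:cut], rows[cut:])
    estimates = []
    # Train on one fold, residualize the other, then swap the roles.
    for train, test in (folds, folds[::-1]):
        x_train = [r[0] for r in train]
        treatment_model = fit_line(x_train, [r[1] for r in train])
        outcome_model = fit_line(x_train, [r[2] for r in train])
        x_test = [r[0] for r in test]
        t_res = residuals(treatment_model, x_test, [r[1] for r in test])
        y_res = residuals(outcome_model, x_test, [r[2] for r in test])
        # Regress outcome residuals on treatment residuals (no intercept).
        numerator = sum(t * y for t, y in zip(t_res, y_res))
        denominator = sum(t * t for t in t_res)
        estimates.append(numerator / denominator)
    return sum(estimates) / len(estimates)


# Synthetic data: the feature confounds both treatment and outcome.
rng = random.Random(0)
TRUE_EFFECT = 2.0
rows = []
for _ in range(4000):
    x = rng.gauss(0, 1)
    t = 0.8 * x + rng.gauss(0, 1)
    y = TRUE_EFFECT * t + 1.5 * x + rng.gauss(0, 1)
    rows.append((x, t, y))

estimate = cross_fit_effect(rows)
print(f"estimated effect: {estimate:.3f}")
```

In this sketch `split_ratio` plays the role that `sampleSplitRatio` is documented to play (the split used for cross-fitting), and the two fitted lines stand in for `treatmentModel` and `outcomeModel`.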

confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
getConfidenceLevel()[source]
Returns

confidence level, default value is 0.975

Return type

float

getFeaturesCol()[source]
Returns

The name of the features column

Return type

str

static getJavaPackage()[source]

Returns package name String.

getMaxIter()[source]
Returns

maximum number of iterations (>= 0)

Return type

int

getOutcomeCol()[source]
Returns

outcome column

Return type

str

getOutcomeModel()[source]
Returns

outcome model to run

Return type

object

getParallelism()[source]
Returns

the number of threads to use when running parallel algorithms

Return type

int

getSampleSplitRatio()[source]
Returns

Sample split ratio for cross-fitting. Default: [0.5, 0.5].

Return type

list

getTreatmentCol()[source]
Returns

treatment column

Return type

str

getTreatmentModel()[source]
Returns

treatment model to run

Return type

object

getWeightCol()[source]
Returns

The name of the weight column

Return type

str

maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
classmethod read()[source]

Returns an MLReader instance for this class.

sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
setConfidenceLevel(value)[source]
Parameters

value (float) – confidence level, default value is 0.975

setFeaturesCol(value)[source]
Parameters

value (str) – The name of the features column

setMaxIter(value)[source]
Parameters

value (int) – maximum number of iterations (>= 0)

setOutcomeCol(value)[source]
Parameters

value (str) – outcome column

setOutcomeModel(value)[source]
Parameters

value (object) – outcome model to run

setParallelism(value)[source]
Parameters

value (int) – the number of threads to use when running parallel algorithms

setParams(confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]

Set the (keyword only) parameters

setSampleSplitRatio(value)[source]
Parameters

value (list) – Sample split ratio for cross-fitting. Default: [0.5, 0.5].

setTreatmentCol(value)[source]
Parameters

value (str) – treatment column

setTreatmentModel(value)[source]
Parameters

value (object) – treatment model to run

setWeightCol(value)[source]
Parameters

value (str) – The name of the weight column

treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')

synapse.ml.causal.DoubleMLModel module

class synapse.ml.causal.DoubleMLModel.DoubleMLModel(java_obj=None, confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, rawTreatmentEffects=None, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]

Bases: synapse.ml.causal._DoubleMLModel._DoubleMLModel

getAvgTreatmentEffect()[source]
getConfidenceInterval()[source]
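
This reference does not describe how the two getters above derive their values. One common approach, shown here purely as an assumption, is to average the raw treatment effects collected over the iterations and to read a percentile interval off the same values, with bounds at `1 - confidenceLevel` and `confidenceLevel`. The function names and the sample effects below are hypothetical.

```python
import math


def avg_treatment_effect(raw_effects):
    """Arithmetic mean of the per-iteration effect estimates."""
    return sum(raw_effects) / len(raw_effects)


def percentile(sorted_values, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of pre-sorted values."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def confidence_interval(raw_effects, confidence_level=0.975):
    """Percentile interval; assumed bounds at 1 - level and level."""
    ordered = sorted(raw_effects)
    return (
        percentile(ordered, 1 - confidence_level),
        percentile(ordered, confidence_level),
    )


raw_effects = [1.8, 2.1, 1.9, 2.3, 2.0]  # made-up values for illustration
ate = avg_treatment_effect(raw_effects)
lower, upper = confidence_interval(raw_effects)
print(ate, lower, upper)
```

With a single iteration (the documented default `maxIter=1`) an interval built this way would collapse to a single point, so a percentile interval is only informative when several iterations are run.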

synapse.ml.causal.ResidualTransformer module

class synapse.ml.causal.ResidualTransformer.ResidualTransformer(java_obj=None, classIndex=1, observedCol='label', outputCol='residual', predictedCol='prediction')[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters
  • classIndex (int) – The index of the class to compute residual for classification outputs. Default value is 1.

  • observedCol (str) – observed data (label column)

  • outputCol (str) – The name of the output column

  • predictedCol (str) – predicted data (prediction or probability columns)
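
As a rough illustration of what these parameters describe, the sketch below computes a residual as observed minus predicted, and uses a class index to pick one entry when the prediction is a vector of class probabilities. It is a plain-Python approximation under that assumption, not the transformer's actual code; the function name `residual` and the sample values are hypothetical.

```python
def residual(observed, predicted, class_index=1):
    """Return observed minus predicted.

    If predicted is a sequence of class probabilities, the entry at
    class_index is used as the prediction (assumed behavior).
    """
    if isinstance(predicted, (list, tuple)):
        predicted = predicted[class_index]
    return observed - predicted


# Regression-style output: a single predicted number.
regression_residual = residual(3.0, 2.5)

# Classification-style output: probabilities for class 0 and class 1.
classification_residual = residual(1.0, [0.3, 0.7])

print(regression_residual, classification_residual)
```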

classIndex = Param(parent='undefined', name='classIndex', doc='The index of the class to compute residual for classification outputs. Default value is 1.')
getClassIndex()[source]
Returns

The index of the class to compute residual for classification outputs. Default value is 1.

Return type

int

static getJavaPackage()[source]

Returns package name String.

getObservedCol()[source]
Returns

observed data (label column)

Return type

str

getOutputCol()[source]
Returns

The name of the output column

Return type

str

getPredictedCol()[source]
Returns

predicted data (prediction or probability columns)

Return type

str

observedCol = Param(parent='undefined', name='observedCol', doc='observed data (label column)')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
predictedCol = Param(parent='undefined', name='predictedCol', doc='predicted data (prediction or probability columns')
classmethod read()[source]

Returns an MLReader instance for this class.

setClassIndex(value)[source]
Parameters

value (int) – The index of the class to compute residual for classification outputs. Default value is 1.

setObservedCol(value)[source]
Parameters

value (str) – observed data (label column)

setOutputCol(value)[source]
Parameters

value (str) – The name of the output column

setParams(classIndex=1, observedCol='label', outputCol='residual', predictedCol='prediction')[source]

Set the (keyword only) parameters

setPredictedCol(value)[source]
Parameters

value (str) – predicted data (prediction or probability columns)

Module contents

SynapseML is an ecosystem of tools that extends the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.