synapse.ml.causal package

Submodules

synapse.ml.causal.DiffInDiffEstimator module

class synapse.ml.causal.DiffInDiffEstimator.DiffInDiffEstimator(java_obj=None, outcomeCol=None, postTreatmentCol=None, treatmentCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • outcomeCol (str) – outcome column

  • postTreatmentCol (str) – post treatment indicator column

  • treatmentCol (str) – treatment column
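
The three columns above map onto the classical 2x2 difference-in-differences estimate: the change in the mean outcome for treated units minus the change for control units. The following is a minimal plain-Python sketch of that textbook formula, not the library's implementation; the function name diff_in_diff, the sample rows, and the column names "outcome", "treatment" and "post" are hypothetical.

```python
from statistics import mean


def diff_in_diff(rows, outcome_col="outcome", treatment_col="treatment", post_col="post"):
    """Classical 2x2 difference-in-differences estimate from a list of dicts."""

    def cell_mean(treated, post):
        # Mean outcome for one (treatment, post) cell of the 2x2 design.
        values = [
            r[outcome_col]
            for r in rows
            if r[treatment_col] == treated and r[post_col] == post
        ]
        return mean(values)

    treated_change = cell_mean(1, 1) - cell_mean(1, 0)
    control_change = cell_mean(0, 1) - cell_mean(0, 0)
    return treated_change - control_change


# Hypothetical data: one row per (group, period) cell.
rows = [
    {"outcome": 10.0, "treatment": 0, "post": 0},
    {"outcome": 12.0, "treatment": 0, "post": 1},
    {"outcome": 11.0, "treatment": 1, "post": 0},
    {"outcome": 18.0, "treatment": 1, "post": 1},
]
effect = diff_in_diff(rows)  # (18 - 11) - (12 - 10)
```

In this toy data the treated group rises by 7 and the control group by 2, so the estimated effect is 5.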

static getJavaPackage()[source]

Returns package name String.

getOutcomeCol()[source]
Returns

outcome column

Return type

outcomeCol

getPostTreatmentCol()[source]
Returns

post treatment indicator column

Return type

postTreatmentCol

getTreatmentCol()[source]
Returns

treatment column

Return type

treatmentCol

outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
postTreatmentCol = Param(parent='undefined', name='postTreatmentCol', doc='post treatment indicator column')
classmethod read()[source]

Returns an MLReader instance for this class.

setOutcomeCol(value)[source]
Parameters

outcomeCol – outcome column

setParams(outcomeCol=None, postTreatmentCol=None, treatmentCol=None)[source]

Set the (keyword only) parameters

setPostTreatmentCol(value)[source]
Parameters

postTreatmentCol – post treatment indicator column

setTreatmentCol(value)[source]
Parameters

treatmentCol – treatment column

treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')

synapse.ml.causal.DiffInDiffModel module

class synapse.ml.causal.DiffInDiffModel.DiffInDiffModel(java_obj=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

synapse.ml.causal.DoubleMLEstimator module

class synapse.ml.causal.DoubleMLEstimator.DoubleMLEstimator(java_obj=None, confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • confidenceLevel (float) – confidence level, default value is 0.975

  • featuresCol (str) – The name of the features column

  • maxIter (int) – maximum number of iterations (>= 0)

  • outcomeCol (str) – outcome column

  • outcomeModel (object) – outcome model to run

  • parallelism (int) – the number of threads to use when running parallel algorithms

  • sampleSplitRatio (list) – Sample split ratio for cross-fitting. Default: [0.5, 0.5].

  • treatmentCol (str) – treatment column

  • treatmentModel (object) – treatment model to run

  • weightCol (str) – The name of the weight column
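
The parameters above describe a double machine learning setup: one model predicts the treatment, another predicts the outcome, and sampleSplitRatio controls the cross-fitting split. The sketch below shows the textbook residual-on-residual idea in plain Python with a single confounder and simple linear models. It is an illustration under those assumptions, not the library's algorithm; fit_line, residuals, double_ml and the synthetic data are all hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def residuals(model, xs, ys):
    """Observed minus fitted values for a (intercept, slope) model."""
    a, b = model
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def double_ml(x, t, y, split_ratio=(0.5, 0.5), seed=0):
    """Cross-fitted residual-on-residual estimate of the treatment effect."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * split_ratio[0] / sum(split_ratio))
    fold_a, fold_b = idx[:cut], idx[cut:]

    def pick(values, ids):
        return [values[i] for i in ids]

    t_res, y_res = [], []
    # Fit nuisance models on one fold, compute residuals on the other, then swap.
    for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
        t_model = fit_line(pick(x, train), pick(t, train))
        y_model = fit_line(pick(x, train), pick(y, train))
        t_res += residuals(t_model, pick(x, test), pick(t, test))
        y_res += residuals(y_model, pick(x, test), pick(y, test))
    # Regress outcome residuals on treatment residuals (no intercept).
    return sum(a * b for a, b in zip(t_res, y_res)) / sum(a * a for a in t_res)


# Synthetic data with a known effect of 2.0 and one confounder x.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(2000)]
t = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [2.0 * ti + 1.5 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]
theta = double_ml(x, t, y)
```

Because the nuisance models are fitted on one fold and evaluated on the other, the final regression does not reuse the rows the models were trained on, which is the purpose of cross-fitting.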

confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
getConfidenceLevel()[source]
Returns

confidence level, default value is 0.975

Return type

confidenceLevel

getFeaturesCol()[source]
Returns

The name of the features column

Return type

featuresCol

static getJavaPackage()[source]

Returns package name String.

getMaxIter()[source]
Returns

maximum number of iterations (>= 0)

Return type

maxIter

getOutcomeCol()[source]
Returns

outcome column

Return type

outcomeCol

getOutcomeModel()[source]
Returns

outcome model to run

Return type

outcomeModel

getParallelism()[source]
Returns

the number of threads to use when running parallel algorithms

Return type

parallelism

getSampleSplitRatio()[source]
Returns

Sample split ratio for cross-fitting. Default: [0.5, 0.5].

Return type

sampleSplitRatio

getTreatmentCol()[source]
Returns

treatment column

Return type

treatmentCol

getTreatmentModel()[source]
Returns

treatment model to run

Return type

treatmentModel

getWeightCol()[source]
Returns

The name of the weight column

Return type

weightCol

maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
classmethod read()[source]

Returns an MLReader instance for this class.

sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
setConfidenceLevel(value)[source]
Parameters

confidenceLevel – confidence level, default value is 0.975

setFeaturesCol(value)[source]
Parameters

featuresCol – The name of the features column

setMaxIter(value)[source]
Parameters

maxIter – maximum number of iterations (>= 0)

setOutcomeCol(value)[source]
Parameters

outcomeCol – outcome column

setOutcomeModel(value)[source]
Parameters

outcomeModel – outcome model to run

setParallelism(value)[source]
Parameters

parallelism – the number of threads to use when running parallel algorithms

setParams(confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]

Set the (keyword only) parameters

setSampleSplitRatio(value)[source]
Parameters

sampleSplitRatio – Sample split ratio for cross-fitting. Default: [0.5, 0.5].

setTreatmentCol(value)[source]
Parameters

treatmentCol – treatment column

setTreatmentModel(value)[source]
Parameters

treatmentModel – treatment model to run

setWeightCol(value)[source]
Parameters

weightCol – The name of the weight column

treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')

synapse.ml.causal.DoubleMLModel module

class synapse.ml.causal.DoubleMLModel.DoubleMLModel(java_obj=None, confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, rawTreatmentEffects=None, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

getAvgTreatmentEffect()[source]
getConfidenceInterval()[source]
getPValue()[source]
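
The constructor accepts rawTreatmentEffects and confidenceLevel, which suggests the summaries above are derived from a list of per-iteration effect estimates. The sketch below shows one plausible way to do that in plain Python: the mean as the average effect and percentile bounds as the interval. This is an assumption about the approach, not a statement of how the class computes its results; percentile and summarize are hypothetical helpers.

```python
from statistics import mean


def percentile(values, q):
    """Linear-interpolation percentile of values for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(raw_effects, confidence_level=0.975):
    """Return (average effect, (lower bound, upper bound))."""
    avg = mean(raw_effects)
    lower = percentile(raw_effects, 100 * (1 - confidence_level))
    upper = percentile(raw_effects, 100 * confidence_level)
    return avg, (lower, upper)


# Hypothetical per-iteration estimates.
raw_effects = [1.8, 2.1, 2.0, 1.9, 2.2]
avg, (low, high) = summarize(raw_effects)
```

With a single iteration (maxIter=1) the list holds one value, so both bounds collapse onto the point estimate; more iterations are needed for a meaningful interval under this scheme.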

synapse.ml.causal.OrthoForestDMLEstimator module

class synapse.ml.causal.OrthoForestDMLEstimator.OrthoForestDMLEstimator(java_obj=None, confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • confidenceLevel (float) – confidence level, default value is 0.975

  • confounderVecCol (str) – Confounders to control for

  • featuresCol (str) – The name of the features column

  • heterogeneityVecCol (str) – Vector to divide the treatment by

  • maxDepth (int) – Max Depth of Tree

  • maxIter (int) – maximum number of iterations (>= 0)

  • minSamplesLeaf (int) – Minimum number of samples per leaf

  • numTrees (int) – Number of trees

  • outcomeCol (str) – outcome column

  • outcomeModel (object) – outcome model to run

  • outcomeResidualCol (str) – Outcome Residual Column

  • outputCol (str) – The name of the output column

  • outputHighCol (str) – Output Confidence Interval High

  • outputLowCol (str) – Output Confidence Interval Low

  • parallelism (int) – the number of threads to use when running parallel algorithms

  • sampleSplitRatio (list) – Sample split ratio for cross-fitting. Default: [0.5, 0.5].

  • treatmentCol (str) – treatment column

  • treatmentModel (object) – treatment model to run

  • treatmentResidualCol (str) – Treatment Residual Column

  • weightCol (str) – The name of the weight column

confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
confounderVecCol = Param(parent='undefined', name='confounderVecCol', doc='Confounders to control for')
featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
getConfidenceLevel()[source]
Returns

confidence level, default value is 0.975

Return type

confidenceLevel

getConfounderVecCol()[source]
Returns

Confounders to control for

Return type

confounderVecCol

getFeaturesCol()[source]
Returns

The name of the features column

Return type

featuresCol

getHeterogeneityVecCol()[source]
Returns

Vector to divide the treatment by

Return type

heterogeneityVecCol

static getJavaPackage()[source]

Returns package name String.

getMaxDepth()[source]
Returns

Max Depth of Tree

Return type

maxDepth

getMaxIter()[source]
Returns

maximum number of iterations (>= 0)

Return type

maxIter

getMinSamplesLeaf()[source]
Returns

Minimum number of samples per leaf

Return type

minSamplesLeaf

getNumTrees()[source]
Returns

Number of trees

Return type

numTrees

getOutcomeCol()[source]
Returns

outcome column

Return type

outcomeCol

getOutcomeModel()[source]
Returns

outcome model to run

Return type

outcomeModel

getOutcomeResidualCol()[source]
Returns

Outcome Residual Column

Return type

outcomeResidualCol

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

getOutputHighCol()[source]
Returns

Output Confidence Interval High

Return type

outputHighCol

getOutputLowCol()[source]
Returns

Output Confidence Interval Low

Return type

outputLowCol

getParallelism()[source]
Returns

the number of threads to use when running parallel algorithms

Return type

parallelism

getSampleSplitRatio()[source]
Returns

Sample split ratio for cross-fitting. Default: [0.5, 0.5].

Return type

sampleSplitRatio

getTreatmentCol()[source]
Returns

treatment column

Return type

treatmentCol

getTreatmentModel()[source]
Returns

treatment model to run

Return type

treatmentModel

getTreatmentResidualCol()[source]
Returns

Treatment Residual Column

Return type

treatmentResidualCol

getWeightCol()[source]
Returns

The name of the weight column

Return type

weightCol

heterogeneityVecCol = Param(parent='undefined', name='heterogeneityVecCol', doc='Vector to divide the treatment by')
maxDepth = Param(parent='undefined', name='maxDepth', doc='Max Depth of Tree')
maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
minSamplesLeaf = Param(parent='undefined', name='minSamplesLeaf', doc='Max Depth of Tree')
numTrees = Param(parent='undefined', name='numTrees', doc='Number of trees')
outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
outcomeResidualCol = Param(parent='undefined', name='outcomeResidualCol', doc='Outcome Residual Column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
outputHighCol = Param(parent='undefined', name='outputHighCol', doc='Output Confidence Interval Low')
outputLowCol = Param(parent='undefined', name='outputLowCol', doc='Output Confidence Interval Low')
parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
classmethod read()[source]

Returns an MLReader instance for this class.

sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
setConfidenceLevel(value)[source]
Parameters

confidenceLevel – confidence level, default value is 0.975

setConfounderVecCol(value)[source]
Parameters

confounderVecCol – Confounders to control for

setFeaturesCol(value)[source]
Parameters

featuresCol – The name of the features column

setHeterogeneityVecCol(value)[source]
Parameters

heterogeneityVecCol – Vector to divide the treatment by

setMaxDepth(value)[source]
Parameters

maxDepth – Max Depth of Tree

setMaxIter(value)[source]
Parameters

maxIter – maximum number of iterations (>= 0)

setMinSamplesLeaf(value)[source]
Parameters

minSamplesLeaf – Minimum number of samples per leaf

setNumTrees(value)[source]
Parameters

numTrees – Number of trees

setOutcomeCol(value)[source]
Parameters

outcomeCol – outcome column

setOutcomeModel(value)[source]
Parameters

outcomeModel – outcome model to run

setOutcomeResidualCol(value)[source]
Parameters

outcomeResidualCol – Outcome Residual Column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setOutputHighCol(value)[source]
Parameters

outputHighCol – Output Confidence Interval High

setOutputLowCol(value)[source]
Parameters

outputLowCol – Output Confidence Interval Low

setParallelism(value)[source]
Parameters

parallelism – the number of threads to use when running parallel algorithms

setParams(confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]

Set the (keyword only) parameters

setSampleSplitRatio(value)[source]
Parameters

sampleSplitRatio – Sample split ratio for cross-fitting. Default: [0.5, 0.5].

setTreatmentCol(value)[source]
Parameters

treatmentCol – treatment column

setTreatmentModel(value)[source]
Parameters

treatmentModel – treatment model to run

setTreatmentResidualCol(value)[source]
Parameters

treatmentResidualCol – Treatment Residual Column

setWeightCol(value)[source]
Parameters

weightCol – The name of the weight column

treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
treatmentResidualCol = Param(parent='undefined', name='treatmentResidualCol', doc='Treatment Residual Column')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')

synapse.ml.causal.OrthoForestDMLModel module

class synapse.ml.causal.OrthoForestDMLModel.OrthoForestDMLModel(java_obj=None, confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, forest=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • confidenceLevel (float) – confidence level, default value is 0.975

  • confounderVecCol (str) – Confounders to control for

  • featuresCol (str) – The name of the features column

  • forest (object) – Forest Trees produced in Ortho Forest DML Estimator

  • heterogeneityVecCol (str) – Vector to divide the treatment by

  • maxDepth (int) – Max Depth of Tree

  • maxIter (int) – maximum number of iterations (>= 0)

  • minSamplesLeaf (int) – Minimum number of samples per leaf

  • numTrees (int) – Number of trees

  • outcomeCol (str) – outcome column

  • outcomeModel (object) – outcome model to run

  • outcomeResidualCol (str) – Outcome Residual Column

  • outputCol (str) – The name of the output column

  • outputHighCol (str) – Output Confidence Interval High

  • outputLowCol (str) – Output Confidence Interval Low

  • parallelism (int) – the number of threads to use when running parallel algorithms

  • sampleSplitRatio (list) – Sample split ratio for cross-fitting. Default: [0.5, 0.5].

  • treatmentCol (str) – treatment column

  • treatmentModel (object) – treatment model to run

  • treatmentResidualCol (str) – Treatment Residual Column

  • weightCol (str) – The name of the weight column

confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
confounderVecCol = Param(parent='undefined', name='confounderVecCol', doc='Confounders to control for')
featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
forest = Param(parent='undefined', name='forest', doc='Forest Trees produced in Ortho Forest DML Estimator')
getConfidenceLevel()[source]
Returns

confidence level, default value is 0.975

Return type

confidenceLevel

getConfounderVecCol()[source]
Returns

Confounders to control for

Return type

confounderVecCol

getFeaturesCol()[source]
Returns

The name of the features column

Return type

featuresCol

getForest()[source]
Returns

Forest Trees produced in Ortho Forest DML Estimator

Return type

forest

getHeterogeneityVecCol()[source]
Returns

Vector to divide the treatment by

Return type

heterogeneityVecCol

static getJavaPackage()[source]

Returns package name String.

getMaxDepth()[source]
Returns

Max Depth of Tree

Return type

maxDepth

getMaxIter()[source]
Returns

maximum number of iterations (>= 0)

Return type

maxIter

getMinSamplesLeaf()[source]
Returns

Minimum number of samples per leaf

Return type

minSamplesLeaf

getNumTrees()[source]
Returns

Number of trees

Return type

numTrees

getOutcomeCol()[source]
Returns

outcome column

Return type

outcomeCol

getOutcomeModel()[source]
Returns

outcome model to run

Return type

outcomeModel

getOutcomeResidualCol()[source]
Returns

Outcome Residual Column

Return type

outcomeResidualCol

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

getOutputHighCol()[source]
Returns

Output Confidence Interval High

Return type

outputHighCol

getOutputLowCol()[source]
Returns

Output Confidence Interval Low

Return type

outputLowCol

getParallelism()[source]
Returns

the number of threads to use when running parallel algorithms

Return type

parallelism

getSampleSplitRatio()[source]
Returns

Sample split ratio for cross-fitting. Default: [0.5, 0.5].

Return type

sampleSplitRatio

getTreatmentCol()[source]
Returns

treatment column

Return type

treatmentCol

getTreatmentModel()[source]
Returns

treatment model to run

Return type

treatmentModel

getTreatmentResidualCol()[source]
Returns

Treatment Residual Column

Return type

treatmentResidualCol

getWeightCol()[source]
Returns

The name of the weight column

Return type

weightCol

heterogeneityVecCol = Param(parent='undefined', name='heterogeneityVecCol', doc='Vector to divide the treatment by')
maxDepth = Param(parent='undefined', name='maxDepth', doc='Max Depth of Tree')
maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
minSamplesLeaf = Param(parent='undefined', name='minSamplesLeaf', doc='Max Depth of Tree')
numTrees = Param(parent='undefined', name='numTrees', doc='Number of trees')
outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
outcomeResidualCol = Param(parent='undefined', name='outcomeResidualCol', doc='Outcome Residual Column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
outputHighCol = Param(parent='undefined', name='outputHighCol', doc='Output Confidence Interval Low')
outputLowCol = Param(parent='undefined', name='outputLowCol', doc='Output Confidence Interval Low')
parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
classmethod read()[source]

Returns an MLReader instance for this class.

sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
setConfidenceLevel(value)[source]
Parameters

confidenceLevel – confidence level, default value is 0.975

setConfounderVecCol(value)[source]
Parameters

confounderVecCol – Confounders to control for

setFeaturesCol(value)[source]
Parameters

featuresCol – The name of the features column

setForest(value)[source]
Parameters

forest – Forest Trees produced in Ortho Forest DML Estimator

setHeterogeneityVecCol(value)[source]
Parameters

heterogeneityVecCol – Vector to divide the treatment by

setMaxDepth(value)[source]
Parameters

maxDepth – Max Depth of Tree

setMaxIter(value)[source]
Parameters

maxIter – maximum number of iterations (>= 0)

setMinSamplesLeaf(value)[source]
Parameters

minSamplesLeaf – Minimum number of samples per leaf

setNumTrees(value)[source]
Parameters

numTrees – Number of trees

setOutcomeCol(value)[source]
Parameters

outcomeCol – outcome column

setOutcomeModel(value)[source]
Parameters

outcomeModel – outcome model to run

setOutcomeResidualCol(value)[source]
Parameters

outcomeResidualCol – Outcome Residual Column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setOutputHighCol(value)[source]
Parameters

outputHighCol – Output Confidence Interval High

setOutputLowCol(value)[source]
Parameters

outputLowCol – Output Confidence Interval Low

setParallelism(value)[source]
Parameters

parallelism – the number of threads to use when running parallel algorithms

setParams(confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, forest=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]

Set the (keyword only) parameters

setSampleSplitRatio(value)[source]
Parameters

sampleSplitRatio – Sample split ratio for cross-fitting. Default: [0.5, 0.5].

setTreatmentCol(value)[source]
Parameters

treatmentCol – treatment column

setTreatmentModel(value)[source]
Parameters

treatmentModel – treatment model to run

setTreatmentResidualCol(value)[source]
Parameters

treatmentResidualCol – Treatment Residual Column

setWeightCol(value)[source]
Parameters

weightCol – The name of the weight column

treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
treatmentResidualCol = Param(parent='undefined', name='treatmentResidualCol', doc='Treatment Residual Column')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')

synapse.ml.causal.OrthoForestVariableTransformer module

class synapse.ml.causal.OrthoForestVariableTransformer.OrthoForestVariableTransformer(java_obj=None, outcomeResidualCol='OResid', outputCol='_tmp_tsOutcome', treatmentResidualCol='TResid', weightsCol='_tmp_twOutcome')[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • outcomeResidualCol (str) – Outcome Residual Col

  • outputCol (str) – The name of the output column

  • treatmentResidualCol (str) – Treatment Residual Col

  • weightsCol (str) – Weights Col
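
The transformer takes an outcome residual and a treatment residual and emits an output column plus a weights column. A common reformulation in residual-on-residual regression rewrites the least-squares problem as a weighted mean, with the ratio of residuals as the target and the squared treatment residual as the weight. The sketch below demonstrates that identity in plain Python. Whether this class computes exactly these two quantities is an assumption; transform and the sample rows are hypothetical.

```python
def transform(outcome_residual, treatment_residual):
    """Return (transformed outcome, weight) for one row.

    Assumes treatment_residual is non-zero; a zero value would divide by zero.
    """
    return outcome_residual / treatment_residual, treatment_residual ** 2


# Hypothetical (outcome residual, treatment residual) pairs.
rows = [(2.0, 1.0), (4.1, 2.0), (-1.9, -1.0)]
pairs = [transform(o, t) for o, t in rows]

# Weighted mean of the transformed outcome.
theta_weighted = sum(v * w for v, w in pairs) / sum(w for _, w in pairs)

# Direct no-intercept least squares of outcome residual on treatment residual.
theta_direct = sum(o * t for o, t in rows) / sum(t * t for _, t in rows)
```

Both expressions give the same estimate, which is why the transformed target and weights can be handed to any learner that supports weighted averaging.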

static getJavaPackage()[source]

Returns package name String.

getOutcomeResidualCol()[source]
Returns

Outcome Residual Col

Return type

outcomeResidualCol

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

getTreatmentResidualCol()[source]
Returns

Treatment Residual Col

Return type

treatmentResidualCol

getWeightsCol()[source]
Returns

Weights Col

Return type

weightsCol

outcomeResidualCol = Param(parent='undefined', name='outcomeResidualCol', doc='Outcome Residual Col')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setOutcomeResidualCol(value)[source]
Parameters

outcomeResidualCol – Outcome Residual Col

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(outcomeResidualCol='OResid', outputCol='_tmp_tsOutcome', treatmentResidualCol='TResid', weightsCol='_tmp_twOutcome')[source]

Set the (keyword only) parameters

setTreatmentResidualCol(value)[source]
Parameters

treatmentResidualCol – Treatment Residual Col

setWeightsCol(value)[source]
Parameters

weightsCol – Weights Col

treatmentResidualCol = Param(parent='undefined', name='treatmentResidualCol', doc='Treatment Residual Col')
weightsCol = Param(parent='undefined', name='weightsCol', doc='Weights Col')

synapse.ml.causal.ResidualTransformer module

class synapse.ml.causal.ResidualTransformer.ResidualTransformer(java_obj=None, classIndex=1, observedCol='label', outputCol='residual', predictedCol='prediction')[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • classIndex (int) – The index of the class to compute residual for classification outputs. Default value is 1.

  • observedCol (str) – observed data (label column)

  • outputCol (str) – The name of the output column

  • predictedCol (str) – predicted data (prediction or probability columns)
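
Read together, the parameters above describe a residual as the observed value minus the predicted value, with classIndex selecting one entry when the prediction is a vector of class probabilities. The plain-Python sketch below illustrates that reading. It is an interpretation of the parameter descriptions, not the library's code; the residual function and its inputs are hypothetical.

```python
def residual(observed, predicted, class_index=1):
    """Observed minus predicted.

    If predicted is a probability vector (list or tuple), the entry at
    class_index is used as the predicted value.
    """
    if isinstance(predicted, (list, tuple)):
        predicted = predicted[class_index]
    return float(observed) - float(predicted)


# Regression-style output: prediction is a scalar.
r_regression = residual(3.5, 3.0)

# Classification-style output: prediction is [P(class 0), P(class 1)].
r_classification = residual(1, [0.3, 0.7])
```

For a binary treatment, using the probability of class 1 gives the difference between the observed label and the predicted propensity.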

classIndex = Param(parent='undefined', name='classIndex', doc='The index of the class to compute residual for classification outputs. Default value is 1.')
getClassIndex()[source]
Returns

The index of the class to compute residual for classification outputs. Default value is 1.

Return type

classIndex

static getJavaPackage()[source]

Returns package name String.

getObservedCol()[source]
Returns

observed data (label column)

Return type

observedCol

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

getPredictedCol()[source]
Returns

predicted data (prediction or probability columns)

Return type

predictedCol

observedCol = Param(parent='undefined', name='observedCol', doc='observed data (label column)')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
predictedCol = Param(parent='undefined', name='predictedCol', doc='predicted data (prediction or probability columns')
classmethod read()[source]

Returns an MLReader instance for this class.

setClassIndex(value)[source]
Parameters

classIndex – The index of the class to compute residual for classification outputs. Default value is 1.

setObservedCol(value)[source]
Parameters

observedCol – observed data (label column)

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(classIndex=1, observedCol='label', outputCol='residual', predictedCol='prediction')[source]

Set the (keyword only) parameters

setPredictedCol(value)[source]
Parameters

predictedCol – predicted data (prediction or probability columns)

synapse.ml.causal.SyntheticControlEstimator module

class synapse.ml.causal.SyntheticControlEstimator.SyntheticControlEstimator(java_obj=None, epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • epsilon (float) – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.

  • handleMissingOutcome (str) – How to handle missing outcomes. Options are skip (filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (fill in each missing outcome with the nearest available outcome, or with the mean of the two nearest outcomes when both are available).

  • localSolverThreshold (long) – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.

  • maxIter (int) – maximum number of iterations (>= 0)

  • numIterNoChange (int) – Terminate early once this number of iterations has passed without change.

  • outcomeCol (str) – outcome column

  • postTreatmentCol (str) – post treatment indicator column

  • stepSize (float) – Step size to be used for each iteration of optimization (> 0)

  • timeCol (str) – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.

  • tol (float) – the convergence tolerance for iterative algorithms (>= 0)

  • treatmentCol (str) – treatment column

  • unitCol (str) – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
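
The three handleMissingOutcome options can be pictured on a single unit's time-ordered outcomes. The plain-Python sketch below is one reading of the option descriptions, not the library's implementation. In particular, treating "nearest available outcomes" as the closest earlier and later observations in time is an assumption, and handle_missing is a hypothetical helper.

```python
def handle_missing(series, mode="zero"):
    """Clean one unit's time-ordered outcomes, where None marks a missing value.

    Returns the cleaned list, or None when the unit is dropped (mode "skip").
    """
    if mode == "skip":
        # Drop the whole unit if any outcome is missing.
        return None if any(v is None for v in series) else list(series)
    if mode == "zero":
        return [0.0 if v is None else v for v in series]
    if mode == "impute":
        cleaned = []
        for i, value in enumerate(series):
            if value is not None:
                cleaned.append(value)
                continue
            # Closest observed value before and after position i, if any.
            before = next(
                (series[j] for j in range(i - 1, -1, -1) if series[j] is not None),
                None,
            )
            after = next(
                (series[j] for j in range(i + 1, len(series)) if series[j] is not None),
                None,
            )
            neighbours = [n for n in (before, after) if n is not None]
            if not neighbours:
                raise ValueError("cannot impute a unit with no observed outcomes")
            # One neighbour: copy it. Two neighbours: take their mean.
            cleaned.append(sum(neighbours) / len(neighbours))
        return cleaned
    raise ValueError(f"unknown mode: {mode}")


series = [1.0, None, 3.0, None]
skipped = handle_missing(series, "skip")
zeroed = handle_missing(series, "zero")
imputed = handle_missing(series, "impute")
```

Here the gap between 1.0 and 3.0 is filled with their mean, while the trailing gap has only an earlier neighbour and copies it.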

epsilon = Param(parent='undefined', name='epsilon', doc='This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.')
getEpsilon()[source]
Returns

This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.

Return type

epsilon

getHandleMissingOutcome()[source]
Returns

How to handle missing outcomes. Options are skip (filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (fill in each missing outcome with the nearest available outcome, or with the mean of the two nearest outcomes when both are available).

Return type

handleMissingOutcome

static getJavaPackage()[source]

Returns package name String.

getLocalSolverThreshold()[source]
Returns

threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.

Return type

localSolverThreshold

getMaxIter()[source]
Returns

maximum number of iterations (>= 0)

Return type

maxIter

getNumIterNoChange()[source]
Returns

Terminate early once this number of iterations has passed without change.

Return type

numIterNoChange

getOutcomeCol()[source]
Returns

outcome column

Return type

outcomeCol

getPostTreatmentCol()[source]
Returns

post treatment indicator column

Return type

postTreatmentCol

getStepSize()[source]
Returns

Step size to be used for each iteration of optimization (> 0)

Return type

stepSize

getTimeCol()[source]
Returns

Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.

Return type

timeCol

getTol()[source]
Returns

the convergence tolerance for iterative algorithms (>= 0)

Return type

tol

getTreatmentCol()[source]
Returns

treatment column

Return type

treatmentCol

getUnitCol()[source]
Returns

Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.

Return type

unitCol

handleMissingOutcome = Param(parent='undefined', name='handleMissingOutcome', doc='How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)')
localSolverThreshold = Param(parent='undefined', name='localSolverThreshold', doc='threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.')
maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
numIterNoChange = Param(parent='undefined', name='numIterNoChange', doc='Early termination when number of iterations without change reached.')
outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
postTreatmentCol = Param(parent='undefined', name='postTreatmentCol', doc='post treatment indicator column')
classmethod read()[source]

Returns an MLReader instance for this class.

setEpsilon(value)[source]
Parameters

epsilon – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.

setHandleMissingOutcome(value)[source]
Parameters

handleMissingOutcome – How to handle missing outcomes. Options are skip (filter out units with missing outcomes), zero (fill missing outcomes with zero), or impute (fill each missing outcome with the nearest available outcome, or with the mean of the two nearest outcomes when both are available).
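The three options can be illustrated on a single unit's time-ordered outcomes. This is a plain-Python sketch of one reading of the parameter description, not the library's implementation; the function name handle_missing and the use of None for a missing value are assumptions made for illustration.

```python
def handle_missing(outcomes, mode="zero"):
    """Apply one missing-outcome strategy to a single unit's outcomes.

    `outcomes` is time-ordered and uses None for a missing value.
    Returns the cleaned list, or None when the unit is dropped.
    """
    if mode == "skip":
        # Drop the whole unit if any outcome is missing.
        return None if any(v is None for v in outcomes) else list(outcomes)
    if mode == "zero":
        # Replace every missing outcome with zero.
        return [0.0 if v is None else v for v in outcomes]
    if mode == "impute":
        known = [i for i, v in enumerate(outcomes) if v is not None]
        if not known:
            return None  # assumption: nothing to impute from, so drop the unit
        filled = []
        for i, v in enumerate(outcomes):
            if v is not None:
                filled.append(v)
                continue
            before = [k for k in known if k < i]
            after = [k for k in known if k > i]
            neighbours = []
            if before:
                neighbours.append(outcomes[before[-1]])
            if after:
                neighbours.append(outcomes[after[0]])
            # Mean of the two nearest outcomes, or the single one available.
            filled.append(sum(neighbours) / len(neighbours))
        return filled
    raise ValueError("mode must be 'skip', 'zero', or 'impute'")


series = [1.0, None, 3.0]
zero_filled = handle_missing(series, "zero")
imputed = handle_missing(series, "impute")
skipped = handle_missing(series, "skip")
```

Here the gap sits between two observed outcomes, so the impute option fills it with their mean, while a gap at the start or end of the series would take the single nearest value.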

setLocalSolverThreshold(value)[source]
Parameters

localSolverThreshold – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.

setMaxIter(value)[source]
Parameters

maxIter – maximum number of iterations (>= 0)

setNumIterNoChange(value)[source]
Parameters

numIterNoChange – Early termination when number of iterations without change reached.

setOutcomeCol(value)[source]
Parameters

outcomeCol – outcome column

setParams(epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None)[source]

Set the (keyword only) parameters

setPostTreatmentCol(value)[source]
Parameters

postTreatmentCol – post treatment indicator column

setStepSize(value)[source]
Parameters

stepSize – Step size to be used for each iteration of optimization (> 0)

setTimeCol(value)[source]
Parameters

timeCol – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.

setTol(value)[source]
Parameters

tol – the convergence tolerance for iterative algorithms (>= 0)
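How stepSize, maxIter, tol, and numIterNoChange typically interact can be sketched with a generic gradient-descent loop. The sketch below is illustrative only: this reference does not describe the solver the library uses, and the stopping rule (treating an iteration as "no change" when the loss improves by no more than tol) is an assumption.

```python
def gradient_descent(loss, grad, x0, step_size=1.0, max_iter=100,
                     tol=1e-3, num_iter_no_change=None):
    """Minimise a 1-D function; returns (x, iterations_used)."""
    # Assumption: with no explicit limit, stop at the first stalled iteration.
    limit = 1 if num_iter_no_change is None else num_iter_no_change
    x = x0
    best = loss(x)
    stale = 0
    for i in range(1, max_iter + 1):
        x = x - step_size * grad(x)      # one update scaled by the step size
        current = loss(x)
        if best - current > tol:         # meaningful improvement
            best = current
            stale = 0
        else:                            # counted as an iteration without change
            stale += 1
        if stale >= limit:
            return x, i                  # early termination
    return x, max_iter                   # iteration budget exhausted


def quadratic(x):
    return (x - 3.0) ** 2


def quadratic_grad(x):
    return 2.0 * (x - 3.0)


x_opt, iters = gradient_descent(quadratic, quadratic_grad, x0=0.0, step_size=0.25)
x_short, iters_short = gradient_descent(quadratic, quadratic_grad, x0=0.0,
                                        step_size=0.25, max_iter=3)
```

A smaller tol or a larger numIterNoChange lets the loop run longer before stopping early, while maxIter caps the total work regardless of convergence.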

setTreatmentCol(value)[source]
Parameters

treatmentCol – treatment column

setUnitCol(value)[source]
Parameters

unitCol – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
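The column parameters together describe long-format panel data: one row per unit and time period. The sketch below builds such rows in plain Python to show the shape. The column names (UserId, Date, outcome, treatment, postTreatment), the helper build_panel, and the convention that treatment marks treated units while postTreatment marks post-treatment periods are assumptions for illustration.

```python
def build_panel(units, treated_units, times, first_post_time, outcome_of):
    """Return one row (dict) per (unit, time) pair in long format."""
    rows = []
    for unit in units:
        for time in times:
            rows.append({
                "UserId": unit,                          # would be passed as unitCol
                "Date": time,                            # would be passed as timeCol
                "outcome": outcome_of(unit, time),       # would be passed as outcomeCol
                "treatment": 1 if unit in treated_units else 0,
                "postTreatment": 1 if time >= first_post_time else 0,
            })
    return rows


panel = build_panel(
    units=["u1", "u2", "u3"],
    treated_units={"u3"},
    times=[1, 2, 3, 4],
    first_post_time=3,
    outcome_of=lambda unit, time: float(time),  # placeholder outcome values
)
```

With three units and four periods this yields twelve rows, which could then be turned into a Spark DataFrame before being handed to an estimator.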

stepSize = Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization (> 0)')
timeCol = Param(parent='undefined', name='timeCol', doc='Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.')
tol = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)')
treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
unitCol = Param(parent='undefined', name='unitCol', doc='Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.')

synapse.ml.causal.SyntheticDiffInDiffEstimator module

class synapse.ml.causal.SyntheticDiffInDiffEstimator.SyntheticDiffInDiffEstimator(java_obj=None, epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None, zeta=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • epsilon (float) – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.

  • handleMissingOutcome (str) – How to handle missing outcomes. Options are skip (filter out units with missing outcomes), zero (fill missing outcomes with zero), or impute (fill each missing outcome with the nearest available outcome, or with the mean of the two nearest outcomes when both are available).

  • localSolverThreshold (long) – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.

  • maxIter (int) – maximum number of iterations (>= 0)

  • numIterNoChange (int) – Early termination when number of iterations without change reached.

  • outcomeCol (str) – outcome column

  • postTreatmentCol (str) – post treatment indicator column

  • stepSize (float) – Step size to be used for each iteration of optimization (> 0)

  • timeCol (str) – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.

  • tol (float) – the convergence tolerance for iterative algorithms (>= 0)

  • treatmentCol (str) – treatment column

  • unitCol (str) – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.

  • zeta (float) – The zeta value for the regularization term when fitting unit weights. If not specified, a default value is computed from formula (2.2) in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large-scale data, one may want to tune zeta to minimize the loss of the unit-weights regression.
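A minimal construction sketch based on the constructor signature above. The column names are hypothetical, and the import is deferred inside a helper (build_estimator, also hypothetical) so that the parameter dictionary can be inspected without SynapseML installed. Fitting is not shown because it needs a running Spark session with the SynapseML package available.

```python
# Hypothetical column names; replace them with the columns of your own panel data.
sdid_params = {
    "outcomeCol": "outcome",
    "treatmentCol": "treatment",
    "postTreatmentCol": "postTreatment",
    "unitCol": "UserId",
    "timeCol": "Date",
    "handleMissingOutcome": "impute",  # documented options: skip, zero, impute
    "maxIter": 200,
    "tol": 1e-3,
}


def build_estimator(params):
    """Create the estimator from keyword parameters (requires SynapseML)."""
    # Imported lazily; the module path follows the class path documented above.
    from synapse.ml.causal.SyntheticDiffInDiffEstimator import (
        SyntheticDiffInDiffEstimator,
    )
    return SyntheticDiffInDiffEstimator(**params)
```

Parameters left out of the dictionary, such as zeta and epsilon, keep the defaults listed in the signature.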

epsilon = Param(parent='undefined', name='epsilon', doc='This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.')
getEpsilon()[source]
Returns

This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.

Return type

epsilon

getHandleMissingOutcome()[source]
Returns

How to handle missing outcomes. Options are skip (filter out units with missing outcomes), zero (fill missing outcomes with zero), or impute (fill each missing outcome with the nearest available outcome, or with the mean of the two nearest outcomes when both are available).

Return type

handleMissingOutcome

static getJavaPackage()[source]

Returns package name String.

getLocalSolverThreshold()[source]
Returns

threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.

Return type

localSolverThreshold

getMaxIter()[source]
Returns

maximum number of iterations (>= 0)

Return type

maxIter

getNumIterNoChange()[source]
Returns

Early termination when number of iterations without change reached.

Return type

numIterNoChange

getOutcomeCol()[source]
Returns

outcome column

Return type

outcomeCol

getPostTreatmentCol()[source]
Returns

post treatment indicator column

Return type

postTreatmentCol

getStepSize()[source]
Returns

Step size to be used for each iteration of optimization (> 0)

Return type

stepSize

getTimeCol()[source]
Returns

Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.

Return type

timeCol

getTol()[source]
Returns

the convergence tolerance for iterative algorithms (>= 0)

Return type

tol

getTreatmentCol()[source]
Returns

treatment column

Return type

treatmentCol

getUnitCol()[source]
Returns

Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.

Return type

unitCol

getZeta()[source]
Returns

The zeta value for the regularization term when fitting unit weights. If not specified, a default value is computed from formula (2.2) in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large-scale data, one may want to tune zeta to minimize the loss of the unit-weights regression.

Return type

zeta

handleMissingOutcome = Param(parent='undefined', name='handleMissingOutcome', doc='How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)')
localSolverThreshold = Param(parent='undefined', name='localSolverThreshold', doc='threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.')
maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
numIterNoChange = Param(parent='undefined', name='numIterNoChange', doc='Early termination when number of iterations without change reached.')
outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
postTreatmentCol = Param(parent='undefined', name='postTreatmentCol', doc='post treatment indicator column')
classmethod read()[source]

Returns an MLReader instance for this class.

setEpsilon(value)[source]
Parameters

epsilon – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.

setHandleMissingOutcome(value)[source]
Parameters

handleMissingOutcome – How to handle missing outcomes. Options are skip (filter out units with missing outcomes), zero (fill missing outcomes with zero), or impute (fill each missing outcome with the nearest available outcome, or with the mean of the two nearest outcomes when both are available).

setLocalSolverThreshold(value)[source]
Parameters

localSolverThreshold – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.

setMaxIter(value)[source]
Parameters

maxIter – maximum number of iterations (>= 0)

setNumIterNoChange(value)[source]
Parameters

numIterNoChange – Early termination when number of iterations without change reached.

setOutcomeCol(value)[source]
Parameters

outcomeCol – outcome column

setParams(epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None, zeta=None)[source]

Set the (keyword only) parameters

setPostTreatmentCol(value)[source]
Parameters

postTreatmentCol – post treatment indicator column

setStepSize(value)[source]
Parameters

stepSize – Step size to be used for each iteration of optimization (> 0)

setTimeCol(value)[source]
Parameters

timeCol – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.

setTol(value)[source]
Parameters

tol – the convergence tolerance for iterative algorithms (>= 0)

setTreatmentCol(value)[source]
Parameters

treatmentCol – treatment column

setUnitCol(value)[source]
Parameters

unitCol – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.

setZeta(value)[source]
Parameters

zeta – The zeta value for the regularization term when fitting unit weights. If not specified, a default value is computed from formula (2.2) in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large-scale data, one may want to tune zeta to minimize the loss of the unit-weights regression.
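The default can be sketched in plain Python under the assumption that the referenced formula has the form given in the synthetic difference-in-differences paper: zeta = (N_tr * T_post)^(1/4) * sigma, where sigma is the standard deviation of first differences of the control units' pre-treatment outcomes. The function name default_zeta and its arguments are hypothetical, and the formula should be checked against the linked paper before relying on it.

```python
import math


def default_zeta(control_pre, n_treated, n_post):
    """Regularisation default from control units' pre-treatment outcomes.

    control_pre: one time-ordered list of pre-treatment outcomes per control unit.
    n_treated:   number of treated units (N_tr).
    n_post:      number of post-treatment periods (T_post).
    """
    # First differences Y[i][t+1] - Y[i][t] for every control unit.
    diffs = [row[t + 1] - row[t]
             for row in control_pre
             for t in range(len(row) - 1)]
    mean_diff = sum(diffs) / len(diffs)
    variance = sum((d - mean_diff) ** 2 for d in diffs) / len(diffs)
    return (n_treated * n_post) ** 0.25 * math.sqrt(variance)


zeta = default_zeta(
    control_pre=[[1.0, 2.0, 4.0], [0.0, 1.0, 1.0]],
    n_treated=1,
    n_post=16,
)
```

Because the default scales with the noise in the control outcomes, a value tuned by hand is usually searched for around this magnitude rather than chosen from scratch.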

stepSize = Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization (> 0)')
timeCol = Param(parent='undefined', name='timeCol', doc='Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.')
tol = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)')
treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
unitCol = Param(parent='undefined', name='unitCol', doc='Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.')
zeta = Param(parent='undefined', name='zeta', doc='The zeta value for regularization term when fitting unit weights. If not specified, a default value will be computed based on formula (2.2) specified in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large scale data, one may want to tune the zeta value, minimizing the loss of the unit weights regression.')

Module contents

SynapseML is an ecosystem of tools that extends the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond latency web services, backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.