synapse.ml.causal package
Submodules
synapse.ml.causal.DiffInDiffEstimator module
- class synapse.ml.causal.DiffInDiffEstimator.DiffInDiffEstimator(java_obj=None, outcomeCol=None, postTreatmentCol=None, treatmentCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- postTreatmentCol = Param(parent='undefined', name='postTreatmentCol', doc='post treatment indicator column')
- setParams(outcomeCol=None, postTreatmentCol=None, treatmentCol=None)[source]
Set the (keyword only) parameters
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
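The three columns configured above (outcome, treatment indicator, post-treatment indicator) are the inputs of a classic two-group, two-period difference-in-differences comparison. The following is a minimal pure-Python sketch of that textbook calculation, not SynapseML's implementation; the toy rows, the dictionary keys and the helper names cell_mean and diff_in_diff are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical toy data: one dict per observation, with keys standing in for
# the columns named by outcomeCol, treatmentCol and postTreatmentCol.
rows = [
    {"outcome": 10.0, "treatment": 0, "postTreatment": 0},
    {"outcome": 12.0, "treatment": 0, "postTreatment": 0},
    {"outcome": 13.0, "treatment": 0, "postTreatment": 1},
    {"outcome": 15.0, "treatment": 0, "postTreatment": 1},
    {"outcome": 20.0, "treatment": 1, "postTreatment": 0},
    {"outcome": 22.0, "treatment": 1, "postTreatment": 0},
    {"outcome": 28.0, "treatment": 1, "postTreatment": 1},
    {"outcome": 30.0, "treatment": 1, "postTreatment": 1},
]


def cell_mean(rows, treatment, post):
    """Mean outcome for one (treatment, post-treatment) cell."""
    return mean(
        r["outcome"]
        for r in rows
        if r["treatment"] == treatment and r["postTreatment"] == post
    )


def diff_in_diff(rows):
    """Change in the treated group minus change in the control group."""
    treated_change = cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0)
    control_change = cell_mean(rows, 0, 1) - cell_mean(rows, 0, 0)
    return treated_change - control_change


effect = diff_in_diff(rows)
print(effect)
```

In this toy data the control group moves from a mean of 11 to 14 and the treated group from 21 to 29, so the estimate is the difference between those two changes.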
synapse.ml.causal.DiffInDiffModel module
synapse.ml.causal.DoubleMLEstimator module
- class synapse.ml.causal.DoubleMLEstimator.DoubleMLEstimator(java_obj=None, confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
- getConfidenceLevel()[source]
- Returns
confidence level, default value is 0.975
- Return type
confidenceLevel
- getParallelism()[source]
- Returns
the number of threads to use when running parallel algorithms
- Return type
parallelism
- getSampleSplitRatio()[source]
- Returns
Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- Return type
sampleSplitRatio
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
- parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
- sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
- setConfidenceLevel(value)[source]
- Parameters
confidenceLevel – confidence level, default value is 0.975
- setParallelism(value)[source]
- Parameters
parallelism – the number of threads to use when running parallel algorithms
- setParams(confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]
Set the (keyword only) parameters
- setSampleSplitRatio(value)[source]
- Parameters
sampleSplitRatio – Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
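The sampleSplitRatio, treatmentModel and outcomeModel parameters point at the usual cross-fitting recipe for double machine learning: fit the two nuisance models on one part of the data, compute residuals on the other part, regress outcome residuals on treatment residuals, then swap the parts. The sketch below illustrates that general idea in pure Python with a one-feature least-squares line standing in for both models. It is an assumption-laden illustration, not SynapseML's code: the helper names, the contiguous (unshuffled) split and the toy data are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(model, xs, ys):
    """Observed value minus the fitted line's prediction."""
    intercept, slope = model
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def take(values, indices):
    return [values[i] for i in indices]


def cross_fit_effect(x, t, y, split_ratio=(0.5, 0.5)):
    """Average of the two residual-on-residual estimates from swapped folds."""
    n = len(x)
    # Contiguous split for reproducibility; a real pipeline would shuffle first.
    cut = int(n * split_ratio[0] / sum(split_ratio))
    folds = [list(range(cut)), list(range(cut, n))]
    estimates = []
    for train, test in (folds, folds[::-1]):
        treatment_model = fit_line(take(x, train), take(t, train))
        outcome_model = fit_line(take(x, train), take(y, train))
        rt = residuals(treatment_model, take(x, test), take(t, test))
        ry = residuals(outcome_model, take(x, test), take(y, test))
        estimates.append(
            sum(a * b for a, b in zip(rt, ry)) / sum(a * a for a in rt)
        )
    return sum(estimates) / len(estimates)


# Toy data: the treatment depends on x plus a deterministic disturbance,
# and the outcome is 3 * treatment + 5 * x, so the true effect is 3.
x = [float(i) for i in range(20)]
noise = [float((7 * i) % 5 - 2) for i in range(20)]
t = [2.0 * xi + ei for xi, ei in zip(x, noise)]
y = [3.0 * ti + 5.0 * xi for ti, xi in zip(t, x)]

theta = cross_fit_effect(x, t, y)
print(theta)
```

Because the toy outcome is an exact linear function of the treatment and the feature, the residual regression recovers the treatment coefficient up to floating-point error; with real data and flexible models the estimate would carry sampling noise.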
synapse.ml.causal.DoubleMLModel module
- class synapse.ml.causal.DoubleMLModel.DoubleMLModel(java_obj=None, confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, rawTreatmentEffects=None, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
synapse.ml.causal.OrthoForestDMLEstimator module
- class synapse.ml.causal.OrthoForestDMLEstimator.OrthoForestDMLEstimator(java_obj=None, confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
- confounderVecCol = Param(parent='undefined', name='confounderVecCol', doc='Confounders to control for')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
- getConfidenceLevel()[source]
- Returns
confidence level, default value is 0.975
- Return type
confidenceLevel
- getHeterogeneityVecCol()[source]
- Returns
Vector to divide the treatment by
- Return type
heterogeneityVecCol
- getParallelism()[source]
- Returns
the number of threads to use when running parallel algorithms
- Return type
parallelism
- getSampleSplitRatio()[source]
- Returns
Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- Return type
sampleSplitRatio
- getTreatmentResidualCol()[source]
- Returns
Treatment Residual Column
- Return type
treatmentResidualCol
- heterogeneityVecCol = Param(parent='undefined', name='heterogeneityVecCol', doc='Vector to divide the treatment by')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max Depth of Tree')
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- minSamplesLeaf = Param(parent='undefined', name='minSamplesLeaf', doc='Max Depth of Tree')
- numTrees = Param(parent='undefined', name='numTrees', doc='Number of trees')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
- outcomeResidualCol = Param(parent='undefined', name='outcomeResidualCol', doc='Outcome Residual Column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- outputHighCol = Param(parent='undefined', name='outputHighCol', doc='Output Confidence Interval Low')
- outputLowCol = Param(parent='undefined', name='outputLowCol', doc='Output Confidence Interval Low')
- parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
- sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
- setConfidenceLevel(value)[source]
- Parameters
confidenceLevel – confidence level, default value is 0.975
- setHeterogeneityVecCol(value)[source]
- Parameters
heterogeneityVecCol – Vector to divide the treatment by
- setParallelism(value)[source]
- Parameters
parallelism – the number of threads to use when running parallel algorithms
- setParams(confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]
Set the (keyword only) parameters
- setSampleSplitRatio(value)[source]
- Parameters
sampleSplitRatio – Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- setTreatmentResidualCol(value)[source]
- Parameters
treatmentResidualCol – Treatment Residual Column
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
- treatmentResidualCol = Param(parent='undefined', name='treatmentResidualCol', doc='Treatment Residual Column')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
synapse.ml.causal.OrthoForestDMLModel module
- class synapse.ml.causal.OrthoForestDMLModel.OrthoForestDMLModel(java_obj=None, confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, forest=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
confidenceLevel (float) – confidence level, default value is 0.975
forest (object) – Forest Trees produced in Ortho Forest DML Estimator
heterogeneityVecCol (str) – Vector to divide the treatment by
parallelism (int) – the number of threads to use when running parallel algorithms
sampleSplitRatio (list) – Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
- confounderVecCol = Param(parent='undefined', name='confounderVecCol', doc='Confounders to control for')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
- forest = Param(parent='undefined', name='forest', doc='Forest Trees produced in Ortho Forest DML Estimator')
- getConfidenceLevel()[source]
- Returns
confidence level, default value is 0.975
- Return type
confidenceLevel
- getHeterogeneityVecCol()[source]
- Returns
Vector to divide the treatment by
- Return type
heterogeneityVecCol
- getParallelism()[source]
- Returns
the number of threads to use when running parallel algorithms
- Return type
parallelism
- getSampleSplitRatio()[source]
- Returns
Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- Return type
sampleSplitRatio
- getTreatmentResidualCol()[source]
- Returns
Treatment Residual Column
- Return type
treatmentResidualCol
- heterogeneityVecCol = Param(parent='undefined', name='heterogeneityVecCol', doc='Vector to divide the treatment by')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max Depth of Tree')
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- minSamplesLeaf = Param(parent='undefined', name='minSamplesLeaf', doc='Max Depth of Tree')
- numTrees = Param(parent='undefined', name='numTrees', doc='Number of trees')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
- outcomeResidualCol = Param(parent='undefined', name='outcomeResidualCol', doc='Outcome Residual Column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- outputHighCol = Param(parent='undefined', name='outputHighCol', doc='Output Confidence Interval Low')
- outputLowCol = Param(parent='undefined', name='outputLowCol', doc='Output Confidence Interval Low')
- parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
- sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
- setConfidenceLevel(value)[source]
- Parameters
confidenceLevel – confidence level, default value is 0.975
- setHeterogeneityVecCol(value)[source]
- Parameters
heterogeneityVecCol – Vector to divide the treatment by
- setParallelism(value)[source]
- Parameters
parallelism – the number of threads to use when running parallel algorithms
- setParams(confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, forest=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]
Set the (keyword only) parameters
- setSampleSplitRatio(value)[source]
- Parameters
sampleSplitRatio – Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- setTreatmentResidualCol(value)[source]
- Parameters
treatmentResidualCol – Treatment Residual Column
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
- treatmentResidualCol = Param(parent='undefined', name='treatmentResidualCol', doc='Treatment Residual Column')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
synapse.ml.causal.OrthoForestVariableTransformer module
- class synapse.ml.causal.OrthoForestVariableTransformer.OrthoForestVariableTransformer(java_obj=None, outcomeResidualCol='OResid', outputCol='_tmp_tsOutcome', treatmentResidualCol='TResid', weightsCol='_tmp_twOutcome')[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- outcomeResidualCol = Param(parent='undefined', name='outcomeResidualCol', doc='Outcome Residual Col')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- setParams(outcomeResidualCol='OResid', outputCol='_tmp_tsOutcome', treatmentResidualCol='TResid', weightsCol='_tmp_twOutcome')[source]
Set the (keyword only) parameters
- treatmentResidualCol = Param(parent='undefined', name='treatmentResidualCol', doc='Treatment Residual Col')
- weightsCol = Param(parent='undefined', name='weightsCol', doc='Weights Col')
synapse.ml.causal.ResidualTransformer module
- class synapse.ml.causal.ResidualTransformer.ResidualTransformer(java_obj=None, classIndex=1, observedCol='label', outputCol='residual', predictedCol='prediction')[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- classIndex = Param(parent='undefined', name='classIndex', doc='The index of the class to compute residual for classification outputs. Default value is 1.')
- getClassIndex()[source]
- Returns
The index of the class to compute residual for classification outputs. Default value is 1.
- Return type
classIndex
- getPredictedCol()[source]
- Returns
predicted data (prediction or probability columns)
- Return type
predictedCol
- observedCol = Param(parent='undefined', name='observedCol', doc='observed data (label column)')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- predictedCol = Param(parent='undefined', name='predictedCol', doc='predicted data (prediction or probability columns')
- setClassIndex(value)[source]
- Parameters
classIndex – The index of the class to compute residual for classification outputs. Default value is 1.
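The parameters above describe a residual computed from an observed value and either a scalar prediction or a vector of class probabilities, with classIndex selecting which probability to use. A minimal sketch of that arithmetic in plain Python follows. The function name residual and its handling of input types are assumptions for illustration; the transformer itself operates on Spark DataFrame columns.

```python
from typing import Sequence, Union


def residual(
    observed: float,
    predicted: Union[float, Sequence[float]],
    class_index: int = 1,
) -> float:
    """Observed value minus the prediction.

    A scalar `predicted` is treated as a regression prediction. A sequence is
    treated as class probabilities, and the entry at `class_index` is used.
    """
    if isinstance(predicted, (int, float)):
        return float(observed) - float(predicted)
    return float(observed) - float(predicted[class_index])


# Regression-style output: label 3.5, prediction 3.0.
print(residual(3.5, 3.0))

# Classification-style output: label 1, probabilities for classes 0 and 1.
print(residual(1, [0.25, 0.75]))
```

With the default class_index of 1, the second call subtracts the probability of class 1 from the observed label.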
synapse.ml.causal.SyntheticControlEstimator module
- class synapse.ml.causal.SyntheticControlEstimator.SyntheticControlEstimator(java_obj=None, epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
epsilon (float) – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
handleMissingOutcome (str) – How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
localSolverThreshold (long) – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
numIterNoChange (int) – Early termination when number of iterations without change reached.
stepSize (float) – Step size to be used for each iteration of optimization (> 0)
timeCol (str) – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
tol (float) – the convergence tolerance for iterative algorithms (>= 0)
unitCol (str) – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- epsilon = Param(parent='undefined', name='epsilon', doc='This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.')
- getEpsilon()[source]
- Returns
This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
- Return type
epsilon
- getHandleMissingOutcome()[source]
- Returns
How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
- Return type
handleMissingOutcome
- getLocalSolverThreshold()[source]
- Returns
threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
- Return type
localSolverThreshold
- getNumIterNoChange()[source]
- Returns
Early termination when number of iterations without change reached.
- Return type
numIterNoChange
- getStepSize()[source]
- Returns
Step size to be used for each iteration of optimization (> 0)
- Return type
stepSize
- getTimeCol()[source]
- Returns
Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
- Return type
timeCol
- getUnitCol()[source]
- Returns
Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- Return type
unitCol
- handleMissingOutcome = Param(parent='undefined', name='handleMissingOutcome', doc='How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)')
- localSolverThreshold = Param(parent='undefined', name='localSolverThreshold', doc='threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.')
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- numIterNoChange = Param(parent='undefined', name='numIterNoChange', doc='Early termination when number of iterations without change reached.')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- postTreatmentCol = Param(parent='undefined', name='postTreatmentCol', doc='post treatment indicator column')
- setEpsilon(value)[source]
- Parameters
epsilon – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
- setHandleMissingOutcome(value)[source]
- Parameters
handleMissingOutcome – How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
- setLocalSolverThreshold(value)[source]
- Parameters
localSolverThreshold – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
- setNumIterNoChange(value)[source]
- Parameters
numIterNoChange – Early termination when number of iterations without change reached.
- setParams(epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None)[source]
Set the (keyword only) parameters
- setStepSize(value)[source]
- Parameters
stepSize – Step size to be used for each iteration of optimization (> 0)
- setTimeCol(value)[source]
- Parameters
timeCol – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
- setUnitCol(value)[source]
- Parameters
unitCol – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- stepSize = Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization (> 0)')
- timeCol = Param(parent='undefined', name='timeCol', doc='Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.')
- tol = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)')
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- unitCol = Param(parent='undefined', name='unitCol', doc='Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.')
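The three handleMissingOutcome options (skip, zero, impute) can be pictured on a small panel held in memory. The sketch below is one reading of the parameter description, written in plain Python. The panel layout (unit id mapped to a time-ordered list of outcomes, with None marking a missing value), the function names, and the interpretation of "nearest" as the closest earlier and closest later observations are assumptions, not a description of SynapseML's internals.

```python
from statistics import mean
from typing import Dict, List, Optional

# Hypothetical in-memory panel: unit id -> outcomes ordered by time.
Panel = Dict[str, List[Optional[float]]]


def _impute(series: List[Optional[float]], i: int) -> float:
    """Fill position i from the nearest earlier and later observed values."""
    if series[i] is not None:
        return series[i]
    before = next((v for v in reversed(series[:i]) if v is not None), None)
    after = next((v for v in series[i + 1:] if v is not None), None)
    neighbours = [v for v in (before, after) if v is not None]
    if not neighbours:
        raise ValueError("unit has no observed outcomes to impute from")
    # One neighbour: use it as is. Two neighbours: use their mean.
    return mean(neighbours)


def handle_missing(panel: Panel, strategy: str = "zero") -> Dict[str, List[float]]:
    if strategy == "skip":
        # Drop every unit that has at least one missing outcome.
        return {u: list(s) for u, s in panel.items() if None not in s}
    if strategy == "zero":
        return {u: [0.0 if v is None else v for v in s] for u, s in panel.items()}
    if strategy == "impute":
        return {u: [_impute(s, i) for i in range(len(s))] for u, s in panel.items()}
    raise ValueError(f"unknown strategy: {strategy!r}")


panel: Panel = {
    "a": [1.0, None, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [None, 2.0, 2.0],
}

for strategy in ("skip", "zero", "impute"):
    print(strategy, handle_missing(panel, strategy))
```

The skip option shrinks the donor pool, while zero and impute keep every unit at the cost of introducing filled-in values, which is worth keeping in mind when choosing a setting.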
synapse.ml.causal.SyntheticDiffInDiffEstimator module
- class synapse.ml.causal.SyntheticDiffInDiffEstimator.SyntheticDiffInDiffEstimator(java_obj=None, epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None, zeta=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
epsilon (float) – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
handleMissingOutcome (str) – How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
localSolverThreshold (long) – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
numIterNoChange (int) – Early termination when number of iterations without change reached.
stepSize (float) – Step size to be used for each iteration of optimization (> 0)
timeCol (str) – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
tol (float) – the convergence tolerance for iterative algorithms (>= 0)
unitCol (str) – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
zeta (float) – The zeta value for regularization term when fitting unit weights. If not specified, a default value will be computed based on formula (2.2) specified in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large scale data, one may want to tune the zeta value, minimizing the loss of the unit weights regression.
- epsilon = Param(parent='undefined', name='epsilon', doc='This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.')
- getEpsilon()[source]
- Returns
This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
- Return type
epsilon
- getHandleMissingOutcome()[source]
- Returns
How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
- Return type
handleMissingOutcome
- getLocalSolverThreshold()[source]
- Returns
threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
- Return type
localSolverThreshold
- getNumIterNoChange()[source]
- Returns
Early termination when number of iterations without change reached.
- Return type
numIterNoChange
- getStepSize()[source]
- Returns
Step size to be used for each iteration of optimization (> 0)
- Return type
stepSize
- getTimeCol()[source]
- Returns
Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
- Return type
timeCol
- getUnitCol()[source]
- Returns
Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- Return type
unitCol
- getZeta()[source]
- Returns
The zeta value for regularization term when fitting unit weights. If not specified, a default value will be computed based on formula (2.2) specified in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large scale data, one may want to tune the zeta value, minimizing the loss of the unit weights regression.
- Return type
zeta
- handleMissingOutcome = Param(parent='undefined', name='handleMissingOutcome', doc='How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)')
- localSolverThreshold = Param(parent='undefined', name='localSolverThreshold', doc='threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.')
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- numIterNoChange = Param(parent='undefined', name='numIterNoChange', doc='Early termination when number of iterations without change reached.')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- postTreatmentCol = Param(parent='undefined', name='postTreatmentCol', doc='post treatment indicator column')
- setEpsilon(value)[source]
- Parameters
epsilon – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
- setHandleMissingOutcome(value)[source]
- Parameters
handleMissingOutcome – How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
- setLocalSolverThreshold(value)[source]
- Parameters
localSolverThreshold – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
- setNumIterNoChange(value)[source]
- Parameters
numIterNoChange – Early termination when number of iterations without change reached.
- setParams(epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None, zeta=None)[source]
Set the (keyword only) parameters
- setStepSize(value)[source]
- Parameters
stepSize – Step size to be used for each iteration of optimization (> 0)
- setTimeCol(value)[source]
- Parameters
timeCol – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
- setUnitCol(value)[source]
- Parameters
unitCol – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- setZeta(value)[source]
- Parameters
zeta – The zeta value for regularization term when fitting unit weights. If not specified, a default value will be computed based on formula (2.2) specified in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large scale data, one may want to tune the zeta value, minimizing the loss of the unit weights regression.
- stepSize = Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization (> 0)')
- timeCol = Param(parent='undefined', name='timeCol', doc='Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.')
- tol = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)')
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- unitCol = Param(parent='undefined', name='unitCol', doc='Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.')
- zeta = Param(parent='undefined', name='zeta', doc='The zeta value for regularization term when fitting unit weights. If not specified, a default value will be computed based on formula (2.2) specified in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large scale data, one may want to tune the zeta value, minimizing the loss of the unit weights regression.')
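The stepSize, maxIter, tol and numIterNoChange parameters describe a familiar pattern for iterative solvers: take fixed-size steps, and stop early once the loss has failed to improve by more than the tolerance for a given number of consecutive iterations. The sketch below shows that pattern on a one-dimensional gradient descent in plain Python. It is a generic illustration under stated assumptions: the function minimize, its stopping rule and the quadratic toy loss are hypothetical, and the solver used by the estimators may apply a different criterion.

```python
from typing import Callable, Optional, Tuple


def minimize(
    grad: Callable[[float], float],
    loss: Callable[[float], float],
    x0: float,
    step_size: float = 1.0,
    max_iter: int = 100,
    tol: float = 0.001,
    num_iter_no_change: Optional[int] = None,
) -> Tuple[float, int]:
    """Gradient descent that returns (solution, iterations actually run)."""
    x = x0
    best = loss(x)
    stale = 0  # consecutive iterations without an improvement larger than tol
    for iteration in range(1, max_iter + 1):
        x = x - step_size * grad(x)
        current = loss(x)
        if best - current > tol:
            stale = 0
        else:
            stale += 1
        best = min(best, current)
        if num_iter_no_change is not None and stale >= num_iter_no_change:
            return x, iteration
    return x, max_iter


# Toy problem: minimise (x - 3)^2 starting from 0.
def toy_loss(x: float) -> float:
    return (x - 3.0) ** 2


def toy_grad(x: float) -> float:
    return 2.0 * (x - 3.0)


solution, iterations = minimize(
    toy_grad, toy_loss, x0=0.0, step_size=0.25, num_iter_no_change=2
)
print(solution, iterations)
```

Leaving num_iter_no_change as None in this sketch disables early termination, so the loop always runs for max_iter iterations.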
Module contents
SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.
SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.