synapse.ml.causal package
Submodules
synapse.ml.causal.DiffInDiffEstimator module
- class synapse.ml.causal.DiffInDiffEstimator.DiffInDiffEstimator(java_obj=None, outcomeCol=None, postTreatmentCol=None, treatmentCol=None)[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator
- Parameters:
- getPostTreatmentCol()[source]
- Returns:
post treatment indicator column
- Return type:
postTreatmentCol
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- postTreatmentCol = Param(parent='undefined', name='postTreatmentCol', doc='post treatment indicator column')
- setParams(outcomeCol=None, postTreatmentCol=None, treatmentCol=None)[source]
Set the (keyword only) parameters
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
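As a rough illustration of what a difference-in-differences estimate measures, the sketch below computes one in plain Python from the three columns this estimator is configured with. It does not use SynapseML or Spark. The row layout, the sample values, and the helper `diff_in_diff` are hypothetical, and the treatment and post-treatment columns are assumed to hold 0/1 indicators.

```python
from statistics import mean

def diff_in_diff(rows, outcome_col, treatment_col, post_treatment_col):
    """Return (treated post - treated pre) - (control post - control pre)."""
    def group_mean(treated, post):
        # Mean outcome for one of the four treatment/period cells.
        return mean(
            r[outcome_col] for r in rows
            if r[treatment_col] == treated and r[post_treatment_col] == post
        )
    treated_change = group_mean(1, 1) - group_mean(1, 0)
    control_change = group_mean(0, 1) - group_mean(0, 0)
    return treated_change - control_change

# Hypothetical data: two observations per cell.
rows = [
    {"outcome": 10.0, "treatment": 1, "postTreatment": 0},
    {"outcome": 12.0, "treatment": 1, "postTreatment": 0},
    {"outcome": 18.0, "treatment": 1, "postTreatment": 1},
    {"outcome": 20.0, "treatment": 1, "postTreatment": 1},
    {"outcome": 9.0, "treatment": 0, "postTreatment": 0},
    {"outcome": 11.0, "treatment": 0, "postTreatment": 0},
    {"outcome": 12.0, "treatment": 0, "postTreatment": 1},
    {"outcome": 14.0, "treatment": 0, "postTreatment": 1},
]
effect = diff_in_diff(rows, "outcome", "treatment", "postTreatment")
print(effect)  # treated change (8.0) minus control change (3.0) -> 5.0
```

The control group's change stands in for what would have happened to the treated group without treatment, which is why only the difference of the two changes is reported.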
synapse.ml.causal.DiffInDiffModel module
synapse.ml.causal.DoubleMLEstimator module
- class synapse.ml.causal.DoubleMLEstimator.DoubleMLEstimator(java_obj=None, confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator
- Parameters:
- confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
- getConfidenceLevel()[source]
- Returns:
confidence level, default value is 0.975
- Return type:
confidenceLevel
- getParallelism()[source]
- Returns:
the number of threads to use when running parallel algorithms
- Return type:
parallelism
- getSampleSplitRatio()[source]
- Returns:
Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- Return type:
sampleSplitRatio
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
- parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
- sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
- setConfidenceLevel(value)[source]
- Parameters:
confidenceLevel – confidence level, default value is 0.975
- setParallelism(value)[source]
- Parameters:
parallelism – the number of threads to use when running parallel algorithms
- setParams(confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]
Set the (keyword only) parameters
- setSampleSplitRatio(value)[source]
- Parameters:
sampleSplitRatio – Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
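The sketch below illustrates the general double machine learning idea behind the `treatmentModel`, `outcomeModel`, and `sampleSplitRatio` parameters: fit the two nuisance models on one part of the data, compute residuals on the other part, regress outcome residuals on treatment residuals, then swap the parts and average. It is a plain-Python approximation under stated assumptions, not SynapseML's implementation. The one-variable least-squares fits stand in for whatever models are supplied, and the simulated data (true effect 2.0) and all helper names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predictor function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def residual_effect(train, test):
    """Fit nuisance models on `train`, residualize `test`, regress residual on residual."""
    x_tr, t_tr, y_tr = zip(*train)
    treatment_model = fit_line(x_tr, t_tr)  # stand-in for treatmentModel
    outcome_model = fit_line(x_tr, y_tr)    # stand-in for outcomeModel
    t_res = [t - treatment_model(x) for x, t, _ in test]
    y_res = [y - outcome_model(x) for x, _, y in test]
    # Slope of a no-intercept regression of outcome residuals on treatment residuals.
    return sum(a * b for a, b in zip(t_res, y_res)) / sum(a * a for a in t_res)

def double_ml(rows, split_ratio=(0.5, 0.5), seed=0):
    """Cross-fit over two random parts and average the two effect estimates."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio[0] / sum(split_ratio))
    part_a, part_b = shuffled[:cut], shuffled[cut:]
    return (residual_effect(part_a, part_b) + residual_effect(part_b, part_a)) / 2

# Simulated rows (confounder x, treatment t, outcome y) with a true effect of 2.0.
rng = random.Random(42)
rows = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    t = 0.5 * x + rng.gauss(0.0, 1.0)
    y = 2.0 * t + 3.0 * x + rng.gauss(0.0, 1.0)
    rows.append((x, t, y))

estimate = double_ml(rows)
print(round(estimate, 2))
```

Residualizing on held-out data keeps the nuisance models from overfitting the rows they are later evaluated on, which is the purpose of the split.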
synapse.ml.causal.DoubleMLModel module
- class synapse.ml.causal.DoubleMLModel.DoubleMLModel(java_obj=None, confidenceLevel=0.975, featuresCol=None, maxIter=1, outcomeCol=None, outcomeModel=None, parallelism=10, rawTreatmentEffects=None, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, weightCol=None)[source]
Bases: _DoubleMLModel
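One plausible reading of how `rawTreatmentEffects` and `confidenceLevel` relate is sketched below. It assumes, without confirmation from this page, that the point estimate is the mean of the per-iteration effects and that the interval bounds are the (1 − confidenceLevel) and confidenceLevel percentiles. Under that assumption the default of 0.975 gives a two-sided 95% interval. The helpers `percentile` and `summarize` and the sample values are hypothetical.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 100] over sorted values."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def summarize(raw_effects, confidence_level=0.975):
    """Return (mean, lower bound, upper bound) under the assumption stated above."""
    ordered = sorted(raw_effects)
    avg = sum(ordered) / len(ordered)
    lower = percentile(ordered, 100 * (1 - confidence_level))
    upper = percentile(ordered, 100 * confidence_level)
    return avg, lower, upper

# Hypothetical effects from 41 iterations: 0.0, 1.0, ..., 40.0.
raw_effects = [float(i) for i in range(41)]
avg, lower, upper = summarize(raw_effects)
print(avg, round(lower, 6), round(upper, 6))
```

With `maxIter=1` there is only one raw effect, so both bounds collapse onto it; a meaningful interval needs several iterations.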
synapse.ml.causal.OrthoForestDMLEstimator module
- class synapse.ml.causal.OrthoForestDMLEstimator.OrthoForestDMLEstimator(java_obj=None, confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator
- Parameters:
- confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
- confounderVecCol = Param(parent='undefined', name='confounderVecCol', doc='Confounders to control for')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
- getConfidenceLevel()[source]
- Returns:
confidence level, default value is 0.975
- Return type:
confidenceLevel
- getHeterogeneityVecCol()[source]
- Returns:
Vector to divide the treatment by
- Return type:
heterogeneityVecCol
- getParallelism()[source]
- Returns:
the number of threads to use when running parallel algorithms
- Return type:
parallelism
- getSampleSplitRatio()[source]
- Returns:
Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- Return type:
sampleSplitRatio
- getTreatmentResidualCol()[source]
- Returns:
Treatment Residual Column
- Return type:
treatmentResidualCol
- heterogeneityVecCol = Param(parent='undefined', name='heterogeneityVecCol', doc='Vector to divide the treatment by')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max Depth of Tree')
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- minSamplesLeaf = Param(parent='undefined', name='minSamplesLeaf', doc='Minimum number of samples per leaf')
- numTrees = Param(parent='undefined', name='numTrees', doc='Number of trees')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
- outcomeResidualCol = Param(parent='undefined', name='outcomeResidualCol', doc='Outcome Residual Column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- outputHighCol = Param(parent='undefined', name='outputHighCol', doc='Output Confidence Interval High')
- outputLowCol = Param(parent='undefined', name='outputLowCol', doc='Output Confidence Interval Low')
- parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
- sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
- setConfidenceLevel(value)[source]
- Parameters:
confidenceLevel – confidence level, default value is 0.975
- setHeterogeneityVecCol(value)[source]
- Parameters:
heterogeneityVecCol – Vector to divide the treatment by
- setParallelism(value)[source]
- Parameters:
parallelism – the number of threads to use when running parallel algorithms
- setParams(confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]
Set the (keyword only) parameters
- setSampleSplitRatio(value)[source]
- Parameters:
sampleSplitRatio – Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- setTreatmentResidualCol(value)[source]
- Parameters:
treatmentResidualCol – Treatment Residual Column
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
- treatmentResidualCol = Param(parent='undefined', name='treatmentResidualCol', doc='Treatment Residual Column')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
synapse.ml.causal.OrthoForestDMLModel module
- class synapse.ml.causal.OrthoForestDMLModel.OrthoForestDMLModel(java_obj=None, confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, forest=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaModel
- Parameters:
confidenceLevel (float) – confidence level, default value is 0.975
forest (object) – Forest Trees produced in Ortho Forest DML Estimator
heterogeneityVecCol (str) – Vector to divide the treatment by
parallelism (int) – the number of threads to use when running parallel algorithms
sampleSplitRatio (list) – Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- confidenceLevel = Param(parent='undefined', name='confidenceLevel', doc='confidence level, default value is 0.975')
- confounderVecCol = Param(parent='undefined', name='confounderVecCol', doc='Confounders to control for')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='The name of the features column')
- forest = Param(parent='undefined', name='forest', doc='Forest Trees produced in Ortho Forest DML Estimator')
- getConfidenceLevel()[source]
- Returns:
confidence level, default value is 0.975
- Return type:
confidenceLevel
- getForest()[source]
- Returns:
Forest Trees produced in Ortho Forest DML Estimator
- Return type:
forest
- getHeterogeneityVecCol()[source]
- Returns:
Vector to divide the treatment by
- Return type:
heterogeneityVecCol
- getParallelism()[source]
- Returns:
the number of threads to use when running parallel algorithms
- Return type:
parallelism
- getSampleSplitRatio()[source]
- Returns:
Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- Return type:
sampleSplitRatio
- getTreatmentResidualCol()[source]
- Returns:
Treatment Residual Column
- Return type:
treatmentResidualCol
- heterogeneityVecCol = Param(parent='undefined', name='heterogeneityVecCol', doc='Vector to divide the treatment by')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max Depth of Tree')
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- minSamplesLeaf = Param(parent='undefined', name='minSamplesLeaf', doc='Minimum number of samples per leaf')
- numTrees = Param(parent='undefined', name='numTrees', doc='Number of trees')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- outcomeModel = Param(parent='undefined', name='outcomeModel', doc='outcome model to run')
- outcomeResidualCol = Param(parent='undefined', name='outcomeResidualCol', doc='Outcome Residual Column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- outputHighCol = Param(parent='undefined', name='outputHighCol', doc='Output Confidence Interval High')
- outputLowCol = Param(parent='undefined', name='outputLowCol', doc='Output Confidence Interval Low')
- parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms')
- sampleSplitRatio = Param(parent='undefined', name='sampleSplitRatio', doc='Sample split ratio for cross-fitting. Default: [0.5, 0.5].')
- setConfidenceLevel(value)[source]
- Parameters:
confidenceLevel – confidence level, default value is 0.975
- setHeterogeneityVecCol(value)[source]
- Parameters:
heterogeneityVecCol – Vector to divide the treatment by
- setParallelism(value)[source]
- Parameters:
parallelism – the number of threads to use when running parallel algorithms
- setParams(confidenceLevel=0.975, confounderVecCol='XW', featuresCol=None, forest=None, heterogeneityVecCol='X', maxDepth=5, maxIter=1, minSamplesLeaf=10, numTrees=20, outcomeCol=None, outcomeModel=None, outcomeResidualCol='OutcomeResidual', outputCol='EffectAverage', outputHighCol='EffectUpperBound', outputLowCol='EffectLowerBound', parallelism=10, sampleSplitRatio=[0.5, 0.5], treatmentCol=None, treatmentModel=None, treatmentResidualCol='TreatmentResidual', weightCol=None)[source]
Set the (keyword only) parameters
- setSampleSplitRatio(value)[source]
- Parameters:
sampleSplitRatio – Sample split ratio for cross-fitting. Default: [0.5, 0.5].
- setTreatmentResidualCol(value)[source]
- Parameters:
treatmentResidualCol – Treatment Residual Column
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- treatmentModel = Param(parent='undefined', name='treatmentModel', doc='treatment model to run')
- treatmentResidualCol = Param(parent='undefined', name='treatmentResidualCol', doc='Treatment Residual Column')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
synapse.ml.causal.OrthoForestVariableTransformer module
- class synapse.ml.causal.OrthoForestVariableTransformer.OrthoForestVariableTransformer(java_obj=None, outcomeResidualCol='OResid', outputCol='_tmp_tsOutcome', treatmentResidualCol='TResid', weightsCol='_tmp_twOutcome')[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaTransformer
- Parameters:
- getTreatmentResidualCol()[source]
- Returns:
Treatment Residual Col
- Return type:
treatmentResidualCol
- outcomeResidualCol = Param(parent='undefined', name='outcomeResidualCol', doc='Outcome Residual Col')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- setParams(outcomeResidualCol='OResid', outputCol='_tmp_tsOutcome', treatmentResidualCol='TResid', weightsCol='_tmp_twOutcome')[source]
Set the (keyword only) parameters
- treatmentResidualCol = Param(parent='undefined', name='treatmentResidualCol', doc='Treatment Residual Col')
- weightsCol = Param(parent='undefined', name='weightsCol', doc='Weights Col')
synapse.ml.causal.ResidualTransformer module
- class synapse.ml.causal.ResidualTransformer.ResidualTransformer(java_obj=None, classIndex=1, observedCol='label', outputCol='residual', predictedCol='prediction')[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaTransformer
- Parameters:
- classIndex = Param(parent='undefined', name='classIndex', doc='The index of the class to compute residual for classification outputs. Default value is 1.')
- getClassIndex()[source]
- Returns:
The index of the class to compute residual for classification outputs. Default value is 1.
- Return type:
classIndex
- getPredictedCol()[source]
- Returns:
predicted data (prediction or probability columns)
- Return type:
predictedCol
- observedCol = Param(parent='undefined', name='observedCol', doc='observed data (label column)')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- predictedCol = Param(parent='undefined', name='predictedCol', doc='predicted data (prediction or probability columns)')
- setClassIndex(value)[source]
- Parameters:
classIndex – The index of the class to compute residual for classification outputs. Default value is 1.
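The parameter descriptions above suggest a residual of the form observed minus predicted, with `classIndex` selecting one entry when the prediction is a vector of class probabilities. The plain-Python sketch below follows that reading. The helper `residual` is hypothetical, operates on single values rather than DataFrame columns, and is not SynapseML's code.

```python
def residual(observed, predicted, class_index=1):
    """Return observed - predicted, indexing into a probability vector if given one."""
    if isinstance(predicted, (list, tuple)):
        # Classification output: use the probability of the chosen class.
        predicted = predicted[class_index]
    return float(observed) - float(predicted)

# Regression-style output: label 3.0, prediction 2.5.
regression_residual = residual(3.0, 2.5)

# Classification-style output: label 1, probabilities for classes 0 and 1.
classification_residual = residual(1, [0.3, 0.7])

print(regression_residual, round(classification_residual, 6))
```

For a binary label, the default index of 1 makes the residual the gap between the 0/1 label and the predicted probability of the positive class.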
synapse.ml.causal.SyntheticControlEstimator module
- class synapse.ml.causal.SyntheticControlEstimator.SyntheticControlEstimator(java_obj=None, epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None)[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator
- Parameters:
epsilon (float) – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
handleMissingOutcome (str) – How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
localSolverThreshold (long) – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
numIterNoChange (int) – Early termination when number of iterations without change reached.
stepSize (float) – Step size to be used for each iteration of optimization (> 0)
timeCol (str) – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
tol (float) – the convergence tolerance for iterative algorithms (>= 0)
unitCol (str) – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- epsilon = Param(parent='undefined', name='epsilon', doc='This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.')
- getEpsilon()[source]
- Returns:
This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
- Return type:
epsilon
- getHandleMissingOutcome()[source]
- Returns:
How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
- Return type:
handleMissingOutcome
- getLocalSolverThreshold()[source]
- Returns:
threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
- Return type:
localSolverThreshold
- getNumIterNoChange()[source]
- Returns:
Early termination when number of iterations without change reached.
- Return type:
numIterNoChange
- getPostTreatmentCol()[source]
- Returns:
post treatment indicator column
- Return type:
postTreatmentCol
- getStepSize()[source]
- Returns:
Step size to be used for each iteration of optimization (> 0)
- Return type:
stepSize
- getTimeCol()[source]
- Returns:
Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
- Return type:
timeCol
- getTol()[source]
- Returns:
the convergence tolerance for iterative algorithms (>= 0)
- Return type:
tol
- getUnitCol()[source]
- Returns:
Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- Return type:
unitCol
- handleMissingOutcome = Param(parent='undefined', name='handleMissingOutcome', doc='How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)')
- localSolverThreshold = Param(parent='undefined', name='localSolverThreshold', doc='threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.')
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- numIterNoChange = Param(parent='undefined', name='numIterNoChange', doc='Early termination when number of iterations without change reached.')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- postTreatmentCol = Param(parent='undefined', name='postTreatmentCol', doc='post treatment indicator column')
- setEpsilon(value)[source]
- Parameters:
epsilon – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
- setHandleMissingOutcome(value)[source]
- Parameters:
handleMissingOutcome – How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
- setLocalSolverThreshold(value)[source]
- Parameters:
localSolverThreshold – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
- setNumIterNoChange(value)[source]
- Parameters:
numIterNoChange – Early termination when number of iterations without change reached.
- setParams(epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None)[source]
Set the (keyword only) parameters
- setStepSize(value)[source]
- Parameters:
stepSize – Step size to be used for each iteration of optimization (> 0)
- setTimeCol(value)[source]
- Parameters:
timeCol – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
- setUnitCol(value)[source]
- Parameters:
unitCol – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- stepSize = Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization (> 0)')
- timeCol = Param(parent='undefined', name='timeCol', doc='Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.')
- tol = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)')
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- unitCol = Param(parent='undefined', name='unitCol', doc='Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.')
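The three `handleMissingOutcome` options can be pictured on a single unit's outcome series, as in the plain-Python sketch below. It is an interpretation of the parameter description rather than SynapseML's code: `None` marks a missing outcome, "nearest available outcomes" is read as the closest non-missing values before and after the gap, and the helper `handle_missing` is hypothetical.

```python
def handle_missing(series, strategy="zero"):
    """Apply one missing-outcome strategy to a single unit's time-ordered outcomes."""
    if strategy == "skip":
        # The whole unit is dropped if any outcome is missing.
        return None if any(v is None for v in series) else list(series)
    if strategy == "zero":
        return [0.0 if v is None else v for v in series]
    if strategy == "impute":
        result = []
        for i, v in enumerate(series):
            if v is not None:
                result.append(v)
                continue
            # Closest non-missing outcome on each side of the gap, if any.
            before = next((series[j] for j in range(i - 1, -1, -1)
                           if series[j] is not None), None)
            after = next((series[j] for j in range(i + 1, len(series))
                          if series[j] is not None), None)
            neighbours = [n for n in (before, after) if n is not None]
            if not neighbours:
                raise ValueError("cannot impute a series with no observed outcomes")
            # Mean of both neighbours when two exist, otherwise the single one.
            result.append(sum(neighbours) / len(neighbours))
        return result
    raise ValueError(f"unknown strategy: {strategy}")

series = [1.0, None, 3.0, None]
skipped = handle_missing(series, "skip")
zero_filled = handle_missing(series, "zero")
imputed = handle_missing(series, "impute")
print(skipped, zero_filled, imputed)
```

Zero filling keeps every unit but can distort outcomes that are far from zero, which is one reason to consider `skip` or `impute` for such data.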
synapse.ml.causal.SyntheticDiffInDiffEstimator module
- class synapse.ml.causal.SyntheticDiffInDiffEstimator.SyntheticDiffInDiffEstimator(java_obj=None, epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None, zeta=None)[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator
- Parameters:
epsilon (float) – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
handleMissingOutcome (str) – How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
localSolverThreshold (long) – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
numIterNoChange (int) – Early termination when number of iterations without change reached.
stepSize (float) – Step size to be used for each iteration of optimization (> 0)
timeCol (str) – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
tol (float) – the convergence tolerance for iterative algorithms (>= 0)
unitCol (str) – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
zeta (float) – The zeta value for regularization term when fitting unit weights. If not specified, a default value will be computed based on formula (2.2) specified in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large scale data, one may want to tune the zeta value, minimizing the loss of the unit weights regression.
- epsilon = Param(parent='undefined', name='epsilon', doc='This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.')
- getEpsilon()[source]
- Returns:
This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
- Return type:
epsilon
- getHandleMissingOutcome()[source]
- Returns:
How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
- Return type:
handleMissingOutcome
- getLocalSolverThreshold()[source]
- Returns:
threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
- Return type:
localSolverThreshold
- getNumIterNoChange()[source]
- Returns:
Early termination when number of iterations without change reached.
- Return type:
numIterNoChange
- getPostTreatmentCol()[source]
- Returns:
post treatment indicator column
- Return type:
postTreatmentCol
- getStepSize()[source]
- Returns:
Step size to be used for each iteration of optimization (> 0)
- Return type:
stepSize
- getTimeCol()[source]
- Returns:
Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
- Return type:
timeCol
- getTol()[source]
- Returns:
the convergence tolerance for iterative algorithms (>= 0)
- Return type:
tol
- getUnitCol()[source]
- Returns:
Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- Return type:
unitCol
- getZeta()[source]
- Returns:
The zeta value for regularization term when fitting unit weights. If not specified, a default value will be computed based on formula (2.2) specified in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large scale data, one may want to tune the zeta value, minimizing the loss of the unit weights regression.
- Return type:
zeta
- handleMissingOutcome = Param(parent='undefined', name='handleMissingOutcome', doc='How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)')
- localSolverThreshold = Param(parent='undefined', name='localSolverThreshold', doc='threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.')
- maxIter = Param(parent='undefined', name='maxIter', doc='maximum number of iterations (>= 0)')
- numIterNoChange = Param(parent='undefined', name='numIterNoChange', doc='Early termination when number of iterations without change reached.')
- outcomeCol = Param(parent='undefined', name='outcomeCol', doc='outcome column')
- postTreatmentCol = Param(parent='undefined', name='postTreatmentCol', doc='post treatment indicator column')
- setEpsilon(value)[source]
- Parameters:
epsilon – This value is added to the weights when we fit the final linear model for SyntheticControlEstimator and SyntheticDiffInDiffEstimator in order to avoid zero weights.
- setHandleMissingOutcome(value)[source]
- Parameters:
handleMissingOutcome – How to handle missing outcomes. Options are skip (which will filter out units with missing outcomes), zero (fill in missing outcomes with zero), or impute (impute with nearest available outcomes, or mean if two nearest outcomes are available)
- setLocalSolverThreshold(value)[source]
- Parameters:
localSolverThreshold – threshold for using local solver on driver node. Local solver is faster but relies on part of data being collected on driver node.
- setNumIterNoChange(value)[source]
- Parameters:
numIterNoChange – Early termination when number of iterations without change reached.
- setParams(epsilon=1e-10, handleMissingOutcome='zero', localSolverThreshold=1000000, maxIter=100, numIterNoChange=None, outcomeCol=None, postTreatmentCol=None, stepSize=1.0, timeCol=None, tol=0.001, treatmentCol=None, unitCol=None, zeta=None)[source]
Set the (keyword only) parameters
- setStepSize(value)[source]
- Parameters:
stepSize – Step size to be used for each iteration of optimization (> 0)
- setTimeCol(value)[source]
- Parameters:
timeCol – Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.
- setUnitCol(value)[source]
- Parameters:
unitCol – Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.
- setZeta(value)[source]
- Parameters:
zeta – The zeta value for regularization term when fitting unit weights. If not specified, a default value will be computed based on formula (2.2) specified in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large scale data, one may want to tune the zeta value, minimizing the loss of the unit weights regression.
- stepSize = Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization (> 0)')
- timeCol = Param(parent='undefined', name='timeCol', doc='Specify the column that identifies the time when outcome is measured in the panel data. For example, if the outcome is measured daily, this column could be the Date column.')
- tol = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0)')
- treatmentCol = Param(parent='undefined', name='treatmentCol', doc='treatment column')
- unitCol = Param(parent='undefined', name='unitCol', doc='Specify the name of the column which contains an identifier for each observed unit in the panel data. For example, if the observed units are users, this column could be the UserId column.')
- zeta = Param(parent='undefined', name='zeta', doc='The zeta value for regularization term when fitting unit weights. If not specified, a default value will be computed based on formula (2.2) specified in https://www.nber.org/system/files/working_papers/w25532/w25532.pdf. For large scale data, one may want to tune the zeta value, minimizing the loss of the unit weights regression.')
Module contents
SynapseML is an ecosystem of tools that extends the distributed computing framework Apache Spark in several new directions. It adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful, highly scalable predictive and analytical models for a variety of data sources.
SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services backed by your Spark cluster.
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.