synapse.ml.isolationforest package

Submodules

synapse.ml.isolationforest.IsolationForest module

class synapse.ml.isolationforest.IsolationForest.IsolationForest(java_obj=None, bootstrap=False, contamination=0.0, contaminationError=0.0, featuresCol='features', maxFeatures=1.0, maxSamples=256.0, numEstimators=100, predictionCol='predictedLabel', randomSeed=1, scoreCol='outlierScore')

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters

bootstrap (bool) – If true, draw the sample for each tree with replacement. If false, sample without replacement.
contamination (float) – The fraction of outliers in the training data set. If this is set to 0.0, it speeds up the training and all predicted labels will be false. The model and outlier scores are otherwise unaffected by this parameter.
contaminationError (float) – The error allowed when calculating the threshold required to achieve the specified contamination fraction. The default is 0.0, which forces an exact calculation of the threshold. The exact calculation is slow and can fail for large datasets. If there are issues with the exact calculation, a good choice for this parameter is often 1% of the specified contamination value.
featuresCol (str) – The name of the column containing the feature vector.
maxFeatures (float) – The number of features used to train each tree. If this value is between 0.0 and 1.0, it is treated as a fraction. If it is greater than 1.0, it is treated as a count.
maxSamples (float) – The number of samples used to train each tree. If this value is between 0.0 and 1.0, it is treated as a fraction. If it is greater than 1.0, it is treated as a count.
numEstimators (int) – The number of trees in the ensemble.
predictionCol (str) – The name of the column for the predicted label.
randomSeed (long) – The seed used for the random number generator.
scoreCol (str) – The name of the column for the outlier score.
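
The fraction-or-count rule described for maxFeatures and maxSamples can be sketched in plain Python. This is an illustration only, not the library's code: `resolve_count` is a hypothetical helper, and treating exactly 1.0 as a fraction, rounding down, and capping a count at the available total are all assumptions.

```python
import math


def resolve_count(value: float, total: int) -> int:
    """Hypothetical helper: turn a maxSamples/maxFeatures-style value into a count."""
    if value <= 0.0:
        raise ValueError("value must be positive")
    if value <= 1.0:
        # Treated as a fraction of the available rows or features.
        # Rounding down is an assumption.
        return math.floor(value * total)
    # Treated as a count. Capping at the total is an assumption.
    return min(math.floor(value), total)


print(resolve_count(256.0, 10_000))  # default maxSamples on 10,000 rows -> 256
print(resolve_count(1.0, 12))        # default maxFeatures on 12 features -> 12
print(resolve_count(0.5, 12))        # half of 12 features -> 6
```

Under these assumptions, the defaults (maxSamples=256.0, maxFeatures=1.0) mean each tree sees 256 rows and every feature.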

bootstrap = Param(parent='undefined', name='bootstrap', doc='If true, draw sample for each tree with replacement. If false, do not sample with replacement.')

contamination = Param(parent='undefined', name='contamination', doc='The fraction of outliers in the training data set. If this is set to 0.0, it speeds up the training and all predicted labels will be false. The model and outlier scores are otherwise unaffected by this parameter.')

contaminationError = Param(parent='undefined', name='contaminationError', doc='The error allowed when calculating the threshold required to achieve the specified contamination fraction. The default is 0.0, which forces an exact calculation of the threshold. The exact calculation is slow and can fail for large datasets. If there are issues with the exact calculation, a good choice for this parameter is often 1% of the specified contamination value.')

featuresCol = Param(parent='undefined', name='featuresCol', doc='The feature vector.')

getBootstrap()

Returns: If true, draw sample for each tree with replacement. If false, do not sample with replacement.
Return type: bool

getContamination()

Returns: The fraction of outliers in the training data set. If this is set to 0.0, it speeds up the training and all predicted labels will be false. The model and outlier scores are otherwise unaffected by this parameter.
Return type: float

getContaminationError()

Returns: The error allowed when calculating the threshold required to achieve the specified contamination fraction. The default is 0.0, which forces an exact calculation of the threshold. The exact calculation is slow and can fail for large datasets. If there are issues with the exact calculation, a good choice for this parameter is often 1% of the specified contamination value.
Return type: float
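
The 1% rule of thumb for contaminationError can be written out as a small calculation. Both helpers below are hypothetical, and the second one assumes the error acts as an absolute tolerance on the achieved outlier fraction, which the description does not state explicitly.

```python
def suggested_contamination_error(contamination: float) -> float:
    """Rule of thumb from the parameter description: 1% of the contamination value."""
    return 0.01 * contamination


def tolerated_fraction_range(contamination: float, contamination_error: float):
    """Assumed interpretation: the achieved outlier fraction may land
    anywhere in [contamination - error, contamination + error]."""
    low = max(0.0, contamination - contamination_error)
    high = min(1.0, contamination + contamination_error)
    return low, high


contamination = 0.05
error = suggested_contamination_error(contamination)
low, high = tolerated_fraction_range(contamination, error)
print(f"contaminationError={error:.4f}, outlier fraction in [{low:.4f}, {high:.4f}]")
```

With contaminationError left at 0.0 the range collapses to a single value, which matches the description of the default as an exact calculation.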

getFeaturesCol()

Returns: The feature vector.
Return type: str

static getJavaPackage(): Returns package name String.

getMaxFeatures()

Returns: The number of features used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count.
Return type: float

getMaxSamples()

Returns: The number of samples used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count.
Return type: float

getNumEstimators()

Returns: The number of trees in the ensemble.
Return type: int

getPredictionCol()

Returns: The predicted label.
Return type: str

getRandomSeed()

Returns: The seed used for the random number generator.
Return type: long

getScoreCol()

Returns: The outlier score.
Return type: str

maxFeatures = Param(parent='undefined', name='maxFeatures', doc='The number of features used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count.')

maxSamples = Param(parent='undefined', name='maxSamples', doc='The number of samples used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count.')

numEstimators = Param(parent='undefined', name='numEstimators', doc='The number of trees in the ensemble.')

predictionCol = Param(parent='undefined', name='predictionCol', doc='The predicted label.')

randomSeed = Param(parent='undefined', name='randomSeed', doc='The seed used for the random number generator.')

classmethod read(): Returns an MLReader instance for this class.

scoreCol = Param(parent='undefined', name='scoreCol', doc='The outlier score.')

setBootstrap(value)

Parameters: bootstrap – If true, draw sample for each tree with replacement. If false, do not sample with replacement.

setContamination(value)

Parameters: contamination – The fraction of outliers in the training data set. If this is set to 0.0, it speeds up the training and all predicted labels will be false. The model and outlier scores are otherwise unaffected by this parameter.

setContaminationError(value)

Parameters: contaminationError – The error allowed when calculating the threshold required to achieve the specified contamination fraction. The default is 0.0, which forces an exact calculation of the threshold. The exact calculation is slow and can fail for large datasets. If there are issues with the exact calculation, a good choice for this parameter is often 1% of the specified contamination value.

setFeaturesCol(value)

Parameters: featuresCol – The feature vector.

setMaxFeatures(value)

Parameters: maxFeatures – The number of features used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count.

setMaxSamples(value)

Parameters: maxSamples – The number of samples used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count.

setNumEstimators(value)

Parameters: numEstimators – The number of trees in the ensemble.

setParams(bootstrap=False, contamination=0.0, contaminationError=0.0, featuresCol='features', maxFeatures=1.0, maxSamples=256.0, numEstimators=100, predictionCol='predictedLabel', randomSeed=1, scoreCol='outlierScore'): Set the (keyword only) parameters

setPredictionCol(value)

Parameters: predictionCol – The predicted label.

setRandomSeed(value)

Parameters: randomSeed – The seed used for the random number generator.

setScoreCol(value)

Parameters: scoreCol – The outlier score.
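
A minimal sketch of configuring the estimator with the keyword arguments from the constructor signature. It assumes SynapseML and PySpark are installed and a Spark session is running. `isolation_forest_kwargs`, `build_isolation_forest`, and the DataFrame `train_df` are hypothetical names used only for illustration, and the `fit`/`transform` calls in the comments assume the class behaves as a standard Spark ML estimator.

```python
def isolation_forest_kwargs(contamination: float = 0.02) -> dict:
    """Keyword arguments matching the documented constructor signature."""
    return {
        "featuresCol": "features",
        "predictionCol": "predictedLabel",
        "scoreCol": "outlierScore",
        "numEstimators": 100,
        "maxSamples": 256.0,
        "maxFeatures": 1.0,
        "bootstrap": False,
        "contamination": contamination,
        # Rule of thumb from the parameter description: 1% of contamination.
        "contaminationError": 0.01 * contamination,
        "randomSeed": 1,
    }


def build_isolation_forest(**overrides):
    """Create the estimator; needs SynapseML and PySpark at call time."""
    # Imported lazily so this file can be loaded without SynapseML installed.
    from synapse.ml.isolationforest.IsolationForest import IsolationForest

    params = {**isolation_forest_kwargs(), **overrides}
    return IsolationForest(**params)


# Usage, assuming a hypothetical DataFrame `train_df` with a vector column
# named "features" (not executed here):
# model = build_isolation_forest(numEstimators=200).fit(train_df)
# scored = model.transform(train_df)
```

Keeping the arguments in a plain dictionary makes the configuration easy to log or override before the estimator is built.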

synapse.ml.isolationforest.IsolationForestModel module

class synapse.ml.isolationforest.IsolationForestModel.IsolationForestModel(java_obj=None, bootstrap=False, contamination=0.0, contaminationError=0.0, featuresCol='features', innerModel=None, maxFeatures=1.0, maxSamples=256.0, numEstimators=100, predictionCol='predictedLabel', randomSeed=1, scoreCol='outlierScore')

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

getInnerModel()

Returns: The fitted isolation forest instance.
Return type: innerModel

getOutlierScoreThreshold()
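
The relationship between an outlier score threshold and the predicted label can be illustrated conceptually. This is a sketch under stated assumptions, not the model's implementation: `label_outliers` is a hypothetical helper, and the inclusive comparison and the 1.0/0.0 label encoding are assumptions. The only behavior taken from the parameter descriptions is that a contamination of 0.0 yields no positive labels.

```python
from typing import List, Optional


def label_outliers(scores: List[float], threshold: Optional[float] = None) -> List[float]:
    """Hypothetical helper: mark scores at or above the threshold as outliers."""
    if threshold is None:
        # No threshold (for example, contamination left at 0.0):
        # nothing is labeled as an outlier.
        return [0.0 for _ in scores]
    # Assumed encoding: 1.0 = outlier, 0.0 = inlier.
    return [1.0 if score >= threshold else 0.0 for score in scores]


scores = [0.41, 0.47, 0.52, 0.68, 0.73]
print(label_outliers(scores, threshold=0.65))  # [0.0, 0.0, 0.0, 1.0, 1.0]
print(label_outliers(scores))                  # [0.0, 0.0, 0.0, 0.0, 0.0]
```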

Module contents

SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.