mmlspark.isolationforest package

Submodules

mmlspark.isolationforest.IsolationForest module

class mmlspark.isolationforest.IsolationForest.IsolationForest(*args, **kwargs)

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters

bootstrap (bool) – If true, draw sample for each tree with replacement. If false, do not sample with replacement. (default: false)
contamination (double) – The fraction of outliers in the training data set. If this is set to 0.0, it speeds up the training and all predicted labels will be false. The model and outlier scores are otherwise unaffected by this parameter. (default: 0.0)
contaminationError (double) – The error allowed when calculating the threshold required to achieve the specified contamination fraction. The default is 0.0, which forces an exact calculation of the threshold. The exact calculation is slow and can fail for large datasets. If there are issues with the exact calculation, a good choice for this parameter is often 1% of the specified contamination value. (default: 0.0)
featuresCol (str) – The feature vector. (default: features)
maxFeatures (double) – The number of features used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count. (default: 1.0)
maxSamples (double) – The number of samples used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count. (default: 256.0)
numEstimators (int) – The number of trees in the ensemble. (default: 100)
predictionCol (str) – The predicted label. (default: predictedLabel)
randomSeed (long) – The seed used for the random number generator. (default: 1)
scoreCol (str) – The outlier score. (default: outlierScore)
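A minimal usage sketch of the parameters above, assuming a Spark environment with the mmlspark package available. The helper name `fit_isolation_forest`, the DataFrame `df`, and the column list `feature_cols` are hypothetical, and the parameter values are illustrative rather than recommended settings.

```python
def fit_isolation_forest(df, feature_cols, contamination=0.02):
    """Fit an IsolationForest on df and return (model, scored DataFrame).

    Hypothetical helper: df is a Spark DataFrame and feature_cols names
    its numeric input columns.
    """
    # Deferred imports, so defining this helper does not require Spark.
    from pyspark.ml.feature import VectorAssembler
    from mmlspark.isolationforest.IsolationForest import IsolationForest

    # Pack the input columns into the single vector column read via featuresCol.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    assembled = assembler.transform(df)

    forest = IsolationForest()
    forest.setFeaturesCol("features")
    forest.setNumEstimators(100)
    forest.setMaxSamples(256.0)
    forest.setContamination(contamination)
    # Following the parameter notes: about 1% of the contamination value.
    forest.setContaminationError(0.01 * contamination)
    forest.setRandomSeed(1)

    model = forest.fit(assembled)
    # The scored frame is expected to carry the scoreCol and predictionCol columns.
    scored = model.transform(assembled)
    return model, scored
```

With the default column names, the scored DataFrame would be expected to expose `outlierScore` and `predictedLabel` alongside the input columns.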

getBootstrap()

Returns: If true, draw sample for each tree with replacement. If false, do not sample with replacement. (default: false)
Return type: bool

getContamination()

Returns: The fraction of outliers in the training data set. If this is set to 0.0, it speeds up the training and all predicted labels will be false. The model and outlier scores are otherwise unaffected by this parameter. (default: 0.0)
Return type: double
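The effect of contamination on predicted labels can be pictured with a small pure-Python sketch. It is not the library's implementation: the function `label_outliers` is hypothetical, and it assumes that higher scores mean "more anomalous" and that the threshold is chosen so that roughly the requested fraction of training scores is flagged.

```python
def label_outliers(scores, contamination):
    """Return one boolean label per score (True marks an outlier)."""
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be a fraction in [0, 1]")
    # With contamination 0.0 no threshold is needed and nothing is flagged.
    n_outliers = int(round(contamination * len(scores)))
    if n_outliers == 0:
        return [False] * len(scores)
    # Assumption: the threshold is the score of the n-th most anomalous point.
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    return [score >= threshold for score in scores]


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4, 0.5, 0.35, 0.45, 0.55]
labels = label_outliers(scores, contamination=0.2)
```

In this sketch the scores themselves never change with contamination; only the cut-off that turns them into labels does, which mirrors the statement that the model and outlier scores are otherwise unaffected.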

getContaminationError()

Returns: The error allowed when calculating the threshold required to achieve the specified contamination fraction. The default is 0.0, which forces an exact calculation of the threshold. The exact calculation is slow and can fail for large datasets. If there are issues with the exact calculation, a good choice for this parameter is often 1% of the specified contamination value. (default: 0.0)
Return type: double
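The "1% of the specified contamination value" guideline amounts to a one-line calculation. The helper below is a hypothetical convenience, not part of mmlspark; its name and the `relative_error` argument are assumptions made for illustration.

```python
def suggested_contamination_error(contamination, relative_error=0.01):
    """Return a contaminationError equal to a fraction of contamination."""
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be a fraction in [0, 1]")
    # The documented guideline is 1% of the contamination value.
    return relative_error * contamination
```

For example, a contamination of 0.05 gives a suggested contaminationError of about 0.0005, while a contamination of 0.0 keeps the error at 0.0.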

getFeaturesCol()

Returns: The feature vector. (default: features)
Return type: str

static getJavaPackage()

Returns the package name as a string.

getMaxFeatures()

Returns: The number of features used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count. (default: 1.0)
Return type: double

getMaxSamples()

Returns: The number of samples used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count. (default: 256.0)
Return type: double
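The fraction-versus-count rule shared by maxFeatures and maxSamples can be sketched as follows. The function `resolve_count` is hypothetical; rounding fractions down is an assumption, and the sketch does not model what the library does when a count exceeds the available rows or features.

```python
def resolve_count(value, total):
    """Translate a maxSamples/maxFeatures style value into an absolute count."""
    if value <= 0.0:
        raise ValueError("value must be positive")
    if value <= 1.0:
        # Treated as a fraction of the total (rounding down is an assumption).
        return int(value * total)
    # Values above 1.0 are treated as an absolute count.
    return int(value)
```

Under this reading, the default maxFeatures of 1.0 selects every feature, and the default maxSamples of 256.0 selects 256 rows per tree regardless of the dataset size.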

getNumEstimators()

Returns: The number of trees in the ensemble. (default: 100)
Return type: int

getPredictionCol()

Returns: The predicted label. (default: predictedLabel)
Return type: str

getRandomSeed()

Returns: The seed used for the random number generator. (default: 1)
Return type: long

getScoreCol()

Returns: The outlier score. (default: outlierScore)
Return type: str

classmethod read()

Returns an MLReader instance for this class.

setBootstrap(value)

Parameters: bootstrap – If true, draw sample for each tree with replacement. If false, do not sample with replacement. (default: false)

setContamination(value)

Parameters: contamination – The fraction of outliers in the training data set. If this is set to 0.0, it speeds up the training and all predicted labels will be false. The model and outlier scores are otherwise unaffected by this parameter. (default: 0.0)

setContaminationError(value)

Parameters: contaminationError – The error allowed when calculating the threshold required to achieve the specified contamination fraction. The default is 0.0, which forces an exact calculation of the threshold. The exact calculation is slow and can fail for large datasets. If there are issues with the exact calculation, a good choice for this parameter is often 1% of the specified contamination value. (default: 0.0)

setFeaturesCol(value)

Parameters: featuresCol – The feature vector. (default: features)

setMaxFeatures(value)

Parameters: maxFeatures – The number of features used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count. (default: 1.0)

setMaxSamples(value)

Parameters: maxSamples – The number of samples used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count. (default: 256.0)

setNumEstimators(value)

Parameters: numEstimators – The number of trees in the ensemble. (default: 100)

setParams(bootstrap=False, contamination=0.0, contaminationError=0.0, featuresCol='features', maxFeatures=1.0, maxSamples=256.0, numEstimators=100, predictionCol='predictedLabel', randomSeed=1, scoreCol='outlierScore')

Set the (keyword-only) parameters.

Parameters

bootstrap (bool) – If true, draw sample for each tree with replacement. If false, do not sample with replacement. (default: false)
contamination (double) – The fraction of outliers in the training data set. If this is set to 0.0, it speeds up the training and all predicted labels will be false. The model and outlier scores are otherwise unaffected by this parameter. (default: 0.0)
contaminationError (double) – The error allowed when calculating the threshold required to achieve the specified contamination fraction. The default is 0.0, which forces an exact calculation of the threshold. The exact calculation is slow and can fail for large datasets. If there are issues with the exact calculation, a good choice for this parameter is often 1% of the specified contamination value. (default: 0.0)
featuresCol (str) – The feature vector. (default: features)
maxFeatures (double) – The number of features used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count. (default: 1.0)
maxSamples (double) – The number of samples used to train each tree. If this value is between 0.0 and 1.0, then it is treated as a fraction. If it is >1.0, then it is treated as a count. (default: 256.0)
numEstimators (int) – The number of trees in the ensemble. (default: 100)
predictionCol (str) – The predicted label. (default: predictedLabel)
randomSeed (long) – The seed used for the random number generator. (default: 1)
scoreCol (str) – The outlier score. (default: outlierScore)
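Because setParams accepts keyword arguments only, a misspelled name is easy to miss until the call fails. The sketch below checks names against the parameters documented above before forwarding them. `apply_params` and `DOCUMENTED_PARAMS` are hypothetical helpers, not part of mmlspark, and any object with a `setParams(**kwargs)` method can stand in for the estimator.

```python
# Parameter names taken from the setParams signature above.
DOCUMENTED_PARAMS = {
    "bootstrap", "contamination", "contaminationError", "featuresCol",
    "maxFeatures", "maxSamples", "numEstimators", "predictionCol",
    "randomSeed", "scoreCol",
}


def apply_params(estimator, **overrides):
    """Validate override names, then pass them to estimator.setParams."""
    unknown = sorted(set(overrides) - DOCUMENTED_PARAMS)
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(unknown))
    # Keyword-only call, matching the documented setParams signature.
    estimator.setParams(**overrides)
    return estimator
```

A call such as `apply_params(forest, numEstimators=200, contamination=0.01)` would forward both values, whereas `apply_params(forest, numTrees=200)` would raise before reaching the estimator.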

setPredictionCol(value)

Parameters: predictionCol – The predicted label. (default: predictedLabel)

setRandomSeed(value)

Parameters: randomSeed – The seed used for the random number generator. (default: 1)

setScoreCol(value)

Parameters: scoreCol – The outlier score. (default: outlierScore)

class mmlspark.isolationforest.IsolationForest.IsolationForestModel(java_model=None)

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by IsolationForest.

static getJavaPackage()

Returns the package name as a string.

classmethod read()

Returns an MLReader instance for this class.
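A sketch of persisting a fitted model and reloading it through the reader returned by read(). It relies on the standard pyspark.ml writer and reader interfaces that the listed base classes provide; the helper names `save_model` and `load_model`, the `model_cls` argument, and the path are hypothetical, and whether a given mmlspark build round-trips this model cleanly is an assumption.

```python
def save_model(model, path):
    """Write a fitted model to path, replacing any existing copy."""
    # write() comes from the writable base class; overwrite() permits replacement.
    model.write().overwrite().save(path)


def load_model(path, model_cls=None):
    """Load a model from path using the reader returned by model_cls.read()."""
    if model_cls is None:
        # Deferred import, so defining this helper does not require Spark.
        from mmlspark.isolationforest.IsolationForest import IsolationForestModel
        model_cls = IsolationForestModel
    return model_cls.read().load(path)
```

Passing `model_cls` explicitly keeps the helper usable with any class that exposes the same `read()` classmethod.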

Module contents

MicrosoftML is a library of Python classes that interface with the Microsoft Scala APIs, which use Apache Spark to create distributed machine learning models.

MicrosoftML simplifies training and scoring classifiers and regressors, and facilitates the creation of models using the CNTK library, images, and text.