synapse.ml.lightgbm package

Submodules

synapse.ml.lightgbm.LightGBMClassificationModel module

class synapse.ml.lightgbm.LightGBMClassificationModel.LightGBMClassificationModel(java_obj=None, actualNumClasses=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictDisableShapeCheck=False, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', startIteration=0, thresholds=None)[source]

Bases: LightGBMModelMixin, _LightGBMClassificationModel

getBoosterNumClasses()[source]

Get the number of classes from the booster.

Returns:

The number of classes.

static loadNativeModelFromFile(filename)[source]

Load the model from a native LightGBM text file.

static loadNativeModelFromString(model)[source]

Load the model from a native LightGBM model string.

synapse.ml.lightgbm.LightGBMClassifier module

class synapse.ml.lightgbm.LightGBMClassifier.LightGBMClassifier(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=-1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator

Parameters:
  • baggingFraction (float) – Bagging fraction

  • baggingFreq (int) – Bagging frequency

  • baggingSeed (int) – Bagging seed

  • binSampleCount (int) – Number of samples considered when computing histogram bins

  • boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence

  • boostingType (str) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

  • catSmooth (float) – Reduces the effect of noise in categorical features, especially for categories with little data

  • categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column

  • categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column

  • catl2 (float) – L2 regularization in categorical split

  • chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

  • dataRandomSeed (int) – Random seed for sampling data to construct histogram bins.

  • dataTransferMode (str) – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The constructor default is streaming; bulk is the legacy mode.

  • defaultListenPort (int) – The default listen port on executors, used for testing

  • deterministic (bool) – Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true.

  • driverListenPort (int) – The listen port on a driver. Default value is 0 (random)

  • dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout

  • dropSeed (int) – Random seed to choose dropping models. Only used in dart.

  • earlyStoppingRound (int) – Early stopping round

  • executionMode (str) – Deprecated. Please use dataTransferMode.

  • extraSeed (int) – Random seed for selecting threshold when extra_trees is true

  • featureFraction (float) – Feature fraction

  • featureFractionByNode (float) – Feature fraction by node

  • featureFractionSeed (int) – Feature fraction seed

  • featuresCol (str) – features column name

  • featuresShapCol (str) – Output SHAP vector column name after prediction containing the feature contribution values

  • fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

  • improvementTolerance (float) – Tolerance to consider improvement in metric

  • initScoreCol (str) – The name of the initial score column, used for continued training

  • isEnableSparse (bool) – Used to enable/disable sparse optimization

  • isProvideTrainingMetric (bool) – Whether to output metric results over the training dataset.

  • isUnbalance (bool) – Set to true if training data is unbalanced in binary classification scenario

  • labelCol (str) – label column name

  • lambdaL1 (float) – L1 regularization

  • lambdaL2 (float) – L2 regularization

  • leafPredictionCol (str) – Column name for the predicted leaf indices

  • learningRate (float) – Learning rate or shrinkage rate

  • matrixType (str) – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

  • maxBin (int) – Max bin

  • maxBinByFeature (list) – Max number of bins for each feature

  • maxCatThreshold (int) – limit number of split points considered for categorical features

  • maxCatToOnehot (int) – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

  • maxDeltaStep (float) – Used to limit the max output of tree leaves

  • maxDepth (int) – Max depth

  • maxDrop (int) – Max number of dropped trees during one boosting iteration

  • maxNumClasses (int) – Maximum number of classes considered when inferring numClass in multi-class classification.

  • maxStreamingOMPThreads (int) – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

  • metric (str) – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

  • microBatchSize (int) – Specify how many elements are sent in a streaming micro-batch.

  • minDataInLeaf (int) – Minimal number of data in one leaf. Can be used to deal with over-fitting.

  • minDataPerBin (int) – Minimal number of data inside one bin

  • minDataPerGroup (int) – minimal number of data per categorical group

  • minGainToSplit (float) – The minimal gain to perform split

  • minSumHessianInLeaf (float) – Minimal sum hessian in one leaf

  • modelString (str) – LightGBM model to retrain

  • monotoneConstraints (list) – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.

  • monotoneConstraintsMethod (str) – Monotone constraints method. basic, intermediate, or advanced.

  • monotonePenalty (float) – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

  • negBaggingFraction (float) – Negative Bagging fraction

  • numBatches (int) – If greater than 0, splits data into separate batches during training

  • numIterations (int) – Number of iterations, LightGBM constructs num_class * num_iterations trees

  • numLeaves (int) – Number of leaves

  • numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

  • numThreads (int) – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

  • objective (str) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

  • objectiveSeed (int) – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

  • otherRate (float) – The retain ratio of small gradient data. Only used in goss.

  • parallelism (str) – Tree learner parallelism, can be set to data_parallel or voting_parallel

  • passThroughArgs (str) – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

  • posBaggingFraction (float) – Positive Bagging fraction

  • predictDisableShapeCheck (bool) – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

  • predictionCol (str) – prediction column name

  • probabilityCol (str) – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities

  • rawPredictionCol (str) – raw prediction (a.k.a. confidence) column name

  • referenceDataset (list) – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

  • repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.

  • samplingMode (str) – Data sampling for streaming mode. Sampled data is used to define bins. Values can be global (sample from all data), subset (sample from the first N rows), or fixed (take the first N rows as the sample). Default is subset.

  • samplingSubsetSize (int) – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

  • seed (int) – Main seed, used to generate other seeds

  • skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration

  • slotNames (list) – List of slot names in the features column

  • thresholds (list) – Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold

  • timeout (float) – Timeout in seconds

  • topK (int) – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It must be greater than 0

  • topRate (float) – The retain ratio of large gradient data. Only used in goss.

  • uniformDrop (bool) – Set this to true to use uniform drop in dart mode

  • useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.

  • useMissing (bool) – Set this to false to disable the special handling of missing values

  • useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

  • validationIndicatorCol (str) – Indicates whether the row is for training or validation

  • verbosity (int) – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug

  • weightCol (str) – The name of the weight column

  • xGBoostDartMode (bool) – Set this to true to use xgboost dart mode

  • zeroAsMissing (bool) – Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
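
A minimal sketch of how the constructor parameters listed above might be supplied, assuming pyspark and SynapseML are installed and a Spark session is active. The chosen values and the `build_classifier` helper are illustrative assumptions, not recommended settings. The import is deferred into the helper so the parameter dictionary can be inspected without Spark.

```python
# Keyword arguments drawn from the constructor signature above.
# The values chosen here are illustrative only.
classifier_params = {
    "objective": "binary",
    "featuresCol": "features",
    "labelCol": "label",
    "numIterations": 100,
    "numLeaves": 31,
    "learningRate": 0.1,
    "isUnbalance": True,
}


def build_classifier(params):
    """Hypothetical helper: construct the estimator from a dict of params.

    Needs pyspark and SynapseML with an active Spark session, so the
    import happens only when the helper is actually called.
    """
    from synapse.ml.lightgbm.LightGBMClassifier import LightGBMClassifier

    return LightGBMClassifier(**params)
```

Given a training DataFrame (here called `train_df`, a hypothetical name) with `features` and `label` columns, `build_classifier(classifier_params).fit(train_df)` would train and return a fitted model.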

baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered when computing histogram bins')
boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
catSmooth = Param(parent='undefined', name='catSmooth', doc='Reduces the effect of noise in categorical features, especially for categories with little data')
categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
catl2 = Param(parent='undefined', name='catl2', doc='L2 regularization in categorical split')
chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native.  If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.')
dataRandomSeed = Param(parent='undefined', name='dataRandomSeed', doc='Random seed for sampling data to construct histogram bins.')
dataTransferMode = Param(parent='undefined', name='dataTransferMode', doc='Specify how SynapseML transfers data from Spark to LightGBM.  Values can be streaming or bulk. The constructor default is streaming; bulk is the legacy mode.')
defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
deterministic = Param(parent='undefined', name='deterministic', doc='Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters.  Note: setting this to true may slow down training.  To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true.')
driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
dropSeed = Param(parent='undefined', name='dropSeed', doc='Random seed to choose dropping models. Only used in dart.')
earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
executionMode = Param(parent='undefined', name='executionMode', doc='Deprecated. Please use dataTransferMode.')
extraSeed = Param(parent='undefined', name='extraSeed', doc='Random seed for selecting threshold when extra_trees is true')
featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
featureFractionByNode = Param(parent='undefined', name='featureFractionByNode', doc='Feature fraction by node')
featureFractionSeed = Param(parent='undefined', name='featureFractionSeed', doc='Feature fraction seed')
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
getBaggingFraction()[source]
Returns:

Bagging fraction

Return type:

baggingFraction

getBaggingFreq()[source]
Returns:

Bagging frequency

Return type:

baggingFreq

getBaggingSeed()[source]
Returns:

Bagging seed

Return type:

baggingSeed

getBinSampleCount()[source]
Returns:

Number of samples considered when computing histogram bins

Return type:

binSampleCount

getBoostFromAverage()[source]
Returns:

Adjusts initial score to the mean of labels for faster convergence

Return type:

boostFromAverage

getBoostingType()[source]
Returns:

Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

Return type:

boostingType

getCatSmooth()[source]
Returns:

Reduces the effect of noise in categorical features, especially for categories with little data

Return type:

catSmooth

getCategoricalSlotIndexes()[source]
Returns:

List of categorical column indexes, the slot index in the features column

Return type:

categoricalSlotIndexes

getCategoricalSlotNames()[source]
Returns:

List of categorical column slot names, the slot name in the features column

Return type:

categoricalSlotNames

getCatl2()[source]
Returns:

L2 regularization in categorical split

Return type:

catl2

getChunkSize()[source]
Returns:

Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

Return type:

chunkSize

getDataRandomSeed()[source]
Returns:

Random seed for sampling data to construct histogram bins.

Return type:

dataRandomSeed

getDataTransferMode()[source]
Returns:

Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The constructor default is streaming; bulk is the legacy mode.

Return type:

dataTransferMode

getDefaultListenPort()[source]
Returns:

The default listen port on executors, used for testing

Return type:

defaultListenPort

getDeterministic()[source]
Returns:

Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true.

Return type:

deterministic

getDriverListenPort()[source]
Returns:

The listen port on a driver. Default value is 0 (random)

Return type:

driverListenPort

getDropRate()[source]
Returns:

Dropout rate: a fraction of previous trees to drop during the dropout

Return type:

dropRate

getDropSeed()[source]
Returns:

Random seed to choose dropping models. Only used in dart.

Return type:

dropSeed

getEarlyStoppingRound()[source]
Returns:

Early stopping round

Return type:

earlyStoppingRound

getExecutionMode()[source]
Returns:

Deprecated. Please use dataTransferMode.

Return type:

executionMode

getExtraSeed()[source]
Returns:

Random seed for selecting threshold when extra_trees is true

Return type:

extraSeed

getFeatureFraction()[source]
Returns:

Feature fraction

Return type:

featureFraction

getFeatureFractionByNode()[source]
Returns:

Feature fraction by node

Return type:

featureFractionByNode

getFeatureFractionSeed()[source]
Returns:

Feature fraction seed

Return type:

featureFractionSeed

getFeaturesCol()[source]
Returns:

features column name

Return type:

featuresCol

getFeaturesShapCol()[source]
Returns:

Output SHAP vector column name after prediction containing the feature contribution values

Return type:

featuresShapCol

getFobj()[source]
Returns:

Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

Return type:

fobj
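
As a rough illustration of the `(preds, train_data) -> (grad, hess)` contract described for fobj, the sketch below computes the gradient and Hessian of binary log loss with respect to raw scores. The `TrainData` class and its `get_label()` method are hypothetical stand-ins; this page does not specify what object SynapseML actually passes as `train_data`.

```python
import math


class TrainData:
    """Hypothetical stand-in for the train_data argument."""

    def __init__(self, labels):
        self._labels = list(labels)

    def get_label(self):
        return self._labels


def binary_logloss_objective(preds, train_data):
    """Return (grad, hess) of binary log loss with respect to raw scores."""
    labels = train_data.get_label()
    grad, hess = [], []
    for raw_score, label in zip(preds, labels):
        prob = 1.0 / (1.0 + math.exp(-raw_score))  # sigmoid of the raw score
        grad.append(prob - label)                  # first derivative
        hess.append(prob * (1.0 - prob))           # second derivative
    return grad, hess


grad, hess = binary_logloss_objective([0.0, 2.0], TrainData([1, 0]))
```

For the first row, the raw score 0.0 maps to probability 0.5, so the gradient is -0.5 and the Hessian is 0.25.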

getImprovementTolerance()[source]
Returns:

Tolerance to consider improvement in metric

Return type:

improvementTolerance

getInitScoreCol()[source]
Returns:

The name of the initial score column, used for continued training

Return type:

initScoreCol

getIsEnableSparse()[source]
Returns:

Used to enable/disable sparse optimization

Return type:

isEnableSparse

getIsProvideTrainingMetric()[source]
Returns:

Whether to output metric results over the training dataset.

Return type:

isProvideTrainingMetric

getIsUnbalance()[source]
Returns:

Set to true if training data is unbalanced in binary classification scenario

Return type:

isUnbalance

static getJavaPackage()[source]

Returns package name String.

getLabelCol()[source]
Returns:

label column name

Return type:

labelCol

getLambdaL1()[source]
Returns:

L1 regularization

Return type:

lambdaL1

getLambdaL2()[source]
Returns:

L2 regularization

Return type:

lambdaL2

getLeafPredictionCol()[source]
Returns:

Column name for the predicted leaf indices

Return type:

leafPredictionCol

getLearningRate()[source]
Returns:

Learning rate or shrinkage rate

Return type:

learningRate

getMatrixType()[source]
Returns:

Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

Return type:

matrixType

getMaxBin()[source]
Returns:

Max bin

Return type:

maxBin

getMaxBinByFeature()[source]
Returns:

Max number of bins for each feature

Return type:

maxBinByFeature

getMaxCatThreshold()[source]
Returns:

limit number of split points considered for categorical features

Return type:

maxCatThreshold

getMaxCatToOnehot()[source]
Returns:

When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

Return type:

maxCatToOnehot

getMaxDeltaStep()[source]
Returns:

Used to limit the max output of tree leaves

Return type:

maxDeltaStep

getMaxDepth()[source]
Returns:

Max depth

Return type:

maxDepth

getMaxDrop()[source]
Returns:

Max number of dropped trees during one boosting iteration

Return type:

maxDrop

getMaxNumClasses()[source]
Returns:

Maximum number of classes considered when inferring numClass in multi-class classification.

Return type:

maxNumClasses

getMaxStreamingOMPThreads()[source]
Returns:

Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

Return type:

maxStreamingOMPThreads

getMetric()[source]
Returns:

Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

Return type:

metric

getMicroBatchSize()[source]
Returns:

Specify how many elements are sent in a streaming micro-batch.

Return type:

microBatchSize

getMinDataInLeaf()[source]
Returns:

Minimal number of data in one leaf. Can be used to deal with over-fitting.

Return type:

minDataInLeaf

getMinDataPerBin()[source]
Returns:

Minimal number of data inside one bin

Return type:

minDataPerBin

getMinDataPerGroup()[source]
Returns:

minimal number of data per categorical group

Return type:

minDataPerGroup

getMinGainToSplit()[source]
Returns:

The minimal gain to perform split

Return type:

minGainToSplit

getMinSumHessianInLeaf()[source]
Returns:

Minimal sum hessian in one leaf

Return type:

minSumHessianInLeaf

getModelString()[source]
Returns:

LightGBM model to retrain

Return type:

modelString

getMonotoneConstraints()[source]
Returns:

used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.

Return type:

monotoneConstraints
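
Because monotoneConstraints must list every feature in order, it can help to derive the list from feature names rather than write it by hand. The sketch below is one way to do that; the helper and the feature names are hypothetical and not part of the SynapseML API.

```python
def build_monotone_constraints(feature_names, directions):
    """Return one constraint per feature, in feature order.

    directions maps a feature name to 1 (increasing) or -1 (decreasing);
    features that are not listed get 0 (no constraint).
    """
    constraints = []
    for name in feature_names:
        value = directions.get(name, 0)
        if value not in (-1, 0, 1):
            raise ValueError(f"invalid constraint {value} for {name}")
        constraints.append(value)
    return constraints


# Hypothetical feature names, in the order they appear in the features column.
features = ["age", "income", "num_defaults"]
constraints = build_monotone_constraints(
    features, {"income": 1, "num_defaults": -1}
)
```

The resulting list, `[0, 1, -1]`, has the shape expected by the monotoneConstraints parameter.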

getMonotoneConstraintsMethod()[source]
Returns:

Monotone constraints method. basic, intermediate, or advanced.

Return type:

monotoneConstraintsMethod

getMonotonePenalty()[source]
Returns:

A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

Return type:

monotonePenalty

getNegBaggingFraction()[source]
Returns:

Negative Bagging fraction

Return type:

negBaggingFraction

getNumBatches()[source]
Returns:

If greater than 0, splits data into separate batches during training

Return type:

numBatches

getNumIterations()[source]
Returns:

Number of iterations, LightGBM constructs num_class * num_iterations trees

Return type:

numIterations
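
The tree count stated above is simple arithmetic. The sketch below just makes it explicit, assuming training runs for the full number of iterations with no early stopping; the function name is illustrative.

```python
def total_trees(num_iterations, num_class):
    """Trees built over a full run: num_class * num_iterations."""
    return num_class * num_iterations


# A three-class problem trained with the default numIterations=100.
trees_multiclass = total_trees(num_iterations=100, num_class=3)
```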

getNumLeaves()[source]
Returns:

Number of leaves

Return type:

numLeaves

getNumTasks()[source]
Returns:

Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

Return type:

numTasks

getNumThreads()[source]
Returns:

Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

Return type:

numThreads

getObjective()[source]
Returns:

The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

Return type:

objective

getObjectiveSeed()[source]
Returns:

Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

Return type:

objectiveSeed

getOtherRate()[source]
Returns:

The retain ratio of small gradient data. Only used in goss.

Return type:

otherRate

getParallelism()[source]
Returns:

Tree learner parallelism, can be set to data_parallel or voting_parallel

Return type:

parallelism

getPassThroughArgs()[source]
Returns:

Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

Return type:

passThroughArgs
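
The sketch below assembles a passThroughArgs string from a dictionary. It assumes that multiple parameters are separated by spaces and that booleans are written as `true`/`false`, as in the `force_row_wise=true` example above; the separator is an assumption, since this page does not state it. The helper name is hypothetical.

```python
def format_pass_through_args(params):
    """Join native LightGBM parameters into one key=value string."""
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Match the lowercase boolean style used in the example above.
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    # Assumption: parameters are separated by single spaces.
    return " ".join(parts)


args = format_pass_through_args({"force_row_wise": True, "max_bin": 63})
```

Since values passed this way override explicit setters, it is probably safest to keep the string limited to parameters that have no dedicated setter.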

getPosBaggingFraction()[source]
Returns:

Positive Bagging fraction

Return type:

posBaggingFraction

getPredictDisableShapeCheck()[source]
Returns:

control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

Return type:

predictDisableShapeCheck

getPredictionCol()[source]
Returns:

prediction column name

Return type:

predictionCol

getProbabilityCol()[source]
Returns:

Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities

Return type:

probabilityCol

getRawPredictionCol()[source]
Returns:

raw prediction (a.k.a. confidence) column name

Return type:

rawPredictionCol

getReferenceDataset()[source]
Returns:

The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

Return type:

referenceDataset

getRepartitionByGroupingColumn()[source]
Returns:

Repartition training data according to grouping column, on by default.

Return type:

repartitionByGroupingColumn

getSamplingMode()[source]
Returns:

Data sampling for streaming mode. Sampled data is used to define bins. Values can be global (sample from all data), subset (sample from the first N rows), or fixed (take the first N rows as the sample). Default is subset.

Return type:

samplingMode

getSamplingSubsetSize()[source]
Returns:

Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

Return type:

samplingSubsetSize
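
To make the interaction between samplingSubsetSize and binSampleCount concrete, here is a small pure-Python sketch of the subset mode as described above: restrict to the first N rows, then draw the bin sample from them. It illustrates the description only and is not SynapseML's actual sampling code; the function and argument names are hypothetical.

```python
import random


def sample_for_bins(rows, sampling_subset_size, bin_sample_count, seed=1):
    """Draw up to bin_sample_count rows from the first N rows."""
    candidates = rows[:sampling_subset_size]
    count = min(bin_sample_count, len(candidates))
    # A seeded generator keeps the sketch reproducible.
    return random.Random(seed).sample(candidates, count)


rows = list(range(1000))
sample = sample_for_bins(rows, sampling_subset_size=100, bin_sample_count=20)
```

Every sampled row comes from the first 100 rows, which is why this mode suits data whose row order is already random.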

getSeed()[source]
Returns:

Main seed, used to generate other seeds

Return type:

seed

getSkipDrop()[source]
Returns:

Probability of skipping the dropout procedure during a boosting iteration

Return type:

skipDrop
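
As a rough illustration of how skipDrop, dropRate, and maxDrop interact in dart mode, the sketch below picks the trees to drop for one boosting iteration. It is a simplified uniform-drop model written for this page; the function name select_dropped_trees is hypothetical, and treating maxDrop as a hard cap is an assumption, so LightGBM's own implementation may differ in detail.

```python
import random


def select_dropped_trees(num_trees, drop_rate=0.1, skip_drop=0.5,
                         max_drop=50, rng=None):
    """Return indexes of trees dropped in one iteration (uniform-drop sketch)."""
    rng = rng or random.Random(4)
    # With probability skip_drop the dropout step is skipped entirely.
    if num_trees == 0 or rng.random() < skip_drop:
        return []
    # Each existing tree is dropped independently with probability drop_rate.
    dropped = [i for i in range(num_trees) if rng.random() < drop_rate]
    # Assumption: max_drop is applied as a hard cap on the dropped set.
    if max_drop > 0 and len(dropped) > max_drop:
        dropped = sorted(rng.sample(dropped, max_drop))
    return dropped
```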

getSlotNames()[source]
Returns:

List of slot names in the features column

Return type:

slotNames

getThresholds()[source]
Returns:

Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value of p/t is predicted, where p is the original probability of that class and t is the class’s threshold.

Return type:

thresholds
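
The p/t rule can be sketched in a few lines of plain Python. The function name predict_with_thresholds is hypothetical, and treating a zero threshold as an infinite score is an assumption about how the "at most one value may be 0" case is resolved.

```python
import math


def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p/t."""
    if len(probabilities) != len(thresholds):
        raise ValueError("thresholds must have one entry per class")
    if any(t < 0 for t in thresholds) or sum(t == 0 for t in thresholds) > 1:
        raise ValueError("thresholds must be > 0, with at most one equal to 0")
    # Assumption: a zero threshold makes its class win (p/t treated as +inf).
    scaled = [math.inf if t == 0 else p / t
              for p, t in zip(probabilities, thresholds)]
    return max(range(len(scaled)), key=scaled.__getitem__)
```

With probabilities [0.6, 0.4] and thresholds [0.9, 0.3], the scaled values are roughly 0.67 and 1.33, so class 1 is predicted even though class 0 has the higher raw probability.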

getTimeout()[source]
Returns:

Timeout in seconds

Return type:

timeout

getTopK()[source]
Returns:

The top_k value used in voting parallel mode. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0.

Return type:

topK

getTopRate()[source]
Returns:

The retain ratio of large gradient data. Only used in goss.

Return type:

topRate
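
To show how topRate and otherRate work together in goss, here is a minimal sketch of one sampling pass following the published description of Gradient-based One-Side Sampling. It is not SynapseML or LightGBM source code, and the function name goss_sample is hypothetical.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by one GOSS pass."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(top_rate * n)
    kept = {i: 1.0 for i in order[:top_n]}      # large-gradient rows are always kept
    rest = order[top_n:]
    rand_n = min(int(other_rate * n), len(rest))
    if rand_n > 0:
        # Sampled small-gradient rows are up-weighted to compensate for the rest.
        amplify = (1.0 - top_rate) / other_rate
        for i in random.Random(seed).sample(rest, rand_n):
            kept[i] = amplify
    return kept
```

With 100 rows and the default rates, 20 large-gradient rows are kept at weight 1.0 and 10 of the remaining rows are kept at weight (1 - 0.2) / 0.1 = 8.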

getUniformDrop()[source]
Returns:

Set this to true to use uniform drop in dart mode

Return type:

uniformDrop

getUseBarrierExecutionMode()[source]
Returns:

Barrier execution mode which uses a barrier stage, off by default.

Return type:

useBarrierExecutionMode

getUseMissing()[source]
Returns:

Set this to false to disable the special handling of missing values

Return type:

useMissing

getUseSingleDatasetMode()[source]
Returns:

Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

Return type:

useSingleDatasetMode

getValidationIndicatorCol()[source]
Returns:

Indicates whether the row is for training or validation

Return type:

validationIndicatorCol

getVerbosity()[source]
Returns:

Verbosity level: a value below 0 is Fatal, 0 is Error, 1 is Info, and a value above 1 is Debug

Return type:

verbosity
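
The mapping from the integer to a log level can be written out directly. The helper name verbosity_label is hypothetical; the thresholds are the ones stated above.

```python
def verbosity_label(verbosity):
    """Map a verbosity integer to the log level named in the description."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"
```

The default of -1 shown in the constructor signature therefore corresponds to Fatal.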

getWeightCol()[source]
Returns:

The name of the weight column

Return type:

weightCol

getXGBoostDartMode()[source]
Returns:

Set this to true to use xgboost dart mode

Return type:

xGBoostDartMode

getZeroAsMissing()[source]
Returns:

Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use NA to represent missing values

Return type:

zeroAsMissing
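
The interaction between useMissing and zeroAsMissing can be summarized as a small predicate. This is a reading of the two descriptions rather than library code; the function name is hypothetical, and the choice to keep treating NaN as missing when zeroAsMissing is true is an assumption.

```python
import math


def is_treated_as_missing(value, use_missing=True, zero_as_missing=False):
    """Decide whether a single feature value counts as missing."""
    if not use_missing:
        return False            # special handling of missing values is disabled
    if math.isnan(value):
        return True             # assumption: NaN counts as missing in both modes
    return zero_as_missing and value == 0.0
```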

improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
isEnableSparse = Param(parent='undefined', name='isEnableSparse', doc='Used to enable/disable sparse optimization')
isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether output metric result over training dataset.')
isUnbalance = Param(parent='undefined', name='isUnbalance', doc='Set to true if training data is unbalanced in binary classification scenario')
labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices's column name")
learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense.  Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
maxCatThreshold = Param(parent='undefined', name='maxCatThreshold', doc='limit number of split points considered for categorical features')
maxCatToOnehot = Param(parent='undefined', name='maxCatToOnehot', doc='when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used')
maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
maxNumClasses = Param(parent='undefined', name='maxNumClasses', doc='Number of max classes to infer numClass in multi-class classification.')
maxStreamingOMPThreads = Param(parent='undefined', name='maxStreamingOMPThreads', doc="Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it's best to set a fixed value.")
metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data.  Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
microBatchSize = Param(parent='undefined', name='microBatchSize', doc='Specify how many elements are sent in a streaming micro-batch.')
minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
minDataPerBin = Param(parent='undefined', name='minDataPerBin', doc='Minimal number of data inside one bin')
minDataPerGroup = Param(parent='undefined', name='minDataPerGroup', doc='minimal number of data per categorical group')
minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
monotoneConstraints = Param(parent='undefined', name='monotoneConstraints', doc='used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.')
monotoneConstraintsMethod = Param(parent='undefined', name='monotoneConstraintsMethod', doc='Monotone constraints method. basic, intermediate, or advanced.')
monotonePenalty = Param(parent='undefined', name='monotonePenalty', doc='A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.')
negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks.  SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.')
objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
objectiveSeed = Param(parent='undefined', name='objectiveSeed', doc='Random seed for objectives, if random process is needed.  Currently used only for rank_xendcg objective.')
otherRate = Param(parent='undefined', name='otherRate', doc='The retain ratio of small gradient data. Only used in goss.')
parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
passThroughArgs = Param(parent='undefined', name='passThroughArgs', doc='Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true')
posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
predictDisableShapeCheck = Param(parent='undefined', name='predictDisableShapeCheck', doc='control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data')
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities')
rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')
classmethod read()[source]

Returns an MLReader instance for this class.

referenceDataset = Param(parent='undefined', name='referenceDataset', doc='The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().')
repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
samplingMode = Param(parent='undefined', name='samplingMode', doc="Data sampling for streaming mode. Sampled data is used to define bins. 'global': sample from all data, 'subset': sample from first N rows, or 'fixed': Take first N rows as sample. Values can be global, subset, or fixed. Default is subset.")
samplingSubsetSize = Param(parent='undefined', name='samplingSubsetSize', doc="Specify subset size N for the sampling mode 'subset'. 'binSampleCount' rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.")
seed = Param(parent='undefined', name='seed', doc='Main seed, used to generate other seeds')
setBaggingFraction(value)[source]
Parameters:

baggingFraction – Bagging fraction

setBaggingFreq(value)[source]
Parameters:

baggingFreq – Bagging frequency

setBaggingSeed(value)[source]
Parameters:

baggingSeed – Bagging seed

setBinSampleCount(value)[source]
Parameters:

binSampleCount – Number of samples considered when computing histogram bins

setBoostFromAverage(value)[source]
Parameters:

boostFromAverage – Adjusts initial score to the mean of labels for faster convergence

setBoostingType(value)[source]
Parameters:

boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

setCatSmooth(value)[source]
Parameters:

catSmooth – Reduces the effect of noise in categorical features, especially for categories with few data points

setCategoricalSlotIndexes(value)[source]
Parameters:

categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column

setCategoricalSlotNames(value)[source]
Parameters:

categoricalSlotNames – List of categorical column slot names, the slot name in the features column

setCatl2(value)[source]
Parameters:

catl2 – L2 regularization in categorical split

setChunkSize(value)[source]
Parameters:

chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

setDataRandomSeed(value)[source]
Parameters:

dataRandomSeed – Random seed for sampling data to construct histogram bins.

setDataTransferMode(value)[source]
Parameters:

dataTransferMode – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. bulk is the legacy mode; the constructor signature lists streaming as the default.

setDefaultListenPort(value)[source]
Parameters:

defaultListenPort – The default listen port on executors, used for testing

setDeterministic(value)[source]
Parameters:

deterministic – Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true

setDriverListenPort(value)[source]
Parameters:

driverListenPort – The listen port on a driver. Default value is 0 (random)

setDropRate(value)[source]
Parameters:

dropRate – Dropout rate: a fraction of previous trees to drop during the dropout

setDropSeed(value)[source]
Parameters:

dropSeed – Random seed to choose dropping models. Only used in dart.

setEarlyStoppingRound(value)[source]
Parameters:

earlyStoppingRound – Early stopping round

setExecutionMode(value)[source]
Parameters:

executionMode – Deprecated. Please use dataTransferMode.

setExtraSeed(value)[source]
Parameters:

extraSeed – Random seed for selecting threshold when extra_trees is true

setFeatureFraction(value)[source]
Parameters:

featureFraction – Feature fraction

setFeatureFractionByNode(value)[source]
Parameters:

featureFractionByNode – Feature fraction by node

setFeatureFractionSeed(value)[source]
Parameters:

featureFractionSeed – Feature fraction seed

setFeaturesCol(value)[source]
Parameters:

featuresCol – features column name

setFeaturesShapCol(value)[source]
Parameters:

featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values

setFobj(value)[source]
Parameters:

fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
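
A custom objective with the (preds, train_data) -> (grad, hess) shape described above might look like the following logistic-loss sketch. The ToyDataset class is a hypothetical stand-in for whatever object is actually passed as train_data, and the get_label() accessor is an assumption; check the real object's interface before relying on it.

```python
import math


class ToyDataset:
    """Hypothetical stand-in for train_data; only get_label() is assumed."""

    def __init__(self, labels):
        self._labels = list(labels)

    def get_label(self):
        return self._labels


def logistic_objective(preds, train_data):
    """Return per-row gradient and hessian of the binary log loss."""
    labels = train_data.get_label()
    probs = [1.0 / (1.0 + math.exp(-p)) for p in preds]  # sigmoid of raw scores
    grad = [p - y for p, y in zip(probs, labels)]
    hess = [p * (1.0 - p) for p in probs]
    return grad, hess
```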

setImprovementTolerance(value)[source]
Parameters:

improvementTolerance – Tolerance to consider improvement in metric

setInitScoreCol(value)[source]
Parameters:

initScoreCol – The name of the initial score column, used for continued training

setIsEnableSparse(value)[source]
Parameters:

isEnableSparse – Used to enable/disable sparse optimization

setIsProvideTrainingMetric(value)[source]
Parameters:

isProvideTrainingMetric – Whether to output metric results over the training dataset.

setIsUnbalance(value)[source]
Parameters:

isUnbalance – Set to true if training data is unbalanced in binary classification scenario

setLabelCol(value)[source]
Parameters:

labelCol – label column name

setLambdaL1(value)[source]
Parameters:

lambdaL1 – L1 regularization

setLambdaL2(value)[source]
Parameters:

lambdaL2 – L2 regularization

setLeafPredictionCol(value)[source]
Parameters:

leafPredictionCol – Column name for the predicted leaf indices

setLearningRate(value)[source]
Parameters:

learningRate – Learning rate or shrinkage rate

setMatrixType(value)[source]
Parameters:

matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

setMaxBin(value)[source]
Parameters:

maxBin – Max bin

setMaxBinByFeature(value)[source]
Parameters:

maxBinByFeature – Max number of bins for each feature

setMaxCatThreshold(value)[source]
Parameters:

maxCatThreshold – limit number of split points considered for categorical features

setMaxCatToOnehot(value)[source]
Parameters:

maxCatToOnehot – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

setMaxDeltaStep(value)[source]
Parameters:

maxDeltaStep – Used to limit the max output of tree leaves

setMaxDepth(value)[source]
Parameters:

maxDepth – Max depth

setMaxDrop(value)[source]
Parameters:

maxDrop – Max number of dropped trees during one boosting iteration

setMaxNumClasses(value)[source]
Parameters:

maxNumClasses – Number of max classes to infer numClass in multi-class classification.

setMaxStreamingOMPThreads(value)[source]
Parameters:

maxStreamingOMPThreads – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

setMetric(value)[source]
Parameters:

metric – Metrics to be evaluated on the evaluation data. Options are:

  • empty string or not specified: the metric corresponding to the specified objective is used (this is possible only for pre-defined objective functions; otherwise no evaluation metric is added)
  • None (string, not a None value): no metric is registered; aliases: na, null, custom
  • l1: absolute loss; aliases: mean_absolute_error, mae, regression_l1
  • l2: square loss; aliases: mean_squared_error, mse, regression_l2, regression
  • rmse: root square loss; aliases: root_mean_squared_error, l2_root
  • quantile: Quantile regression
  • mape: MAPE loss; aliases: mean_absolute_percentage_error
  • huber: Huber loss
  • fair: Fair loss
  • poisson: negative log-likelihood for Poisson regression
  • gamma: negative log-likelihood for Gamma regression
  • gamma_deviance: residual deviance for Gamma regression
  • tweedie: negative log-likelihood for Tweedie regression
  • ndcg: NDCG; aliases: lambdarank
  • map: MAP; aliases: mean_average_precision
  • auc: AUC
  • binary_logloss: log loss; aliases: binary
  • binary_error: for one sample, 0 for correct classification, 1 for error classification
  • multi_logloss: log loss for multi-class classification; aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr
  • multi_error: error rate for multi-class classification
  • cross_entropy: cross-entropy (with optional linear weights); aliases: xentropy
  • cross_entropy_lambda: intensity-weighted cross-entropy; aliases: xentlambda
  • kullback_leibler: Kullback-Leibler divergence; aliases: kldiv

setMicroBatchSize(value)[source]
Parameters:

microBatchSize – Specify how many elements are sent in a streaming micro-batch.

setMinDataInLeaf(value)[source]
Parameters:

minDataInLeaf – Minimal number of data in one leaf. Can be used to deal with over-fitting.

setMinDataPerBin(value)[source]
Parameters:

minDataPerBin – Minimal number of data inside one bin

setMinDataPerGroup(value)[source]
Parameters:

minDataPerGroup – minimal number of data per categorical group

setMinGainToSplit(value)[source]
Parameters:

minGainToSplit – The minimal gain to perform split

setMinSumHessianInLeaf(value)[source]
Parameters:

minSumHessianInLeaf – Minimal sum hessian in one leaf

setModelString(value)[source]
Parameters:

modelString – LightGBM model to retrain

setMonotoneConstraints(value)[source]
Parameters:

monotoneConstraints – Monotonic constraints on features: 1 means increasing, -1 means decreasing, 0 means no constraint. Specify all features in order.
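
Because the constraint list has to cover every feature in slot order, it can be convenient to build it from a name-to-direction mapping. The helper below is a hypothetical convenience sketch, not part of the SynapseML API; the feature names are made up for illustration.

```python
def build_monotone_constraints(slot_names, directions):
    """Expand {feature_name: -1|0|1} into a full list ordered like slot_names."""
    unknown = set(directions) - set(slot_names)
    if unknown:
        raise ValueError(f"unknown feature names: {sorted(unknown)}")
    for name, value in directions.items():
        if value not in (-1, 0, 1):
            raise ValueError(f"constraint for {name!r} must be -1, 0, or 1")
    # Features that are not mentioned get 0 (no constraint).
    return [directions.get(name, 0) for name in slot_names]
```

For slot names ["age", "income", "zip"] and directions {"income": 1, "age": -1}, the result is [-1, 1, 0], which is the kind of list setMonotoneConstraints expects.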

setMonotoneConstraintsMethod(value)[source]
Parameters:

monotoneConstraintsMethod – Monotone constraints method. basic, intermediate, or advanced.

setMonotonePenalty(value)[source]
Parameters:

monotonePenalty – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

setNegBaggingFraction(value)[source]
Parameters:

negBaggingFraction – Negative Bagging fraction

setNumBatches(value)[source]
Parameters:

numBatches – If greater than 0, splits data into separate batches during training

setNumIterations(value)[source]
Parameters:

numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
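
The tree count stated above is simple arithmetic. The sketch assumes that a binary objective counts as num_class = 1, which is an assumption about how LightGBM counts classes; the function name is hypothetical.

```python
def total_trees(num_iterations, num_class=1):
    """Number of trees built: num_class * num_iterations."""
    return num_class * num_iterations
```

For a 3-class problem with the default numIterations=100, this gives 300 trees.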

setNumLeaves(value)[source]
Parameters:

numLeaves – Number of leaves

setNumTasks(value)[source]
Parameters:

numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

setNumThreads(value)[source]
Parameters:

numThreads – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

setObjective(value)[source]
Parameters:

objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

setObjectiveSeed(value)[source]
Parameters:

objectiveSeed – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

setOtherRate(value)[source]
Parameters:

otherRate – The retain ratio of small gradient data. Only used in goss.

setParallelism(value)[source]
Parameters:

parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel

setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Set the (keyword only) parameters
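
A minimal sketch of passing keyword parameters, assuming SynapseML is installed and a Spark session is active. The keyword names come from the signature above; the values are illustrative, and the helper build_classifier is hypothetical. The import is deferred into the function so the parameter dictionary can be prepared without Spark.

```python
# Keyword names match the setParams signature; values are only examples.
classifier_params = {
    "objective": "binary",
    "numLeaves": 31,
    "learningRate": 0.1,
    "numIterations": 100,
    "isUnbalance": True,
    "featuresCol": "features",
    "labelCol": "label",
}


def build_classifier(params):
    """Construct a LightGBMClassifier; requires Spark with SynapseML available."""
    from synapse.ml.lightgbm import LightGBMClassifier  # deferred import
    return LightGBMClassifier(**params)
```

Given a training DataFrame (called train_df here as a placeholder), build_classifier(classifier_params).fit(train_df) would train a model. The same dictionary can be applied to an existing estimator with setParams(**classifier_params).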

setPassThroughArgs(value)[source]
Parameters:

passThroughArgs – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
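
A pass-through string with several parameters can be assembled from a dictionary. The sketch assumes that parameters are written as key=value pairs separated by spaces, which is not stated in the description above, and the helper name is hypothetical.

```python
def build_pass_through_args(overrides):
    """Join native LightGBM parameters into a single pass-through string."""
    parts = []
    for key, value in overrides.items():
        if isinstance(value, bool):
            value = "true" if value else "false"   # native-style booleans
        parts.append(f"{key}={value}")
    # Assumption: pairs are separated by a single space.
    return " ".join(parts)
```

For example, build_pass_through_args({"force_row_wise": True, "deterministic": True}) yields "force_row_wise=true deterministic=true". Because pass-through values override explicit setters, it is worth keeping this dictionary small.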

setPosBaggingFraction(value)[source]
Parameters:

posBaggingFraction – Positive Bagging fraction

setPredictDisableShapeCheck(value)[source]
Parameters:

predictDisableShapeCheck – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

setPredictionCol(value)[source]
Parameters:

predictionCol – prediction column name

setProbabilityCol(value)[source]
Parameters:

probabilityCol – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities

setRawPredictionCol(value)[source]
Parameters:

rawPredictionCol – raw prediction (a.k.a. confidence) column name

setReferenceDataset(value)[source]
Parameters:

referenceDataset – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

setRepartitionByGroupingColumn(value)[source]
Parameters:

repartitionByGroupingColumn – Repartition training data according to grouping column, on by default.

setSamplingMode(value)[source]
Parameters:

samplingMode – Data sampling for streaming mode. Sampled data is used to define bins. Values can be ‘global’ (sample from all data), ‘subset’ (sample from the first N rows), or ‘fixed’ (take the first N rows as the sample). Default is subset.

setSamplingSubsetSize(value)[source]
Parameters:

samplingSubsetSize – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

setSeed(value)[source]
Parameters:

seed – Main seed, used to generate other seeds

setSkipDrop(value)[source]
Parameters:

skipDrop – Probability of skipping the dropout procedure during a boosting iteration

setSlotNames(value)[source]
Parameters:

slotNames – List of slot names in the features column

setThresholds(value)[source]
Parameters:

thresholds – Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value of p/t is predicted, where p is the original probability of that class and t is the class’s threshold.

setTimeout(value)[source]
Parameters:

timeout – Timeout in seconds

setTopK(value)[source]
Parameters:

topK – The top_k value used in voting parallel mode. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0.

setTopRate(value)[source]
Parameters:

topRate – The retain ratio of large gradient data. Only used in goss.

setUniformDrop(value)[source]
Parameters:

uniformDrop – Set this to true to use uniform drop in dart mode

setUseBarrierExecutionMode(value)[source]
Parameters:

useBarrierExecutionMode – Barrier execution mode which uses a barrier stage, off by default.

setUseMissing(value)[source]
Parameters:

useMissing – Set this to false to disable the special handling of missing values

setUseSingleDatasetMode(value)[source]
Parameters:

useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

setValidationIndicatorCol(value)[source]
Parameters:

validationIndicatorCol – Indicates whether the row is for training or validation

setVerbosity(value)[source]
Parameters:

verbosity – Verbosity level: a value below 0 is Fatal, 0 is Error, 1 is Info, and a value above 1 is Debug

setWeightCol(value)[source]
Parameters:

weightCol – The name of the weight column

setXGBoostDartMode(value)[source]
Parameters:

xGBoostDartMode – Set this to true to use xgboost dart mode

setZeroAsMissing(value)[source]
Parameters:

zeroAsMissing – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use NA to represent missing values

skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold")
timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0')
topRate = Param(parent='undefined', name='topRate', doc='The retain ratio of large gradient data. Only used in goss.')
uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
useMissing = Param(parent='undefined', name='useMissing', doc='Set this to false to disable the special handle of missing value')
useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.')
validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
xGBoostDartMode = Param(parent='undefined', name='xGBoostDartMode', doc='Set this to true to use xgboost dart mode')
zeroAsMissing = Param(parent='undefined', name='zeroAsMissing', doc='Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values')

synapse.ml.lightgbm.LightGBMRanker module

class synapse.ml.lightgbm.LightGBMRanker.LightGBMRanker(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxPosition=20, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator

Parameters:
  • baggingFraction (float) – Bagging fraction

  • baggingFreq (int) – Bagging frequency

  • baggingSeed (int) – Bagging seed

  • binSampleCount (int) – Number of samples considered when computing histogram bins

  • boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence

  • boostingType (str) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

  • catSmooth (float) – Reduces the effect of noise in categorical features, especially for categories with few data points

  • categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column

  • categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column

  • catl2 (float) – L2 regularization in categorical split

  • chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

  • dataRandomSeed (int) – Random seed for sampling data to construct histogram bins.

  • dataTransferMode (str) – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The constructor default is streaming; bulk is the legacy mode.

  • defaultListenPort (int) – The default listen port on executors, used for testing

  • deterministic (bool) – Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true

  • driverListenPort (int) – The listen port on a driver. Default value is 0 (random)

  • dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout

  • dropSeed (int) – Random seed to choose dropping models. Only used in dart.

  • earlyStoppingRound (int) – Early stopping round

  • evalAt (list) – NDCG and MAP evaluation positions, separated by comma

  • executionMode (str) – Deprecated. Please use dataTransferMode.

  • extraSeed (int) – Random seed for selecting threshold when extra_trees is true

  • featureFraction (float) – Feature fraction

  • featureFractionByNode (float) – Feature fraction by node

  • featureFractionSeed (int) – Feature fraction seed

  • featuresCol (str) – features column name

  • featuresShapCol (str) – Output SHAP vector column name after prediction containing the feature contribution values

  • fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

  • groupCol (str) – The name of the group column

  • improvementTolerance (float) – Tolerance to consider improvement in metric

  • initScoreCol (str) – The name of the initial score column, used for continued training

  • isEnableSparse (bool) – Used to enable/disable sparse optimization

  • isProvideTrainingMetric (bool) – Whether to output the metric result over the training dataset.

  • labelCol (str) – label column name

  • labelGain (list) – graded relevance for each label in NDCG

  • lambdaL1 (float) – L1 regularization

  • lambdaL2 (float) – L2 regularization

  • leafPredictionCol (str) – Column name for the predicted leaf indices

  • learningRate (float) – Learning rate or shrinkage rate

  • matrixType (str) – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

  • maxBin (int) – Max bin

  • maxBinByFeature (list) – Max number of bins for each feature

  • maxCatThreshold (int) – limit number of split points considered for categorical features

  • maxCatToOnehot (int) – When the number of categories of a feature is smaller than or equal to this, the one-vs-other split algorithm is used

  • maxDeltaStep (float) – Used to limit the max output of tree leaves

  • maxDepth (int) – Max depth

  • maxDrop (int) – Max number of dropped trees during one boosting iteration

  • maxNumClasses (int) – Number of max classes to infer numClass in multi-class classification.

  • maxPosition (int) – Optimizes NDCG at this position

  • maxStreamingOMPThreads (int) – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

  • metric (str) – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

  • microBatchSize (int) – Specify how many elements are sent in a streaming micro-batch.

  • minDataInLeaf (int) – Minimal number of data in one leaf. Can be used to deal with over-fitting.

  • minDataPerBin (int) – Minimal number of data inside one bin

  • minDataPerGroup (int) – minimal number of data per categorical group

  • minGainToSplit (float) – The minimal gain to perform split

  • minSumHessianInLeaf (float) – Minimal sum hessian in one leaf

  • modelString (str) – LightGBM model to retrain

  • monotoneConstraints (list) – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.

  • monotoneConstraintsMethod (str) – Monotone constraints method. basic, intermediate, or advanced.

  • monotonePenalty (float) – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

  • negBaggingFraction (float) – Negative Bagging fraction

  • numBatches (int) – If greater than 0, splits data into separate batches during training

  • numIterations (int) – Number of iterations, LightGBM constructs num_class * num_iterations trees

  • numLeaves (int) – Number of leaves

  • numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

  • numThreads (int) – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

  • objective (str) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

  • objectiveSeed (int) – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

  • otherRate (float) – The retain ratio of small gradient data. Only used in goss.

  • parallelism (str) – Tree learner parallelism, can be set to data_parallel or voting_parallel

  • passThroughArgs (str) – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

  • posBaggingFraction (float) – Positive Bagging fraction

  • predictDisableShapeCheck (bool) – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

  • predictionCol (str) – prediction column name

  • referenceDataset (list) – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

  • repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.

  • samplingMode (str) – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

  • samplingSubsetSize (int) – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

  • seed (int) – Main seed, used to generate other seeds

  • skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration

  • slotNames (list) – List of slot names in the features column

  • timeout (float) – Timeout in seconds

  • topK (int) – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0

  • topRate (float) – The retain ratio of large gradient data. Only used in goss.

  • uniformDrop (bool) – Set this to true to use uniform drop in dart mode

  • useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.

  • useMissing (bool) – Set this to false to disable the special handling of missing values

  • useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

  • validationIndicatorCol (str) – Indicates whether the row is for training or validation

  • verbosity (int) – Verbosity: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug

  • weightCol (str) – The name of the weight column

  • xGBoostDartMode (bool) – Set this to true to use xgboost dart mode

  • zeroAsMissing (bool) – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use NA to represent missing values
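
As a rough illustration of how the constructor parameters above might be combined, the sketch below collects a few keyword arguments for a ranking job. The column name query_id and the helper make_ranker are hypothetical. The other values repeat defaults from the signature, except evalAt, which is narrowed to three positions. Building the estimator needs SynapseML and an active Spark session, so the import is deferred until the helper is called.

```python
# Hypothetical column names; replace them with the columns of your DataFrame.
RANKER_KWARGS = {
    "objective": "lambdarank",   # default shown in the signature
    "groupCol": "query_id",      # hypothetical group (query) column
    "labelCol": "label",
    "featuresCol": "features",
    "evalAt": [1, 3, 5],         # evaluate NDCG at these positions
    "numIterations": 100,
    "learningRate": 0.1,
    "numLeaves": 31,
}


def make_ranker(**overrides):
    """Build a LightGBMRanker from RANKER_KWARGS plus any overrides.

    Requires SynapseML and an active Spark session.
    """
    # Imported lazily so this module loads even where Spark is unavailable.
    from synapse.ml.lightgbm import LightGBMRanker

    return LightGBMRanker(**{**RANKER_KWARGS, **overrides})
```

With a Spark DataFrame train_df (hypothetical) that has these columns, a call such as make_ranker(numLeaves=63).fit(train_df) would train a model.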

baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
catSmooth = Param(parent='undefined', name='catSmooth', doc='this can reduce the effect of noises in categorical features, especially for categories with few data')
categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
catl2 = Param(parent='undefined', name='catl2', doc='L2 regularization in categorical split')
chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native.  If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.')
dataRandomSeed = Param(parent='undefined', name='dataRandomSeed', doc='Random seed for sampling data to construct histogram bins.')
dataTransferMode = Param(parent='undefined', name='dataTransferMode', doc='Specify how SynapseML transfers data from Spark to LightGBM.  Values can be streaming, bulk. Default is bulk, which is the legacy mode.')
defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
deterministic = Param(parent='undefined', name='deterministic', doc='Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters.  Note: setting this to true may slow down training.  To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true')
driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
dropSeed = Param(parent='undefined', name='dropSeed', doc='Random seed to choose dropping models. Only used in dart.')
earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
evalAt = Param(parent='undefined', name='evalAt', doc='NDCG and MAP evaluation positions, separated by comma')
executionMode = Param(parent='undefined', name='executionMode', doc='Deprecated. Please use dataTransferMode.')
extraSeed = Param(parent='undefined', name='extraSeed', doc='Random seed for selecting threshold when extra_trees is true')
featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
featureFractionByNode = Param(parent='undefined', name='featureFractionByNode', doc='Feature fraction by node')
featureFractionSeed = Param(parent='undefined', name='featureFractionSeed', doc='Feature fraction seed')
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
getBaggingFraction()[source]
Returns:

Bagging fraction

Return type:

baggingFraction

getBaggingFreq()[source]
Returns:

Bagging frequency

Return type:

baggingFreq

getBaggingSeed()[source]
Returns:

Bagging seed

Return type:

baggingSeed

getBinSampleCount()[source]
Returns:

Number of samples considered when computing histogram bins

Return type:

binSampleCount

getBoostFromAverage()[source]
Returns:

Adjusts initial score to the mean of labels for faster convergence

Return type:

boostFromAverage

getBoostingType()[source]
Returns:

Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

Return type:

boostingType

getCatSmooth()[source]
Returns:

Reduces the effect of noise in categorical features, especially for categories with few data points

Return type:

catSmooth

getCategoricalSlotIndexes()[source]
Returns:

List of categorical column indexes, the slot index in the features column

Return type:

categoricalSlotIndexes

getCategoricalSlotNames()[source]
Returns:

List of categorical column slot names, the slot name in the features column

Return type:

categoricalSlotNames

getCatl2()[source]
Returns:

L2 regularization in categorical split

Return type:

catl2

getChunkSize()[source]
Returns:

Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

Return type:

chunkSize

getDataRandomSeed()[source]
Returns:

Random seed for sampling data to construct histogram bins.

Return type:

dataRandomSeed

getDataTransferMode()[source]
Returns:

Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The constructor default is streaming; bulk is the legacy mode.

Return type:

dataTransferMode

getDefaultListenPort()[source]
Returns:

The default listen port on executors, used for testing

Return type:

defaultListenPort

getDeterministic()[source]
Returns:

Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true

Return type:

deterministic

getDriverListenPort()[source]
Returns:

The listen port on a driver. Default value is 0 (random)

Return type:

driverListenPort

getDropRate()[source]
Returns:

Dropout rate: a fraction of previous trees to drop during the dropout

Return type:

dropRate

getDropSeed()[source]
Returns:

Random seed to choose dropping models. Only used in dart.

Return type:

dropSeed

getEarlyStoppingRound()[source]
Returns:

Early stopping round

Return type:

earlyStoppingRound

getEvalAt()[source]
Returns:

NDCG and MAP evaluation positions, separated by comma

Return type:

evalAt
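
To make the evaluation positions concrete, here is a minimal pure-Python sketch of NDCG@k computed at each position in an evalAt-style list. It is not SynapseML's or LightGBM's implementation. The function names are hypothetical, the gain 2**label - 1 is an assumed default used when no labelGain-style list is passed, and returning 0.0 for a query with no relevant items is a choice made for this sketch.

```python
import math


def _gain(label, label_gain):
    # Assumed default gain of 2**label - 1 when no gain list is supplied.
    return label_gain[label] if label_gain else 2 ** label - 1


def dcg_at_k(labels, k, label_gain=None):
    """Discounted cumulative gain over the first k labels in ranked order."""
    return sum(
        _gain(label, label_gain) / math.log2(position + 2)
        for position, label in enumerate(labels[:k])
    )


def ndcg_at(labels_in_predicted_order, eval_at, label_gain=None):
    """Return {k: NDCG@k} for every position k in eval_at."""
    # The ideal ranking orders the items by decreasing gain.
    ideal = sorted(
        labels_in_predicted_order,
        key=lambda label: _gain(label, label_gain),
        reverse=True,
    )
    scores = {}
    for k in eval_at:
        best = dcg_at_k(ideal, k, label_gain)
        actual = dcg_at_k(labels_in_predicted_order, k, label_gain)
        scores[k] = actual / best if best > 0 else 0.0
    return scores
```

The input is the list of integer relevance labels for one query, sorted by the model's predicted score in descending order.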

getExecutionMode()[source]
Returns:

Deprecated. Please use dataTransferMode.

Return type:

executionMode

getExtraSeed()[source]
Returns:

Random seed for selecting threshold when extra_trees is true

Return type:

extraSeed

getFeatureFraction()[source]
Returns:

Feature fraction

Return type:

featureFraction

getFeatureFractionByNode()[source]
Returns:

Feature fraction by node

Return type:

featureFractionByNode

getFeatureFractionSeed()[source]
Returns:

Feature fraction seed

Return type:

featureFractionSeed

getFeaturesCol()[source]
Returns:

features column name

Return type:

featuresCol

getFeaturesShapCol()[source]
Returns:

Output SHAP vector column name after prediction containing the feature contribution values

Return type:

featuresShapCol

getFobj()[source]
Returns:

Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

Return type:

fobj

getGroupCol()[source]
Returns:

The name of the group column

Return type:

groupCol

getImprovementTolerance()[source]
Returns:

Tolerance to consider improvement in metric

Return type:

improvementTolerance
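
The interaction between earlyStoppingRound and improvementTolerance can be pictured with a generic patience-based loop. The sketch below is an illustration only and may differ from the actual SynapseML logic. It assumes a higher metric is better (as with NDCG), that a non-positive round count disables early stopping, and that an improvement must exceed the tolerance to count. The function name is hypothetical.

```python
def early_stopping_iteration(metric_history, early_stopping_round,
                             improvement_tolerance=0.0):
    """Return the iteration index at which training would stop, or None."""
    if early_stopping_round <= 0:  # assumption: 0 disables early stopping
        return None
    best = float("-inf")
    rounds_without_gain = 0
    for iteration, value in enumerate(metric_history):
        if value - best > improvement_tolerance:
            # Counts as an improvement: remember it and reset the counter.
            best = value
            rounds_without_gain = 0
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= early_stopping_round:
                return iteration
    return None
```

With a tolerance of 0.01, gains of a few thousandths per iteration do not reset the counter, so training stops sooner than it would with a tolerance of 0.0.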

getInitScoreCol()[source]
Returns:

The name of the initial score column, used for continued training

Return type:

initScoreCol

getIsEnableSparse()[source]
Returns:

Used to enable/disable sparse optimization

Return type:

isEnableSparse

getIsProvideTrainingMetric()[source]
Returns:

Whether to output the metric result over the training dataset.

Return type:

isProvideTrainingMetric

static getJavaPackage()[source]

Returns package name String.

getLabelCol()[source]
Returns:

label column name

Return type:

labelCol

getLabelGain()[source]
Returns:

graded relevance for each label in NDCG

Return type:

labelGain

getLambdaL1()[source]
Returns:

L1 regularization

Return type:

lambdaL1

getLambdaL2()[source]
Returns:

L2 regularization

Return type:

lambdaL2

getLeafPredictionCol()[source]
Returns:

Column name for the predicted leaf indices

Return type:

leafPredictionCol

getLearningRate()[source]
Returns:

Learning rate or shrinkage rate

Return type:

learningRate

getMatrixType()[source]
Returns:

Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

Return type:

matrixType

getMaxBin()[source]
Returns:

Max bin

Return type:

maxBin

getMaxBinByFeature()[source]
Returns:

Max number of bins for each feature

Return type:

maxBinByFeature

getMaxCatThreshold()[source]
Returns:

limit number of split points considered for categorical features

Return type:

maxCatThreshold

getMaxCatToOnehot()[source]
Returns:

When the number of categories of a feature is smaller than or equal to this, the one-vs-other split algorithm is used

Return type:

maxCatToOnehot

getMaxDeltaStep()[source]
Returns:

Used to limit the max output of tree leaves

Return type:

maxDeltaStep

getMaxDepth()[source]
Returns:

Max depth

Return type:

maxDepth

getMaxDrop()[source]
Returns:

Max number of dropped trees during one boosting iteration

Return type:

maxDrop

getMaxNumClasses()[source]
Returns:

Number of max classes to infer numClass in multi-class classification.

Return type:

maxNumClasses

getMaxPosition()[source]
Returns:

Optimizes NDCG at this position

Return type:

maxPosition

getMaxStreamingOMPThreads()[source]
Returns:

Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

Return type:

maxStreamingOMPThreads

getMetric()[source]
Returns:

Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

Return type:

metric

getMicroBatchSize()[source]
Returns:

Specify how many elements are sent in a streaming micro-batch.

Return type:

microBatchSize

getMinDataInLeaf()[source]
Returns:

Minimal number of data in one leaf. Can be used to deal with over-fitting.

Return type:

minDataInLeaf

getMinDataPerBin()[source]
Returns:

Minimal number of data inside one bin

Return type:

minDataPerBin

getMinDataPerGroup()[source]
Returns:

minimal number of data per categorical group

Return type:

minDataPerGroup

getMinGainToSplit()[source]
Returns:

The minimal gain to perform split

Return type:

minGainToSplit

getMinSumHessianInLeaf()[source]
Returns:

Minimal sum hessian in one leaf

Return type:

minSumHessianInLeaf

getModelString()[source]
Returns:

LightGBM model to retrain

Return type:

modelString

getMonotoneConstraints()[source]
Returns:

used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.

Return type:

monotoneConstraints
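
Because the constraint list must cover all features in order, it can help to derive it from the feature names instead of typing it by hand. The helper below is a hypothetical sketch, not part of SynapseML. It takes the slot names in feature-vector order and a mapping of feature name to direction, and fills in 0 for every unconstrained feature.

```python
def build_monotone_constraints(slot_names, directions):
    """Return one constraint per feature, in slot order.

    directions maps a feature name to 1 (increasing) or -1 (decreasing);
    features that are not listed get 0 (no constraint).
    """
    unknown = set(directions) - set(slot_names)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    for name, value in directions.items():
        if value not in (-1, 0, 1):
            raise ValueError(f"{name}: constraint must be -1, 0 or 1")
    return [directions.get(name, 0) for name in slot_names]
```

For example, with the hypothetical slots price, age and clicks, constraining price downwards and clicks upwards yields [-1, 0, 1], which could then be passed as monotoneConstraints.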

getMonotoneConstraintsMethod()[source]
Returns:

Monotone constraints method. basic, intermediate, or advanced.

Return type:

monotoneConstraintsMethod

getMonotonePenalty()[source]
Returns:

A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

Return type:

monotonePenalty

getNegBaggingFraction()[source]
Returns:

Negative Bagging fraction

Return type:

negBaggingFraction

getNumBatches()[source]
Returns:

If greater than 0, splits data into separate batches during training

Return type:

numBatches

getNumIterations()[source]
Returns:

Number of iterations, LightGBM constructs num_class * num_iterations trees

Return type:

numIterations

getNumLeaves()[source]
Returns:

Number of leaves

Return type:

numLeaves

getNumTasks()[source]
Returns:

Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

Return type:

numTasks

getNumThreads()[source]
Returns:

Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

Return type:

numThreads

getObjective()[source]
Returns:

The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

Return type:

objective

getObjectiveSeed()[source]
Returns:

Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

Return type:

objectiveSeed

getOtherRate()[source]
Returns:

The retain ratio of small gradient data. Only used in goss.

Return type:

otherRate

getParallelism()[source]
Returns:

Tree learner parallelism, can be set to data_parallel or voting_parallel

Return type:

parallelism

getPassThroughArgs()[source]
Returns:

Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

Return type:

passThroughArgs
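
Since several native parameters can share one pass-through string, a small helper can assemble it from a dictionary. The function below is a hypothetical sketch. It assumes that entries use the key=value form shown in the example above and are separated by single spaces, which should be checked against the SynapseML version in use.

```python
def build_pass_through_args(params):
    """Join native LightGBM parameters into one key=value string.

    Assumption: entries are separated by single spaces.
    """
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Write booleans in the lowercase form used in the doc example.
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)
```

The resulting string could be supplied as passThroughArgs. Keep in mind that, as stated above, values passed this way override the ones given with explicit setters.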

getPosBaggingFraction()[source]
Returns:

Positive Bagging fraction

Return type:

posBaggingFraction

getPredictDisableShapeCheck()[source]
Returns:

control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

Return type:

predictDisableShapeCheck

getPredictionCol()[source]
Returns:

prediction column name

Return type:

predictionCol

getReferenceDataset()[source]
Returns:

The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

Return type:

referenceDataset

getRepartitionByGroupingColumn()[source]
Returns:

Repartition training data according to grouping column, on by default.

Return type:

repartitionByGroupingColumn

getSamplingMode()[source]
Returns:

Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

Return type:

samplingMode

getSamplingSubsetSize()[source]
Returns:

Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

Return type:

samplingSubsetSize
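
The ‘subset’ description can be read as drawing binSampleCount rows at random from a window made of the first N rows. The sketch below illustrates that reading only and does not reproduce SynapseML's sampling code. The function name is hypothetical, and the defaults mirror binSampleCount, samplingSubsetSize and dataRandomSeed from the signature.

```python
import random


def sample_rows_for_binning(num_rows, bin_sample_count=200000,
                            sampling_subset_size=1000000, seed=1):
    """Pick row indices for bin construction from the first N rows."""
    # Only the first N rows (or fewer, for a small dataset) are candidates.
    window = min(num_rows, sampling_subset_size)
    count = min(bin_sample_count, window)
    rng = random.Random(seed)
    return sorted(rng.sample(range(window), count))
```

Rows beyond the first N never appear in the sample, which is why this mode suits data whose row order is already random.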

getSeed()[source]
Returns:

Main seed, used to generate other seeds

Return type:

seed

getSkipDrop()[source]
Returns:

Probability of skipping the dropout procedure during a boosting iteration

Return type:

skipDrop

getSlotNames()[source]
Returns:

List of slot names in the features column

Return type:

slotNames

getTimeout()[source]
Returns:

Timeout in seconds

Return type:

timeout

getTopK()[source]
Returns:

The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0

Return type:

topK

getTopRate()[source]
Returns:

The retain ratio of large gradient data. Only used in goss.

Return type:

topRate
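
The roles of topRate and otherRate in goss can be sketched as one sampling step of Gradient-based One-Side Sampling: keep the topRate fraction of rows with the largest absolute gradients, randomly keep an otherRate fraction of all rows from the remainder, and up-weight those by (1 - topRate) / otherRate. This is a simplified pure-Python illustration with a hypothetical function name, not LightGBM's implementation. It assumes otherRate > 0 and topRate + otherRate <= 1.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by one GOSS step."""
    n = len(gradients)
    top_n = int(top_rate * n)
    other_n = int(other_rate * n)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    kept = {i: 1.0 for i in order[:top_n]}
    # Randomly keep other_n of the small-gradient rows and up-weight them
    # to compensate for the rows that were dropped.
    rng = random.Random(seed)
    amplify = (1.0 - top_rate) / other_rate
    for i in rng.sample(order[top_n:], other_n):
        kept[i] = amplify
    return kept
```

With the default rates, 100 rows reduce to 30: the 20 with the largest gradients at weight 1, and 10 sampled rows at weight 8.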

getUniformDrop()[source]
Returns:

Set this to true to use uniform drop in dart mode

Return type:

uniformDrop

getUseBarrierExecutionMode()[source]
Returns:

Barrier execution mode which uses a barrier stage, off by default.

Return type:

useBarrierExecutionMode

getUseMissing()[source]
Returns:

Set this to false to disable the special handling of missing values

Return type:

useMissing

getUseSingleDatasetMode()[source]
Returns:

Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

Return type:

useSingleDatasetMode

getValidationIndicatorCol()[source]
Returns:

Indicates whether the row is for training or validation

Return type:

validationIndicatorCol

getVerbosity()[source]
Returns:

Verbosity: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug

Return type:

verbosity

getWeightCol()[source]
Returns:

The name of the weight column

Return type:

weightCol

getXGBoostDartMode()[source]
Returns:

Set this to true to use xgboost dart mode

Return type:

xGBoostDartMode

getZeroAsMissing()[source]
Returns:

Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use NA to represent missing values

Return type:

zeroAsMissing

groupCol = Param(parent='undefined', name='groupCol', doc='The name of the group column')
improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
isEnableSparse = Param(parent='undefined', name='isEnableSparse', doc='Used to enable/disable sparse optimization')
isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether output metric result over training dataset.')
labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
labelGain = Param(parent='undefined', name='labelGain', doc='graded relevance for each label in NDCG')
lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices's column name")
learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense.  Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
maxCatThreshold = Param(parent='undefined', name='maxCatThreshold', doc='limit number of split points considered for categorical features')
maxCatToOnehot = Param(parent='undefined', name='maxCatToOnehot', doc='when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used')
maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
maxNumClasses = Param(parent='undefined', name='maxNumClasses', doc='Number of max classes to infer numClass in multi-class classification.')
maxPosition = Param(parent='undefined', name='maxPosition', doc='optimized NDCG at this position')
maxStreamingOMPThreads = Param(parent='undefined', name='maxStreamingOMPThreads', doc="Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it's best to set a fixed value.")
metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data.  Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
microBatchSize = Param(parent='undefined', name='microBatchSize', doc='Specify how many elements are sent in a streaming micro-batch.')
minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
minDataPerBin = Param(parent='undefined', name='minDataPerBin', doc='Minimal number of data inside one bin')
minDataPerGroup = Param(parent='undefined', name='minDataPerGroup', doc='minimal number of data per categorical group')
minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
monotoneConstraints = Param(parent='undefined', name='monotoneConstraints', doc='used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.')
monotoneConstraintsMethod = Param(parent='undefined', name='monotoneConstraintsMethod', doc='Monotone constraints method. basic, intermediate, or advanced.')
monotonePenalty = Param(parent='undefined', name='monotonePenalty', doc='A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.')
negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks.  SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.')
objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
objectiveSeed = Param(parent='undefined', name='objectiveSeed', doc='Random seed for objectives, if random process is needed.  Currently used only for rank_xendcg objective.')
otherRate = Param(parent='undefined', name='otherRate', doc='The retain ratio of small gradient data. Only used in goss.')
parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
passThroughArgs = Param(parent='undefined', name='passThroughArgs', doc='Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true')
posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
predictDisableShapeCheck = Param(parent='undefined', name='predictDisableShapeCheck', doc='control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data')
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
classmethod read()[source]

Returns an MLReader instance for this class.

referenceDataset = Param(parent='undefined', name='referenceDataset', doc='The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().')
repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
samplingMode = Param(parent='undefined', name='samplingMode', doc="Data sampling for streaming mode. Sampled data is used to define bins. 'global': sample from all data, 'subset': sample from first N rows, or 'fixed': Take first N rows as sample.Values can be global, subset, or fixed. Default is subset.")
samplingSubsetSize = Param(parent='undefined', name='samplingSubsetSize', doc="Specify subset size N for the sampling mode 'subset'. 'binSampleCount' rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.")
seed = Param(parent='undefined', name='seed', doc='Main seed, used to generate other seeds')
setBaggingFraction(value)[source]
Parameters:

baggingFraction – Bagging fraction

setBaggingFreq(value)[source]
Parameters:

baggingFreq – Bagging frequency

setBaggingSeed(value)[source]
Parameters:

baggingSeed – Bagging seed

setBinSampleCount(value)[source]
Parameters:

binSampleCount – Number of samples considered at computing histogram bins

setBoostFromAverage(value)[source]
Parameters:

boostFromAverage – Adjusts initial score to the mean of labels for faster convergence

setBoostingType(value)[source]
Parameters:

boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

setCatSmooth(value)[source]
Parameters:

catSmooth – Reduces the effect of noise in categorical features, especially for categories with few data points

setCategoricalSlotIndexes(value)[source]
Parameters:

categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column

setCategoricalSlotNames(value)[source]
Parameters:

categoricalSlotNames – List of categorical column slot names, the slot name in the features column

setCatl2(value)[source]
Parameters:

catl2 – L2 regularization in categorical split

setChunkSize(value)[source]
Parameters:

chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.

setDataRandomSeed(value)[source]
Parameters:

dataRandomSeed – Random seed for sampling data to construct histogram bins.

setDataTransferMode(value)[source]
Parameters:

dataTransferMode – Specifies how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The default shown in the setParams signature is streaming; bulk is the legacy mode.

setDefaultListenPort(value)[source]
Parameters:

defaultListenPort – The default listen port on executors, used for testing

setDeterministic(value)[source]
Parameters:

deterministic – Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true

setDriverListenPort(value)[source]
Parameters:

driverListenPort – The listen port on a driver. Default value is 0 (random)

setDropRate(value)[source]
Parameters:

dropRate – Dropout rate: a fraction of previous trees to drop during the dropout

setDropSeed(value)[source]
Parameters:

dropSeed – Random seed to choose dropping models. Only used in dart.

setEarlyStoppingRound(value)[source]
Parameters:

earlyStoppingRound – Early stopping round

setEvalAt(value)[source]
Parameters:

evalAt – NDCG and MAP evaluation positions, separated by comma

setExecutionMode(value)[source]
Parameters:

executionMode – Deprecated. Please use dataTransferMode.

setExtraSeed(value)[source]
Parameters:

extraSeed – Random seed for selecting threshold when extra_trees is true

setFeatureFraction(value)[source]
Parameters:

featureFraction – Feature fraction

setFeatureFractionByNode(value)[source]
Parameters:

featureFractionByNode – Feature fraction by node

setFeatureFractionSeed(value)[source]
Parameters:

featureFractionSeed – Feature fraction seed

setFeaturesCol(value)[source]
Parameters:

featuresCol – features column name

setFeaturesShapCol(value)[source]
Parameters:

featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values

setFobj(value)[source]
Parameters:

fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

setGroupCol(value)[source]
Parameters:

groupCol – The name of the group column

setImprovementTolerance(value)[source]
Parameters:

improvementTolerance – Tolerance to consider improvement in metric

setInitScoreCol(value)[source]
Parameters:

initScoreCol – The name of the initial score column, used for continued training

setIsEnableSparse(value)[source]
Parameters:

isEnableSparse – Used to enable/disable sparse optimization

setIsProvideTrainingMetric(value)[source]
Parameters:

isProvideTrainingMetric – Whether to output the metric result over the training dataset.

setLabelCol(value)[source]
Parameters:

labelCol – label column name

setLabelGain(value)[source]
Parameters:

labelGain – graded relevance for each label in NDCG
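As a rough illustration of how label gains enter NDCG, the sketch below computes NDCG@k in plain Python. The helper names (dcg_at_k, ndcg_at_k) and the gain table [0, 1, 3, 7] (the conventional 2**label - 1 choice) are assumptions made for this example and are not part of the SynapseML API; the ranker computes the metric inside native LightGBM.

```python
import math


def dcg_at_k(labels, label_gain, k):
    """Discounted cumulative gain over the first k labels, in ranked order."""
    return sum(
        label_gain[label] / math.log2(position + 2)
        for position, label in enumerate(labels[:k])
    )


def ndcg_at_k(labels, label_gain, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    Sorting labels in descending order gives the ideal ranking only when
    the gain table is non-decreasing, which holds for the table below.
    """
    ideal = dcg_at_k(sorted(labels, reverse=True), label_gain, k)
    return dcg_at_k(labels, label_gain, k) / ideal if ideal > 0 else 0.0


# Hypothetical gain table: relevance label i maps to gain 2**i - 1.
label_gain = [0, 1, 3, 7]

# Relevance labels of one query's documents, in the order a model ranked them.
ranked_labels = [3, 2, 0, 1]

# Evaluate at several positions, in the spirit of evalAt.
for k in (1, 3, 4):
    print(k, round(ndcg_at_k(ranked_labels, label_gain, k), 4))
```

A perfectly ordered list scores 1.0 at every position; swapping a relevant document below an irrelevant one lowers the score at the positions that include the swap.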

setLambdaL1(value)[source]
Parameters:

lambdaL1 – L1 regularization

setLambdaL2(value)[source]
Parameters:

lambdaL2 – L2 regularization

setLeafPredictionCol(value)[source]
Parameters:

leafPredictionCol – Name of the column for predicted leaf indices

setLearningRate(value)[source]
Parameters:

learningRate – Learning rate or shrinkage rate

setMatrixType(value)[source]
Parameters:

matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

setMaxBin(value)[source]
Parameters:

maxBin – Max bin

setMaxBinByFeature(value)[source]
Parameters:

maxBinByFeature – Max number of bins for each feature

setMaxCatThreshold(value)[source]
Parameters:

maxCatThreshold – limit number of split points considered for categorical features

setMaxCatToOnehot(value)[source]
Parameters:

maxCatToOnehot – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

setMaxDeltaStep(value)[source]
Parameters:

maxDeltaStep – Used to limit the max output of tree leaves

setMaxDepth(value)[source]
Parameters:

maxDepth – Max depth

setMaxDrop(value)[source]
Parameters:

maxDrop – Max number of dropped trees during one boosting iteration

setMaxNumClasses(value)[source]
Parameters:

maxNumClasses – Number of max classes to infer numClass in multi-class classification.

setMaxPosition(value)[source]
Parameters:

maxPosition – optimized NDCG at this position

setMaxStreamingOMPThreads(value)[source]
Parameters:

maxStreamingOMPThreads – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

setMetric(value)[source]
Parameters:

metric – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

setMicroBatchSize(value)[source]
Parameters:

microBatchSize – Specify how many elements are sent in a streaming micro-batch.

setMinDataInLeaf(value)[source]
Parameters:

minDataInLeaf – Minimal number of data in one leaf. Can be used to deal with over-fitting.

setMinDataPerBin(value)[source]
Parameters:

minDataPerBin – Minimal number of data inside one bin

setMinDataPerGroup(value)[source]
Parameters:

minDataPerGroup – minimal number of data per categorical group

setMinGainToSplit(value)[source]
Parameters:

minGainToSplit – The minimal gain to perform split

setMinSumHessianInLeaf(value)[source]
Parameters:

minSumHessianInLeaf – Minimal sum hessian in one leaf

setModelString(value)[source]
Parameters:

modelString – LightGBM model to retrain

setMonotoneConstraints(value)[source]
Parameters:

monotoneConstraints – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
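Because the constraints must be listed for all features in order, it can help to derive the list from feature names rather than writing it by hand. The sketch below is one way to do that; build_monotone_constraints and the feature names are hypothetical and only illustrate the 1 / -1 / 0 encoding described above.

```python
def build_monotone_constraints(feature_names, directions):
    """Return one constraint per feature, in feature order.

    1 = increasing, -1 = decreasing, 0 = no constraint (used when a
    feature is not mentioned in `directions`).
    """
    unknown = set(directions) - set(feature_names)
    if unknown:
        raise ValueError(f"constraints given for unknown features: {sorted(unknown)}")
    constraints = [directions.get(name, 0) for name in feature_names]
    if not set(constraints) <= {-1, 0, 1}:
        raise ValueError("each constraint must be -1, 0, or 1")
    return constraints


# Hypothetical slot order of the assembled features column.
feature_names = ["price", "distance_km", "rating"]

constraints = build_monotone_constraints(feature_names, {"rating": 1, "price": -1})
print(constraints)  # [-1, 0, 1]
```

The resulting list is the kind of value one would pass to setMonotoneConstraints, provided the name order matches the slot order of the features column.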

setMonotoneConstraintsMethod(value)[source]
Parameters:

monotoneConstraintsMethod – Monotone constraints method. basic, intermediate, or advanced.

setMonotonePenalty(value)[source]
Parameters:

monotonePenalty – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

setNegBaggingFraction(value)[source]
Parameters:

negBaggingFraction – Negative Bagging fraction

setNumBatches(value)[source]
Parameters:

numBatches – If greater than 0, splits data into separate batches during training

setNumIterations(value)[source]
Parameters:

numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
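The tree count in the description is a simple product, sketched below. The helper total_trees is hypothetical, and treating num_class as 1 for single-output objectives (binary, regression, ranking) is an assumption about native LightGBM rather than something stated here.

```python
def total_trees(num_iterations, num_class=1):
    """Number of trees built: one tree per class per boosting iteration."""
    return num_class * num_iterations


# Default numIterations=100 with a single-output objective.
print(total_trees(100))               # 100
# The same setting with a hypothetical 5-class multiclass objective.
print(total_trees(100, num_class=5))  # 500
```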

setNumLeaves(value)[source]
Parameters:

numLeaves – Number of leaves

setNumTasks(value)[source]
Parameters:

numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

setNumThreads(value)[source]
Parameters:

numThreads – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

setObjective(value)[source]
Parameters:

objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

setObjectiveSeed(value)[source]
Parameters:

objectiveSeed – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

setOtherRate(value)[source]
Parameters:

otherRate – The retain ratio of small gradient data. Only used in goss.

setParallelism(value)[source]
Parameters:

parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel

setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxPosition=20, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Set the (keyword only) parameters

setPassThroughArgs(value)[source]
Parameters:

passThroughArgs – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
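A pass-through string with several parameters can be assembled from a dictionary, as in the sketch below. The helper build_pass_through_args is hypothetical, and joining the key=value pairs with spaces is an assumption; the description above only states that multiple parameters may share one string.

```python
def build_pass_through_args(params):
    """Render native LightGBM parameters as 'key=value key=value'."""
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Use the lowercase form seen in the force_row_wise=true example.
            value = str(value).lower()
        parts.append(f"{key}={value}")
    # Assumption: pairs are separated by a single space.
    return " ".join(parts)


args = build_pass_through_args(
    {"force_row_wise": True, "min_sum_hessian_in_leaf": 0.002}
)
print(args)  # force_row_wise=true min_sum_hessian_in_leaf=0.002
```

Since pass-through values override parameters set through explicit setters, keeping them in one dictionary makes it easier to spot accidental overrides.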

setPosBaggingFraction(value)[source]
Parameters:

posBaggingFraction – Positive Bagging fraction

setPredictDisableShapeCheck(value)[source]
Parameters:

predictDisableShapeCheck – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

setPredictionCol(value)[source]
Parameters:

predictionCol – prediction column name

setReferenceDataset(value)[source]
Parameters:

referenceDataset – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

setRepartitionByGroupingColumn(value)[source]
Parameters:

repartitionByGroupingColumn – Repartition training data according to grouping column, on by default.

setSamplingMode(value)[source]
Parameters:

samplingMode – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

setSamplingSubsetSize(value)[source]
Parameters:

samplingSubsetSize – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
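The relationship between binSampleCount and samplingSubsetSize can be pictured with a small plain-Python sketch. It is not the SynapseML implementation; sample_rows_for_binning and its seed argument are hypothetical, and the sketch only mirrors the description above: choose binSampleCount rows from the first N rows.

```python
import random


def sample_rows_for_binning(num_rows, bin_sample_count, subset_size, seed=1):
    """Pick row indices for bin construction from the first `subset_size` rows."""
    candidates = range(min(num_rows, subset_size))
    rng = random.Random(seed)
    # Never ask for more rows than the candidate window contains.
    count = min(bin_sample_count, len(candidates))
    return sorted(rng.sample(candidates, count))


# Defaults from the signature: binSampleCount=200000, samplingSubsetSize=1000000.
indices = sample_rows_for_binning(
    num_rows=5_000_000, bin_sample_count=200_000, subset_size=1_000_000
)
print(len(indices), max(indices) < 1_000_000)
```

Rows beyond the first N never influence the bins in this picture, which is why the description recommends the subset mode only when row order is expected to be random.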

setSeed(value)[source]
Parameters:

seed – Main seed, used to generate other seeds

setSkipDrop(value)[source]
Parameters:

skipDrop – Probability of skipping the dropout procedure during a boosting iteration

setSlotNames(value)[source]
Parameters:

slotNames – List of slot names in the features column

setTimeout(value)[source]
Parameters:

timeout – Timeout in seconds

setTopK(value)[source]
Parameters:

topK – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0

setTopRate(value)[source]
Parameters:

topRate – The retain ratio of large gradient data. Only used in goss.
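To make the roles of topRate and otherRate concrete, here is a minimal sketch of Gradient-based One-Side Sampling as it is usually described: keep the rows with the largest absolute gradients, sample from the rest, and up-weight the sampled rows. The function goss_sample and its seed argument are hypothetical, and native LightGBM differs in implementation details.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by one GOSS pass."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(top_rate * n)
    other_n = int(other_rate * n)
    top, rest = order[:top_n], order[top_n:]

    rng = random.Random(seed)
    sampled = rng.sample(rest, min(other_n, len(rest)))

    # Up-weight sampled small-gradient rows so that their total contribution
    # approximates that of all small-gradient rows.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


gradients = [0.9, -0.05, 0.01, -0.8, 0.02, 0.03, -0.04, 0.06, 0.07, -0.01]
weights = goss_sample(gradients)
print(sorted(weights.items()))
```

With the defaults topRate=0.2 and otherRate=0.1, ten rows yield two large-gradient rows at weight 1 and one sampled small-gradient row at a weight of about 8.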

setUniformDrop(value)[source]
Parameters:

uniformDrop – Set this to true to use uniform drop in dart mode

setUseBarrierExecutionMode(value)[source]
Parameters:

useBarrierExecutionMode – Barrier execution mode which uses a barrier stage, off by default.

setUseMissing(value)[source]
Parameters:

useMissing – Set this to false to disable the special handling of missing values

setUseSingleDatasetMode(value)[source]
Parameters:

useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

setValidationIndicatorCol(value)[source]
Parameters:

validationIndicatorCol – Indicates whether the row is for training or validation

setVerbosity(value)[source]
Parameters:

verbosity – Verbosity level: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug
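The thresholds above translate into a small lookup, sketched here in plain Python. The function verbosity_to_level is hypothetical and assumes integer input; it only restates the mapping from the description.

```python
def verbosity_to_level(verbosity):
    """Map an integer verbosity value to the log level named in the docs."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


for value in (-1, 0, 1, 2):
    print(value, verbosity_to_level(value))
```

The default verbosity=-1 in the signature therefore corresponds to Fatal-only logging.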

setWeightCol(value)[source]
Parameters:

weightCol – The name of the weight column

setXGBoostDartMode(value)[source]
Parameters:

xGBoostDartMode – Set this to true to use xgboost dart mode

setZeroAsMissing(value)[source]
Parameters:

zeroAsMissing – Set to true to treat all zeros as missing values (including values not stored in LibSVM / sparse matrices). Set to false to use na for representing missing values

skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0')
topRate = Param(parent='undefined', name='topRate', doc='The retain ratio of large gradient data. Only used in goss.')
uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
useMissing = Param(parent='undefined', name='useMissing', doc='Set this to false to disable the special handle of missing value')
useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.')
validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
xGBoostDartMode = Param(parent='undefined', name='xGBoostDartMode', doc='Set this to true to use xgboost dart mode')
zeroAsMissing = Param(parent='undefined', name='zeroAsMissing', doc='Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values')

synapse.ml.lightgbm.LightGBMRankerModel module

class synapse.ml.lightgbm.LightGBMRankerModel.LightGBMRankerModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=- 1, predictDisableShapeCheck=False, predictionCol='prediction', startIteration=0)[source]

Bases: LightGBMModelMixin, _LightGBMRankerModel

getBoosterNumClasses()[source]

Get the number of classes from the booster.

Returns:

The number of classes.

static loadNativeModelFromFile(filename)[source]

Load the model from a native LightGBM text file.

static loadNativeModelFromString(model)[source]

Load the model from a native LightGBM model string.

synapse.ml.lightgbm.LightGBMRegressionModel module

class synapse.ml.lightgbm.LightGBMRegressionModel.LightGBMRegressionModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=- 1, predictDisableShapeCheck=False, predictionCol='prediction', startIteration=0)[source]

Bases: LightGBMModelMixin, _LightGBMRegressionModel

static loadNativeModelFromFile(filename)[source]

Load the model from a native LightGBM text file.

static loadNativeModelFromString(model)[source]

Load the model from a native LightGBM model string.

synapse.ml.lightgbm.LightGBMRegressor module

class synapse.ml.lightgbm.LightGBMRegressor.LightGBMRegressor(java_obj=None, alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator

Parameters:
  • alpha (float) – parameter for Huber loss and Quantile regression

  • baggingFraction (float) – Bagging fraction

  • baggingFreq (int) – Bagging frequency

  • baggingSeed (int) – Bagging seed

  • binSampleCount (int) – Number of samples considered at computing histogram bins

  • boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence

  • boostingType (str) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

  • catSmooth (float) – Reduces the effect of noise in categorical features, especially for categories with few data points

  • categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column

  • categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column

  • catl2 (float) – L2 regularization in categorical split

  • chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.

  • dataRandomSeed (int) – Random seed for sampling data to construct histogram bins.

  • dataTransferMode (str) – Specifies how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The default shown in the class signature is streaming; bulk is the legacy mode.

  • defaultListenPort (int) – The default listen port on executors, used for testing

  • deterministic (bool) – Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true

  • driverListenPort (int) – The listen port on a driver. Default value is 0 (random)

  • dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout

  • dropSeed (int) – Random seed to choose dropping models. Only used in dart.

  • earlyStoppingRound (int) – Early stopping round

  • executionMode (str) – Deprecated. Please use dataTransferMode.

  • extraSeed (int) – Random seed for selecting threshold when extra_trees is true

  • featureFraction (float) – Feature fraction

  • featureFractionByNode (float) – Feature fraction by node

  • featureFractionSeed (int) – Feature fraction seed

  • featuresCol (str) – features column name

  • featuresShapCol (str) – Output SHAP vector column name after prediction containing the feature contribution values

  • fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

  • improvementTolerance (float) – Tolerance to consider improvement in metric

  • initScoreCol (str) – The name of the initial score column, used for continued training

  • isEnableSparse (bool) – Used to enable/disable sparse optimization

  • isProvideTrainingMetric (bool) – Whether to output metric results over the training dataset.

  • labelCol (str) – label column name

  • lambdaL1 (float) – L1 regularization

  • lambdaL2 (float) – L2 regularization

  • leafPredictionCol (str) – Column name for the predicted leaf indices

  • learningRate (float) – Learning rate or shrinkage rate

  • matrixType (str) – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

  • maxBin (int) – Max bin

  • maxBinByFeature (list) – Max number of bins for each feature

  • maxCatThreshold (int) – limit number of split points considered for categorical features

  • maxCatToOnehot (int) – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

  • maxDeltaStep (float) – Used to limit the max output of tree leaves

  • maxDepth (int) – Max depth

  • maxDrop (int) – Max number of dropped trees during one boosting iteration

  • maxNumClasses (int) – Maximum number of classes considered when inferring numClass in multi-class classification.

  • maxStreamingOMPThreads (int) – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

  • metric (str) – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

  • microBatchSize (int) – Specify how many elements are sent in a streaming micro-batch.

  • minDataInLeaf (int) – Minimal number of data in one leaf. Can be used to deal with over-fitting.

  • minDataPerBin (int) – Minimal number of data inside one bin

  • minDataPerGroup (int) – minimal number of data per categorical group

  • minGainToSplit (float) – The minimal gain to perform split

  • minSumHessianInLeaf (float) – Minimal sum hessian in one leaf

  • modelString (str) – LightGBM model to retrain

  • monotoneConstraints (list) – Constraints for monotonic features. 1 means increasing, -1 means decreasing, 0 means no constraint. Specify all features in order.

  • monotoneConstraintsMethod (str) – Monotone constraints method. basic, intermediate, or advanced.

  • monotonePenalty (float) – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

  • negBaggingFraction (float) – Negative Bagging fraction

  • numBatches (int) – If greater than 0, splits data into separate batches during training

  • numIterations (int) – Number of iterations, LightGBM constructs num_class * num_iterations trees

  • numLeaves (int) – Number of leaves

  • numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

  • numThreads (int) – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

  • objective (str) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

  • objectiveSeed (int) – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

  • otherRate (float) – The retain ratio of small gradient data. Only used in goss.

  • parallelism (str) – Tree learner parallelism, can be set to data_parallel or voting_parallel

  • passThroughArgs (str) – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

  • posBaggingFraction (float) – Positive Bagging fraction

  • predictDisableShapeCheck (bool) – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

  • predictionCol (str) – prediction column name

  • referenceDataset (list) – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

  • repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.

  • samplingMode (str) – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

  • samplingSubsetSize (int) – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

  • seed (int) – Main seed, used to generate other seeds

  • skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration

  • slotNames (list) – List of slot names in the features column

  • timeout (float) – Timeout in seconds

  • topK (int) – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0

  • topRate (float) – The retain ratio of large gradient data. Only used in goss.

  • tweedieVariancePower (float) – control the variance of tweedie distribution, must be between 1 and 2

  • uniformDrop (bool) – Set this to true to use uniform drop in dart mode

  • useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.

  • useMissing (bool) – Set this to false to disable the special handling of missing values

  • useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

  • validationIndicatorCol (str) – Indicates whether the row is for training or validation

  • verbosity (int) – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug

  • weightCol (str) – The name of the weight column

  • xGBoostDartMode (bool) – Set this to true to use xgboost dart mode

  • zeroAsMissing (bool) – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values

alpha = Param(parent='undefined', name='alpha', doc='parameter for Huber loss and Quantile regression')
baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
catSmooth = Param(parent='undefined', name='catSmooth', doc='this can reduce the effect of noises in categorical features, especially for categories with few data')
categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
catl2 = Param(parent='undefined', name='catl2', doc='L2 regularization in categorical split')
chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native.  If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.')
dataRandomSeed = Param(parent='undefined', name='dataRandomSeed', doc='Random seed for sampling data to construct histogram bins.')
dataTransferMode = Param(parent='undefined', name='dataTransferMode', doc='Specify how SynapseML transfers data from Spark to LightGBM.  Values can be streaming, bulk. Default is bulk, which is the legacy mode.')
defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
deterministic = Param(parent='undefined', name='deterministic', doc='Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters.  Note: setting this to true may slow down training.  To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true')
driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
dropSeed = Param(parent='undefined', name='dropSeed', doc='Random seed to choose dropping models. Only used in dart.')
earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
executionMode = Param(parent='undefined', name='executionMode', doc='Deprecated. Please use dataTransferMode.')
extraSeed = Param(parent='undefined', name='extraSeed', doc='Random seed for selecting threshold when extra_trees is true')
featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
featureFractionByNode = Param(parent='undefined', name='featureFractionByNode', doc='Feature fraction by node')
featureFractionSeed = Param(parent='undefined', name='featureFractionSeed', doc='Feature fraction seed')
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
getAlpha()[source]
Returns:

parameter for Huber loss and Quantile regression

Return type:

alpha

getBaggingFraction()[source]
Returns:

Bagging fraction

Return type:

baggingFraction

getBaggingFreq()[source]
Returns:

Bagging frequency

Return type:

baggingFreq

getBaggingSeed()[source]
Returns:

Bagging seed

Return type:

baggingSeed

getBinSampleCount()[source]
Returns:

Number of samples considered at computing histogram bins

Return type:

binSampleCount

getBoostFromAverage()[source]
Returns:

Adjusts initial score to the mean of labels for faster convergence

Return type:

boostFromAverage

getBoostingType()[source]
Returns:

Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

Return type:

boostingType

getCatSmooth()[source]
Returns:

Reduces the effect of noise in categorical features, especially for categories with few data points

Return type:

catSmooth

getCategoricalSlotIndexes()[source]
Returns:

List of categorical column indexes, the slot index in the features column

Return type:

categoricalSlotIndexes

getCategoricalSlotNames()[source]
Returns:

List of categorical column slot names, the slot name in the features column

Return type:

categoricalSlotNames

getCatl2()[source]
Returns:

L2 regularization in categorical split

Return type:

catl2

getChunkSize()[source]
Returns:

Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.

Return type:

chunkSize

getDataRandomSeed()[source]
Returns:

Random seed for sampling data to construct histogram bins.

Return type:

dataRandomSeed

getDataTransferMode()[source]
Returns:

Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

Return type:

dataTransferMode

getDefaultListenPort()[source]
Returns:

The default listen port on executors, used for testing

Return type:

defaultListenPort

getDeterministic()[source]
Returns:

Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true.

Return type:

deterministic

getDriverListenPort()[source]
Returns:

The listen port on a driver. Default value is 0 (random)

Return type:

driverListenPort

getDropRate()[source]
Returns:

Dropout rate: a fraction of previous trees to drop during the dropout

Return type:

dropRate

getDropSeed()[source]
Returns:

Random seed to choose dropping models. Only used in dart.

Return type:

dropSeed

getEarlyStoppingRound()[source]
Returns:

Early stopping round

Return type:

earlyStoppingRound

getExecutionMode()[source]
Returns:

Deprecated. Please use dataTransferMode.

Return type:

executionMode

getExtraSeed()[source]
Returns:

Random seed for selecting threshold when extra_trees is true

Return type:

extraSeed

getFeatureFraction()[source]
Returns:

Feature fraction

Return type:

featureFraction

getFeatureFractionByNode()[source]
Returns:

Feature fraction by node

Return type:

featureFractionByNode

getFeatureFractionSeed()[source]
Returns:

Feature fraction seed

Return type:

featureFractionSeed

getFeaturesCol()[source]
Returns:

features column name

Return type:

featuresCol

getFeaturesShapCol()[source]
Returns:

Output SHAP vector column name after prediction containing the feature contribution values

Return type:

featuresShapCol

getFobj()[source]
Returns:

Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

Return type:

fobj
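
As a rough illustration of the (grad, hess) contract described for fobj, the following sketch computes the gradient and Hessian of binary log loss with respect to raw scores. It uses only the standard library. ToyTrainData and binary_logloss_objective are hypothetical names introduced here, and the assumption that train_data exposes a labels attribute is not taken from the SynapseML API.

```python
import math


class ToyTrainData:
    """Hypothetical stand-in for the train_data argument; it only holds labels."""

    def __init__(self, labels):
        self.labels = labels


def binary_logloss_objective(preds, train_data):
    """Return (grad, hess) of binary log loss with respect to the raw scores.

    preds are raw (pre-sigmoid) scores; labels are assumed to be 0 or 1.
    """
    grad, hess = [], []
    for raw, label in zip(preds, train_data.labels):
        p = 1.0 / (1.0 + math.exp(-raw))  # sigmoid of the raw score
        grad.append(p - label)            # first derivative of log loss
        hess.append(p * (1.0 - p))        # second derivative of log loss
    return grad, hess


grad, hess = binary_logloss_objective([0.0, 0.0], ToyTrainData([1, 0]))
```

With raw scores of 0.0 the predicted probability is 0.5, so the gradients are -0.5 and 0.5 and both Hessian entries are 0.25.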

getImprovementTolerance()[source]
Returns:

Tolerance to consider improvement in metric

Return type:

improvementTolerance

getInitScoreCol()[source]
Returns:

The name of the initial score column, used for continued training

Return type:

initScoreCol

getIsEnableSparse()[source]
Returns:

Used to enable/disable sparse optimization

Return type:

isEnableSparse

getIsProvideTrainingMetric()[source]
Returns:

Whether to output metric results over the training dataset.

Return type:

isProvideTrainingMetric

static getJavaPackage()[source]

Returns package name String.

getLabelCol()[source]
Returns:

label column name

Return type:

labelCol

getLambdaL1()[source]
Returns:

L1 regularization

Return type:

lambdaL1

getLambdaL2()[source]
Returns:

L2 regularization

Return type:

lambdaL2

getLeafPredictionCol()[source]
Returns:

Column name for the predicted leaf indices

Return type:

leafPredictionCol

getLearningRate()[source]
Returns:

Learning rate or shrinkage rate

Return type:

learningRate

getMatrixType()[source]
Returns:

Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

Return type:

matrixType

getMaxBin()[source]
Returns:

Max bin

Return type:

maxBin

getMaxBinByFeature()[source]
Returns:

Max number of bins for each feature

Return type:

maxBinByFeature

getMaxCatThreshold()[source]
Returns:

limit number of split points considered for categorical features

Return type:

maxCatThreshold

getMaxCatToOnehot()[source]
Returns:

When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

Return type:

maxCatToOnehot

getMaxDeltaStep()[source]
Returns:

Used to limit the max output of tree leaves

Return type:

maxDeltaStep

getMaxDepth()[source]
Returns:

Max depth

Return type:

maxDepth

getMaxDrop()[source]
Returns:

Max number of dropped trees during one boosting iteration

Return type:

maxDrop

getMaxNumClasses()[source]
Returns:

Maximum number of classes considered when inferring numClass in multi-class classification.

Return type:

maxNumClasses

getMaxStreamingOMPThreads()[source]
Returns:

Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

Return type:

maxStreamingOMPThreads

getMetric()[source]
Returns:

Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

Return type:

metric
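
The alias lists in the metric description can be read as a lookup from any accepted spelling to one canonical name. The sketch below transcribes a subset of those aliases into a plain dictionary. METRIC_ALIASES and canonical_metric are hypothetical helpers for illustration only and are not part of SynapseML or LightGBM.

```python
# Subset of the alias table, transcribed from the metric description above.
METRIC_ALIASES = {
    "l1": ["mean_absolute_error", "mae", "regression_l1"],
    "l2": ["mean_squared_error", "mse", "regression_l2", "regression"],
    "rmse": ["root_mean_squared_error", "l2_root"],
    "mape": ["mean_absolute_percentage_error"],
    "ndcg": ["lambdarank"],
    "map": ["mean_average_precision"],
    "auc": [],
    "binary_logloss": ["binary"],
    "cross_entropy": ["xentropy"],
    "cross_entropy_lambda": ["xentlambda"],
    "kullback_leibler": ["kldiv"],
}


def canonical_metric(name):
    """Return the canonical metric name for a canonical name or one of its aliases."""
    for canonical, aliases in METRIC_ALIASES.items():
        if name == canonical or name in aliases:
            return canonical
    raise ValueError(f"unknown metric or alias: {name!r}")


resolved = [canonical_metric(m) for m in ("mae", "kldiv", "auc")]
```

Normalizing names this way before calling a setter can make it easier to compare configurations that use different spellings of the same metric.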

getMicroBatchSize()[source]
Returns:

Specify how many elements are sent in a streaming micro-batch.

Return type:

microBatchSize

getMinDataInLeaf()[source]
Returns:

Minimal number of data in one leaf. Can be used to deal with over-fitting.

Return type:

minDataInLeaf

getMinDataPerBin()[source]
Returns:

Minimal number of data inside one bin

Return type:

minDataPerBin

getMinDataPerGroup()[source]
Returns:

minimal number of data per categorical group

Return type:

minDataPerGroup

getMinGainToSplit()[source]
Returns:

The minimal gain to perform split

Return type:

minGainToSplit

getMinSumHessianInLeaf()[source]
Returns:

Minimal sum hessian in one leaf

Return type:

minSumHessianInLeaf

getModelString()[source]
Returns:

LightGBM model to retrain

Return type:

modelString

getMonotoneConstraints()[source]
Returns:

Constraints for monotonic features. 1 means increasing, -1 means decreasing, 0 means no constraint. Specify all features in order.

Return type:

monotoneConstraints
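
Because monotoneConstraints expects one entry per feature, in feature order, it can help to derive the list from the slot names instead of writing it by hand. The following is a minimal sketch under that assumption. build_monotone_constraints and the feature names are hypothetical and only illustrate the 1 / -1 / 0 encoding.

```python
def build_monotone_constraints(slot_names, directions):
    """Build a constraint list aligned with slot_names.

    directions maps a feature name to 1 (increasing) or -1 (decreasing);
    features that are not listed get 0 (no constraint).
    """
    allowed = {-1, 0, 1}
    constraints = []
    for name in slot_names:
        direction = directions.get(name, 0)
        if direction not in allowed:
            raise ValueError(f"invalid constraint {direction!r} for {name!r}")
        constraints.append(direction)
    return constraints


constraints = build_monotone_constraints(
    ["age", "income", "zip_code"],
    {"age": 1, "income": -1},
)
```

Here the resulting list is [1, -1, 0]: age is constrained to be increasing, income decreasing, and zip_code is left unconstrained.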

getMonotoneConstraintsMethod()[source]
Returns:

Monotone constraints method. basic, intermediate, or advanced.

Return type:

monotoneConstraintsMethod

getMonotonePenalty()[source]
Returns:

A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

Return type:

monotonePenalty

getNegBaggingFraction()[source]
Returns:

Negative Bagging fraction

Return type:

negBaggingFraction

getNumBatches()[source]
Returns:

If greater than 0, splits data into separate batches during training

Return type:

numBatches

getNumIterations()[source]
Returns:

Number of iterations, LightGBM constructs num_class * num_iterations trees

Return type:

numIterations
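
The relationship between iterations and trees stated above is simple arithmetic. The sketch below spells it out, assuming num_class is 1 for a binary objective and that training runs for all iterations (early stopping could end it sooner). total_trees is a hypothetical helper name.

```python
def total_trees(num_iterations, num_class=1):
    """Number of trees built when all iterations run: num_class * num_iterations."""
    if num_iterations < 0 or num_class < 1:
        raise ValueError("num_iterations must be >= 0 and num_class must be >= 1")
    return num_class * num_iterations


binary_trees = total_trees(100)                   # 100 trees
three_class_trees = total_trees(100, num_class=3) # 300 trees
```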

getNumLeaves()[source]
Returns:

Number of leaves

Return type:

numLeaves

getNumTasks()[source]
Returns:

Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

Return type:

numTasks

getNumThreads()[source]
Returns:

Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

Return type:

numThreads

getObjective()[source]
Returns:

The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

Return type:

objective

getObjectiveSeed()[source]
Returns:

Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

Return type:

objectiveSeed

getOtherRate()[source]
Returns:

The retain ratio of small gradient data. Only used in goss.

Return type:

otherRate

getParallelism()[source]
Returns:

Tree learner parallelism, can be set to data_parallel or voting_parallel

Return type:

parallelism

getPassThroughArgs()[source]
Returns:

Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

Return type:

passThroughArgs
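
A pass-through string that carries several parameters can be assembled from a dictionary. This sketch assumes space-separated key=value pairs and lowercase true/false for booleans, following the force_row_wise=true example above. The separator is an assumption and should be checked against the SynapseML version in use. format_pass_through_args is a hypothetical helper.

```python
def format_pass_through_args(params):
    """Join parameters into a single 'key=value key=value' string.

    Assumes pairs are separated by a single space (an assumption, not confirmed
    by the text above) and that booleans are written as lowercase true/false.
    """
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)


args = format_pass_through_args({"force_row_wise": True, "max_bin": 63})
```

Keep in mind that, as described above, values passed this way override parameters given through explicit setters.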

getPosBaggingFraction()[source]
Returns:

Positive Bagging fraction

Return type:

posBaggingFraction

getPredictDisableShapeCheck()[source]
Returns:

control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

Return type:

predictDisableShapeCheck

getPredictionCol()[source]
Returns:

prediction column name

Return type:

predictionCol

getReferenceDataset()[source]
Returns:

The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

Return type:

referenceDataset

getRepartitionByGroupingColumn()[source]
Returns:

Repartition training data according to grouping column, on by default.

Return type:

repartitionByGroupingColumn

getSamplingMode()[source]
Returns:

Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

Return type:

samplingMode

getSamplingSubsetSize()[source]
Returns:

Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

Return type:

samplingSubsetSize
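
The three sampling modes can be pictured on an in-memory list. This is only a conceptual sketch, not how SynapseML samples distributed data. It assumes that the fixed mode takes the first bin_sample_count rows, which the description above does not state explicitly, and sample_rows_for_binning is a hypothetical name.

```python
import random


def sample_rows_for_binning(rows, mode, bin_sample_count, subset_size, seed=1):
    """Pick the rows used to define histogram bins under each sampling mode."""
    rng = random.Random(seed)
    if mode == "fixed":
        # Assumption: take the first bin_sample_count rows with no randomness.
        return list(rows[:bin_sample_count])
    if mode == "global":
        pool = list(rows)                 # sample from all data
    elif mode == "subset":
        pool = list(rows[:subset_size])   # sample from the first N rows only
    else:
        raise ValueError(f"unknown sampling mode: {mode!r}")
    return rng.sample(pool, min(bin_sample_count, len(pool)))


rows = list(range(100))
subset_sample = sample_rows_for_binning(rows, "subset", 5, 20)
fixed_sample = sample_rows_for_binning(rows, "fixed", 5, 20)
global_sample = sample_rows_for_binning(rows, "global", 5, 20)
```

In the subset case every sampled row comes from the first 20 rows, which is why this mode is best suited to data whose row order is already random.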

getSeed()[source]
Returns:

Main seed, used to generate other seeds

Return type:

seed

getSkipDrop()[source]
Returns:

Probability of skipping the dropout procedure during a boosting iteration

Return type:

skipDrop

getSlotNames()[source]
Returns:

List of slot names in the features column

Return type:

slotNames

getTimeout()[source]
Returns:

Timeout in seconds

Return type:

timeout

getTopK()[source]
Returns:

The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0

Return type:

topK

getTopRate()[source]
Returns:

The retain ratio of large gradient data. Only used in goss.

Return type:

topRate

getTweedieVariancePower()[source]
Returns:

control the variance of tweedie distribution, must be between 1 and 2

Return type:

tweedieVariancePower

getUniformDrop()[source]
Returns:

Set this to true to use uniform drop in dart mode

Return type:

uniformDrop

getUseBarrierExecutionMode()[source]
Returns:

Barrier execution mode which uses a barrier stage, off by default.

Return type:

useBarrierExecutionMode

getUseMissing()[source]
Returns:

Set this to false to disable the special handling of missing values

Return type:

useMissing

getUseSingleDatasetMode()[source]
Returns:

Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

Return type:

useSingleDatasetMode

getValidationIndicatorCol()[source]
Returns:

Indicates whether the row is for training or validation

Return type:

validationIndicatorCol

getVerbosity()[source]
Returns:

Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug

Return type:

verbosity
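
The verbosity thresholds above translate directly into a small mapping. verbosity_level is a hypothetical helper that only restates the documented ranges.

```python
def verbosity_level(verbosity):
    """Map a verbosity integer to the log level named in the description above."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


levels = [verbosity_level(v) for v in (-1, 0, 1, 2)]
```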

getWeightCol()[source]
Returns:

The name of the weight column

Return type:

weightCol

getXGBoostDartMode()[source]
Returns:

Set this to true to use xgboost dart mode

Return type:

xGBoostDartMode

getZeroAsMissing()[source]
Returns:

Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values

Return type:

zeroAsMissing

improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
isEnableSparse = Param(parent='undefined', name='isEnableSparse', doc='Used to enable/disable sparse optimization')
isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether output metric result over training dataset.')
labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices's column name")
learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense.  Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
maxCatThreshold = Param(parent='undefined', name='maxCatThreshold', doc='limit number of split points considered for categorical features')
maxCatToOnehot = Param(parent='undefined', name='maxCatToOnehot', doc='when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used')
maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
maxNumClasses = Param(parent='undefined', name='maxNumClasses', doc='Number of max classes to infer numClass in multi-class classification.')
maxStreamingOMPThreads = Param(parent='undefined', name='maxStreamingOMPThreads', doc="Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it's best to set a fixed value.")
metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data.  Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
microBatchSize = Param(parent='undefined', name='microBatchSize', doc='Specify how many elements are sent in a streaming micro-batch.')
minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
minDataPerBin = Param(parent='undefined', name='minDataPerBin', doc='Minimal number of data inside one bin')
minDataPerGroup = Param(parent='undefined', name='minDataPerGroup', doc='minimal number of data per categorical group')
minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
monotoneConstraints = Param(parent='undefined', name='monotoneConstraints', doc='used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.')
monotoneConstraintsMethod = Param(parent='undefined', name='monotoneConstraintsMethod', doc='Monotone constraints method. basic, intermediate, or advanced.')
monotonePenalty = Param(parent='undefined', name='monotonePenalty', doc='A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.')
negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks.  SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.')
objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
objectiveSeed = Param(parent='undefined', name='objectiveSeed', doc='Random seed for objectives, if random process is needed.  Currently used only for rank_xendcg objective.')
otherRate = Param(parent='undefined', name='otherRate', doc='The retain ratio of small gradient data. Only used in goss.')
parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
passThroughArgs = Param(parent='undefined', name='passThroughArgs', doc='Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true')
posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
predictDisableShapeCheck = Param(parent='undefined', name='predictDisableShapeCheck', doc='control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data')
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
classmethod read()[source]

Returns an MLReader instance for this class.

referenceDataset = Param(parent='undefined', name='referenceDataset', doc='The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().')
repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
samplingMode = Param(parent='undefined', name='samplingMode', doc="Data sampling for streaming mode. Sampled data is used to define bins. 'global': sample from all data, 'subset': sample from first N rows, or 'fixed': Take first N rows as sample. Values can be global, subset, or fixed. Default is subset.")
samplingSubsetSize = Param(parent='undefined', name='samplingSubsetSize', doc="Specify subset size N for the sampling mode 'subset'. 'binSampleCount' rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.")
seed = Param(parent='undefined', name='seed', doc='Main seed, used to generate other seeds')
setAlpha(value)[source]
Parameters:

alpha – parameter for Huber loss and Quantile regression

setBaggingFraction(value)[source]
Parameters:

baggingFraction – Bagging fraction

setBaggingFreq(value)[source]
Parameters:

baggingFreq – Bagging frequency

setBaggingSeed(value)[source]
Parameters:

baggingSeed – Bagging seed

setBinSampleCount(value)[source]
Parameters:

binSampleCount – Number of samples considered when computing histogram bins

setBoostFromAverage(value)[source]
Parameters:

boostFromAverage – Adjusts initial score to the mean of labels for faster convergence

setBoostingType(value)[source]
Parameters:

boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

setCatSmooth(value)[source]
Parameters:

catSmooth – Reduces the effect of noise in categorical features, especially for categories with few data points

setCategoricalSlotIndexes(value)[source]
Parameters:

categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column

setCategoricalSlotNames(value)[source]
Parameters:

categoricalSlotNames – List of categorical column slot names, the slot name in the features column

setCatl2(value)[source]
Parameters:

catl2 – L2 regularization in categorical split

setChunkSize(value)[source]
Parameters:

chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.

setDataRandomSeed(value)[source]
Parameters:

dataRandomSeed – Random seed for sampling data to construct histogram bins.

setDataTransferMode(value)[source]
Parameters:

dataTransferMode – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The default shown in the setParams signature is streaming; bulk is the legacy mode.

setDefaultListenPort(value)[source]
Parameters:

defaultListenPort – The default listen port on executors, used for testing

setDeterministic(value)[source]
Parameters:

deterministic – Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true.

setDriverListenPort(value)[source]
Parameters:

driverListenPort – The listen port on a driver. Default value is 0 (random)

setDropRate(value)[source]
Parameters:

dropRate – Dropout rate: a fraction of previous trees to drop during the dropout

setDropSeed(value)[source]
Parameters:

dropSeed – Random seed to choose dropping models. Only used in dart.

setEarlyStoppingRound(value)[source]
Parameters:

earlyStoppingRound – Early stopping round

setExecutionMode(value)[source]
Parameters:

executionMode – Deprecated. Please use dataTransferMode.

setExtraSeed(value)[source]
Parameters:

extraSeed – Random seed for selecting threshold when extra_trees is true

setFeatureFraction(value)[source]
Parameters:

featureFraction – Feature fraction

setFeatureFractionByNode(value)[source]
Parameters:

featureFractionByNode – Feature fraction by node

setFeatureFractionSeed(value)[source]
Parameters:

featureFractionSeed – Feature fraction seed

setFeaturesCol(value)[source]
Parameters:

featuresCol – features column name

setFeaturesShapCol(value)[source]
Parameters:

featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values

setFobj(value)[source]
Parameters:

fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

setImprovementTolerance(value)[source]
Parameters:

improvementTolerance – Tolerance to consider improvement in metric

setInitScoreCol(value)[source]
Parameters:

initScoreCol – The name of the initial score column, used for continued training

setIsEnableSparse(value)[source]
Parameters:

isEnableSparse – Used to enable/disable sparse optimization

setIsProvideTrainingMetric(value)[source]
Parameters:

isProvideTrainingMetric – Whether to output metric results over the training dataset.

setLabelCol(value)[source]
Parameters:

labelCol – label column name

setLambdaL1(value)[source]
Parameters:

lambdaL1 – L1 regularization

setLambdaL2(value)[source]
Parameters:

lambdaL2 – L2 regularization

setLeafPredictionCol(value)[source]
Parameters:

leafPredictionCol – Column name for the predicted leaf indices

setLearningRate(value)[source]
Parameters:

learningRate – Learning rate or shrinkage rate

setMatrixType(value)[source]
Parameters:

matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

setMaxBin(value)[source]
Parameters:

maxBin – Max bin

setMaxBinByFeature(value)[source]
Parameters:

maxBinByFeature – Max number of bins for each feature

setMaxCatThreshold(value)[source]
Parameters:

maxCatThreshold – limit number of split points considered for categorical features

setMaxCatToOnehot(value)[source]
Parameters:

maxCatToOnehot – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

setMaxDeltaStep(value)[source]
Parameters:

maxDeltaStep – Used to limit the max output of tree leaves

setMaxDepth(value)[source]
Parameters:

maxDepth – Max depth

setMaxDrop(value)[source]
Parameters:

maxDrop – Max number of dropped trees during one boosting iteration

setMaxNumClasses(value)[source]
Parameters:

maxNumClasses – Number of max classes to infer numClass in multi-class classification.

setMaxStreamingOMPThreads(value)[source]
Parameters:

maxStreamingOMPThreads – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

setMetric(value)[source]
Parameters:

metric – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
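
The alias list above can be hard to scan. As a minimal sketch, the following transcribes a subset of those aliases into a lookup table; `METRIC_ALIASES` and `canonical_metric` are hypothetical helpers for illustration and are not part of SynapseML or LightGBM.

```python
# Subset of the aliases listed in the metric description above.
# The table and helper below are illustrative only, not a library API.
METRIC_ALIASES = {
    "l1": ["mean_absolute_error", "mae", "regression_l1"],
    "l2": ["mean_squared_error", "mse", "regression_l2", "regression"],
    "rmse": ["root_mean_squared_error", "l2_root"],
    "mape": ["mean_absolute_percentage_error"],
    "ndcg": ["lambdarank"],
    "map": ["mean_average_precision"],
    "binary_logloss": ["binary"],
    "cross_entropy": ["xentropy"],
    "cross_entropy_lambda": ["xentlambda"],
    "kullback_leibler": ["kldiv"],
}

# Reverse lookup: canonical names map to themselves, aliases to their canonical name.
_CANONICAL = {name: name for name in METRIC_ALIASES}
for _name, _aliases in METRIC_ALIASES.items():
    for _alias in _aliases:
        _CANONICAL[_alias] = _name


def canonical_metric(name: str) -> str:
    """Return the canonical metric name, or the input unchanged if it is not in the table."""
    return _CANONICAL.get(name, name)


print(canonical_metric("mae"))    # l1
print(canonical_metric("kldiv"))  # kullback_leibler
```

Names that are not in the table (for example auc) are returned unchanged, so the helper can be applied before calling setMetric without rejecting valid values.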

setMicroBatchSize(value)[source]
Parameters:

microBatchSize – Specify how many elements are sent in a streaming micro-batch.

setMinDataInLeaf(value)[source]
Parameters:

minDataInLeaf – Minimal number of data in one leaf. Can be used to deal with over-fitting.

setMinDataPerBin(value)[source]
Parameters:

minDataPerBin – Minimal number of data inside one bin

setMinDataPerGroup(value)[source]
Parameters:

minDataPerGroup – minimal number of data per categorical group

setMinGainToSplit(value)[source]
Parameters:

minGainToSplit – The minimal gain to perform split

setMinSumHessianInLeaf(value)[source]
Parameters:

minSumHessianInLeaf – Minimal sum hessian in one leaf

setModelString(value)[source]
Parameters:

modelString – LightGBM model to retrain

setMonotoneConstraints(value)[source]
Parameters:

monotoneConstraints – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
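
Because the constraint list must cover all features in order, it can help to derive it from the feature ordering rather than write it by hand. The sketch below assumes you know the order of the slots in the features column; `build_monotone_constraints` and the feature names are hypothetical.

```python
from typing import Dict, List


def build_monotone_constraints(feature_order: List[str], directions: Dict[str, int]) -> List[int]:
    """Return one constraint per feature, in feature order (hypothetical helper).

    1 means increasing, -1 means decreasing, 0 means no constraint.
    """
    unknown = set(directions) - set(feature_order)
    if unknown:
        raise ValueError(f"constraints given for unknown features: {sorted(unknown)}")
    for name, direction in directions.items():
        if direction not in (-1, 0, 1):
            raise ValueError(f"constraint for {name!r} must be -1, 0, or 1")
    # Features without an explicit direction default to 0 (unconstrained).
    return [directions.get(name, 0) for name in feature_order]


constraints = build_monotone_constraints(
    ["age", "income", "tenure"],   # illustrative feature names
    {"age": 1, "tenure": -1},
)
print(constraints)  # [1, 0, -1]
```

The resulting list is what would be passed to setMonotoneConstraints.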

setMonotoneConstraintsMethod(value)[source]
Parameters:

monotoneConstraintsMethod – Monotone constraints method. basic, intermediate, or advanced.

setMonotonePenalty(value)[source]
Parameters:

monotonePenalty – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

setNegBaggingFraction(value)[source]
Parameters:

negBaggingFraction – Negative Bagging fraction

setNumBatches(value)[source]
Parameters:

numBatches – If greater than 0, splits data into separate batches during training

setNumIterations(value)[source]
Parameters:

numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
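
As a quick worked example of the tree count stated above, the following sketch multiplies the two quantities. `total_trees` is a hypothetical helper; it assumes num_class is 1 for objectives that are not multi-class.

```python
def total_trees(num_class: int, num_iterations: int) -> int:
    """Number of trees built: num_class * num_iterations (illustrative helper)."""
    if num_class < 1 or num_iterations < 0:
        raise ValueError("num_class must be >= 1 and num_iterations must be >= 0")
    return num_class * num_iterations


print(total_trees(num_class=3, num_iterations=100))  # 300
print(total_trees(num_class=1, num_iterations=100))  # 100
```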

setNumLeaves(value)[source]
Parameters:

numLeaves – Number of leaves

setNumTasks(value)[source]
Parameters:

numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

setNumThreads(value)[source]
Parameters:

numThreads – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

setObjective(value)[source]
Parameters:

objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

setObjectiveSeed(value)[source]
Parameters:

objectiveSeed – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

setOtherRate(value)[source]
Parameters:

otherRate – The retain ratio of small gradient data. Only used in goss.

setParallelism(value)[source]
Parameters:

parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel

setParams(alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Set the (keyword only) parameters
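
A minimal sketch of calling setParams with keyword arguments follows. It assumes these setters belong to the LightGBMRegressor estimator (suggested by the objective='regression' default and the alpha and tweedieVariancePower parameters) and that a Spark session with SynapseML is available when `build_regressor` is called; the helper name and the chosen values are illustrative.

```python
# Keyword names are taken from the setParams signature above; the values are examples.
quantile_params = {
    "objective": "quantile",
    "alpha": 0.5,            # parameter for Huber loss and Quantile regression
    "numIterations": 200,
    "learningRate": 0.05,
    "numLeaves": 31,
    "labelCol": "label",
    "featuresCol": "features",
}


def build_regressor(params):
    """Create a configured estimator (hypothetical helper).

    Assumption: the class is synapse.ml.lightgbm.LightGBMRegressor. The import is
    deferred so this module loads even where Spark and SynapseML are not installed.
    """
    from synapse.ml.lightgbm import LightGBMRegressor

    # setParams is keyword only, so the dictionary is unpacked into keywords.
    return LightGBMRegressor().setParams(**params)
```

With a running Spark session, usage would look like `build_regressor(quantile_params).fit(train_df)`, where `train_df` is a hypothetical DataFrame containing the label and features columns.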

setPassThroughArgs(value)[source]
Parameters:

passThroughArgs – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
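
To pass several native parameters at once, the string can be assembled from a dictionary. This sketch assumes that multiple key=value pairs are separated by single spaces and that booleans are written in lowercase, as in the force_row_wise=true example above; `build_pass_through_args` is a hypothetical helper.

```python
def build_pass_through_args(params: dict) -> str:
    """Join native LightGBM parameters into one string (illustrative helper).

    Assumption: pairs are separated by a single space.
    """
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Match the lowercase boolean style of the documented example.
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)


args = build_pass_through_args({"force_row_wise": True, "extra_trees": True})
print(args)  # force_row_wise=true extra_trees=true
```

The result is what would be handed to setPassThroughArgs. Keep in mind that values given this way override parameters set through explicit setters.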

setPosBaggingFraction(value)[source]
Parameters:

posBaggingFraction – Positive Bagging fraction

setPredictDisableShapeCheck(value)[source]
Parameters:

predictDisableShapeCheck – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

setPredictionCol(value)[source]
Parameters:

predictionCol – prediction column name

setReferenceDataset(value)[source]
Parameters:

referenceDataset – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

setRepartitionByGroupingColumn(value)[source]
Parameters:

repartitionByGroupingColumn – Repartition training data according to grouping column, on by default.

setSamplingMode(value)[source]
Parameters:

samplingMode – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: Take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

setSamplingSubsetSize(value)[source]
Parameters:

samplingSubsetSize – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
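
The three sampling modes can be illustrated in plain Python. This is a conceptual sketch only, not SynapseML's implementation: `sample_for_binning` is a hypothetical function, and it assumes that 'fixed' takes the leading binSampleCount rows without any randomness.

```python
import random
from typing import List, Sequence


def sample_for_binning(rows: Sequence, mode: str, bin_sample_count: int,
                       subset_size: int, seed: int = 1) -> List:
    """Pick the rows used to define bins under each sampling mode (illustrative)."""
    rng = random.Random(seed)
    if mode == "global":
        pool = list(rows)                  # sample from all data
    elif mode == "subset":
        pool = list(rows[:subset_size])    # sample from the first N rows
    elif mode == "fixed":
        # Assumption: take the leading rows as-is, with no random choice.
        return list(rows[:bin_sample_count])
    else:
        raise ValueError(f"unknown sampling mode: {mode!r}")
    return rng.sample(pool, min(bin_sample_count, len(pool)))


rows = list(range(1000))
subset_sample = sample_for_binning(rows, "subset", bin_sample_count=10, subset_size=100)
fixed_sample = sample_for_binning(rows, "fixed", bin_sample_count=10, subset_size=100)
global_sample = sample_for_binning(rows, "global", bin_sample_count=10, subset_size=100)
```

In the 'subset' case every sampled row comes from the first 100 rows, which is why that mode only gives representative bins when the rows are in random order.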

setSeed(value)[source]
Parameters:

seed – Main seed, used to generate other seeds

setSkipDrop(value)[source]
Parameters:

skipDrop – Probability of skipping the dropout procedure during a boosting iteration

setSlotNames(value)[source]
Parameters:

slotNames – List of slot names in the features column

setTimeout(value)[source]
Parameters:

timeout – Timeout in seconds

setTopK(value)[source]
Parameters:

topK – The top_k value used in voting parallel. A larger value gives more accurate results but slows down training. It must be greater than 0

setTopRate(value)[source]
Parameters:

topRate – The retain ratio of large gradient data. Only used in goss.
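
To show how topRate and otherRate interact, here is a minimal sketch of Gradient-based One-Side Sampling as it is described in the LightGBM paper: keep the rows with the largest gradients, randomly sample from the rest, and up-weight the sampled rows by (1 - topRate) / otherRate. It is an illustration in plain Python, not the library's code, and `goss_sample` is a hypothetical name.

```python
import random
from typing import List, Tuple


def goss_sample(gradients: List[float], top_rate: float, other_rate: float,
                seed: int = 0) -> List[Tuple[int, float]]:
    """Return (row_index, weight) pairs chosen by a GOSS-style sampler (illustrative)."""
    if not 0.0 < other_rate <= 1.0 or not 0.0 <= top_rate < 1.0:
        raise ValueError("expected 0 <= top_rate < 1 and 0 < other_rate <= 1")
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(top_rate * n)
    other_n = int(other_rate * n)
    top, rest = order[:top_n], order[top_n:]
    # Randomly keep a fraction of the small-gradient rows.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(other_n, len(rest)))
    # Up-weight the sampled rows to compensate for the rows that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sampled]


grads = [float(i) for i in range(100)]
selected = goss_sample(grads, top_rate=0.2, other_rate=0.1)
print(len(selected))  # 30
```

With the defaults topRate=0.2 and otherRate=0.1, 30 of 100 rows are used per iteration, and each sampled small-gradient row carries a weight of about 8.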

setTweedieVariancePower(value)[source]
Parameters:

tweedieVariancePower – control the variance of tweedie distribution, must be between 1 and 2

setUniformDrop(value)[source]
Parameters:

uniformDrop – Set this to true to use uniform drop in dart mode

setUseBarrierExecutionMode(value)[source]
Parameters:

useBarrierExecutionMode – Barrier execution mode which uses a barrier stage, off by default.

setUseMissing(value)[source]
Parameters:

useMissing – Set this to false to disable the special handling of missing values

setUseSingleDatasetMode(value)[source]
Parameters:

useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

setValidationIndicatorCol(value)[source]
Parameters:

validationIndicatorCol – Indicates whether the row is for training or validation

setVerbosity(value)[source]
Parameters:

verbosity – Verbosity level: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug
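
The thresholds above map to log levels as in this small sketch; `verbosity_level` is a hypothetical helper written only to make the boundaries explicit.

```python
def verbosity_level(verbosity: int) -> str:
    """Translate a verbosity value into the log level named in the description."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


print(verbosity_level(-1))  # Fatal
print(verbosity_level(2))   # Debug
```

The default of -1 shown in the setParams signature therefore corresponds to Fatal.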

setWeightCol(value)[source]
Parameters:

weightCol – The name of the weight column

setXGBoostDartMode(value)[source]
Parameters:

xGBoostDartMode – Set this to true to use xgboost dart mode

setZeroAsMissing(value)[source]
Parameters:

zeroAsMissing – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values

skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
topK = Param(parent='undefined', name='topK', doc='The top_k value used in voting parallel. A larger value gives more accurate results but slows down training. It must be greater than 0')
topRate = Param(parent='undefined', name='topRate', doc='The retain ratio of large gradient data. Only used in goss.')
tweedieVariancePower = Param(parent='undefined', name='tweedieVariancePower', doc='control the variance of tweedie distribution, must be between 1 and 2')
uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
useMissing = Param(parent='undefined', name='useMissing', doc='Set this to false to disable the special handling of missing values')
useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.')
validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity level: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
xGBoostDartMode = Param(parent='undefined', name='xGBoostDartMode', doc='Set this to true to use xgboost dart mode')
zeroAsMissing = Param(parent='undefined', name='zeroAsMissing', doc='Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values')

synapse.ml.lightgbm.mixin module

class synapse.ml.lightgbm.mixin.LightGBMModelMixin[source]

Bases: object

getBoosterBestIteration()[source]

Get the best iteration from the booster.

Returns:

The best iteration, if early stopping was triggered.

getBoosterNumFeatures()[source]

Get the number of features from the booster.

Returns:

The number of features.

getBoosterNumTotalIterations()[source]

Get the total number of iterations trained.

Returns:

The total number of iterations trained.

getBoosterNumTotalModel()[source]

Get the total number of models trained.

Returns:

The total number of models.

getFeatureImportances(importance_type='split')[source]

Get the feature importances as a list. The importance_type can be “split” or “gain”.
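
Since the importances come back as a plain list, pairing them with feature names is left to the caller. The sketch below assumes the list is ordered like the slots of the features column; `rank_features` and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def rank_features(importances: Sequence[float],
                  feature_names: Sequence[str]) -> List[Tuple[str, float]]:
    """Pair each importance with its feature name and sort, highest first (illustrative)."""
    if len(importances) != len(feature_names):
        raise ValueError("importances and feature_names must have the same length")
    pairs = zip(feature_names, importances)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Stand-in values; with a fitted model the list would come from
# model.getFeatureImportances(importance_type="gain").
ranked = rank_features([12.0, 0.0, 30.0], ["age", "income", "tenure"])
print(ranked)  # [('tenure', 30.0), ('age', 12.0), ('income', 0.0)]
```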

getFeatureShaps(vector)[source]

Get the local shap feature importances.

getNativeModel()[source]

Get the native model serialized representation as a string.

saveNativeModel(filename, overwrite=True)[source]

Save the booster in string format to a local or WASB remote location.

setPredictDisableShapeCheck(value=None)[source]

Set whether to disable the shape check when predicting.

Module contents

SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond latency web services, backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.