synapse.ml.lightgbm package

Submodules

synapse.ml.lightgbm.LightGBMClassificationModel module

class synapse.ml.lightgbm.LightGBMClassificationModel.LightGBMClassificationModel(java_obj=None, actualNumClasses=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictDisableShapeCheck=False, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', startIteration=0, thresholds=None)[source]

Bases: synapse.ml.lightgbm.mixin.LightGBMModelMixin, synapse.ml.lightgbm._LightGBMClassificationModel._LightGBMClassificationModel

getBoosterNumClasses()[source]

Get the number of classes from the booster.

Returns

The number of classes.

static loadNativeModelFromFile(filename)[source]

Load the model from a native LightGBM text file.

static loadNativeModelFromString(model)[source]

Load the model from a native LightGBM model string.

synapse.ml.lightgbm.LightGBMClassifier module

class synapse.ml.lightgbm.LightGBMClassifier.LightGBMClassifier(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='bulk', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=-1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters
  • baggingFraction (float) – Bagging fraction

  • baggingFreq (int) – Bagging frequency

  • baggingSeed (int) – Bagging seed

  • binSampleCount (int) – Number of samples considered at computing histogram bins

  • boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence

  • boostingType (str) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

  • catSmooth (float) – Reduces the effect of noise in categorical features, especially for categories with few data points

  • categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column

  • categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column

  • catl2 (float) – L2 regularization in categorical split

  • chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

  • dataRandomSeed (int) – Random seed for sampling data to construct histogram bins.

  • dataTransferMode (str) – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

  • defaultListenPort (int) – The default listen port on executors, used for testing

  • deterministic (bool) – Used only with the CPU device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true

  • driverListenPort (int) – The listen port on a driver. Default value is 0 (random)

  • dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout

  • dropSeed (int) – Random seed to choose dropping models. Only used in dart.

  • earlyStoppingRound (int) – Early stopping round

  • executionMode (str) – Deprecated. Please use dataTransferMode.

  • extraSeed (int) – Random seed for selecting threshold when extra_trees is true

  • featureFraction (float) – Feature fraction

  • featureFractionByNode (float) – Feature fraction by node

  • featureFractionSeed (int) – Feature fraction seed

  • featuresCol (str) – features column name

  • featuresShapCol (str) – Output SHAP vector column name after prediction containing the feature contribution values

  • fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

  • improvementTolerance (float) – Tolerance to consider improvement in metric

  • initScoreCol (str) – The name of the initial score column, used for continued training

  • isEnableSparse (bool) – Used to enable/disable sparse optimization

  • isProvideTrainingMetric (bool) – Whether to output metric results over the training dataset.

  • isUnbalance (bool) – Set to true if training data is unbalanced in binary classification scenario

  • labelCol (str) – label column name

  • lambdaL1 (float) – L1 regularization

  • lambdaL2 (float) – L2 regularization

  • leafPredictionCol (str) – Column name for the predicted leaf indices

  • learningRate (float) – Learning rate or shrinkage rate

  • matrixType (str) – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

  • maxBin (int) – Max bin

  • maxBinByFeature (list) – Max number of bins for each feature

  • maxCatThreshold (int) – Limits the number of split points considered for categorical features

  • maxCatToOnehot (int) – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

  • maxDeltaStep (float) – Used to limit the max output of tree leaves

  • maxDepth (int) – Max depth

  • maxDrop (int) – Max number of dropped trees during one boosting iteration

  • maxNumClasses (int) – Maximum number of classes allowed when inferring numClass in multi-class classification.

  • maxStreamingOMPThreads (int) – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

  • metric (str) – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

  • microBatchSize (int) – Specify how many elements are sent in a streaming micro-batch.

  • minDataInLeaf (int) – Minimal number of data in one leaf. Can be used to deal with over-fitting.

  • minDataPerBin (int) – Minimal number of data inside one bin

  • minDataPerGroup (int) – Minimal number of data per categorical group

  • minGainToSplit (float) – The minimal gain to perform split

  • minSumHessianInLeaf (float) – Minimal sum hessian in one leaf

  • modelString (str) – LightGBM model to retrain

  • monotoneConstraints (list) – Constraints for monotonic features. 1 means increasing, -1 means decreasing, 0 means no constraint. Specify all features in order.

  • monotoneConstraintsMethod (str) – Monotone constraints method. basic, intermediate, or advanced.

  • monotonePenalty (float) – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

  • negBaggingFraction (float) – Negative Bagging fraction

  • numBatches (int) – If greater than 0, splits data into separate batches during training

  • numIterations (int) – Number of iterations, LightGBM constructs num_class * num_iterations trees

  • numLeaves (int) – Number of leaves

  • numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

  • numThreads (int) – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

  • objective (str) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

  • objectiveSeed (int) – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

  • otherRate (float) – The retain ratio of small gradient data. Only used in goss.

  • parallelism (str) – Tree learner parallelism, can be set to data_parallel or voting_parallel

  • passThroughArgs (str) – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

  • posBaggingFraction (float) – Positive Bagging fraction

  • predictDisableShapeCheck (bool) – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

  • predictionCol (str) – prediction column name

  • probabilityCol (str) – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities

  • rawPredictionCol (str) – raw prediction (a.k.a. confidence) column name

  • referenceDataset (list) – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

  • repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.

  • samplingMode (str) – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Default is subset.

  • samplingSubsetSize (int) – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

  • seed (int) – Main seed, used to generate other seeds

  • skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration

  • slotNames (list) – List of slot names in the features column

  • thresholds (list) – Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold

  • timeout (float) – Timeout in seconds

  • topK (int) – The top_k value used in voting parallel. A larger value gives more accurate results but slows down training. Must be greater than 0

  • topRate (float) – The retain ratio of large gradient data. Only used in goss.

  • uniformDrop (bool) – Set this to true to use uniform drop in dart mode

  • useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.

  • useMissing (bool) – Set this to false to disable the special handling of missing values

  • useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

  • validationIndicatorCol (str) – Indicates whether the row is for training or validation

  • verbosity (int) – Verbosity level: < 0 is Fatal, 0 is Error, 1 is Info, > 1 is Debug

  • weightCol (str) – The name of the weight column

  • xGBoostDartMode (bool) – Set this to true to use xgboost dart mode

  • zeroAsMissing (bool) – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
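
As an illustration of how the parameters above fit together, the following is a minimal sketch of a keyword-argument set for the constructor. Every key is a parameter name from the signature; the values are illustrative assumptions, not recommendations. Since force_row_wise does not appear in the signature, the sketch routes it through passThroughArgs, following the note on deterministic.

```python
# Illustrative keyword arguments for LightGBMClassifier. The keys come from
# the constructor signature; the values are assumptions chosen for the example.
classifier_kwargs = {
    "objective": "multiclass",
    "numIterations": 200,
    "learningRate": 0.05,
    "numLeaves": 31,
    "deterministic": True,
    # force_row_wise is not in the signature, so it is passed through as a
    # raw LightGBM parameter, as the deterministic description suggests.
    "passThroughArgs": "force_row_wise=true",
    "seed": 42,
}

# With SynapseML installed and a Spark session available, the dictionary
# could be unpacked into the estimator (train_df is a hypothetical DataFrame):
#   from synapse.ml.lightgbm import LightGBMClassifier
#   model = LightGBMClassifier(**classifier_kwargs).fit(train_df)
print(sorted(classifier_kwargs))
```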

baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
catSmooth = Param(parent='undefined', name='catSmooth', doc='this can reduce the effect of noises in categorical features, especially for categories with few data')
categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
catl2 = Param(parent='undefined', name='catl2', doc='L2 regularization in categorical split')
chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native.  If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy.If dataset size is known beforehand, set to the number of rows in the dataset.')
dataRandomSeed = Param(parent='undefined', name='dataRandomSeed', doc='Random seed for sampling data to construct histogram bins.')
dataTransferMode = Param(parent='undefined', name='dataTransferMode', doc='Specify how SynapseML transfers data from Spark to LightGBM.  Values can be streaming, bulk. Default is bulk, which is the legacy mode.')
defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
deterministic = Param(parent='undefined', name='deterministic', doc='Used only with cpu devide type. Setting this to true should ensure stable results when using the same data and the same parameters.  Note: setting this to true may slow down training.  To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true')
driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
dropSeed = Param(parent='undefined', name='dropSeed', doc='Random seed to choose dropping models. Only used in dart.')
earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
executionMode = Param(parent='undefined', name='executionMode', doc='Deprecated. Please use dataTransferMode.')
extraSeed = Param(parent='undefined', name='extraSeed', doc='Random seed for selecting threshold when extra_trees is true')
featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
featureFractionByNode = Param(parent='undefined', name='featureFractionByNode', doc='Feature fraction by node')
featureFractionSeed = Param(parent='undefined', name='featureFractionSeed', doc='Feature fraction seed')
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
getBaggingFraction()[source]
Returns

Bagging fraction

Return type

baggingFraction

getBaggingFreq()[source]
Returns

Bagging frequency

Return type

baggingFreq

getBaggingSeed()[source]
Returns

Bagging seed

Return type

baggingSeed

getBinSampleCount()[source]
Returns

Number of samples considered at computing histogram bins

Return type

binSampleCount

getBoostFromAverage()[source]
Returns

Adjusts initial score to the mean of labels for faster convergence

Return type

boostFromAverage

getBoostingType()[source]
Returns

Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

Return type

boostingType

getCatSmooth()[source]
Returns

Reduces the effect of noise in categorical features, especially for categories with few data points

Return type

catSmooth

getCategoricalSlotIndexes()[source]
Returns

List of categorical column indexes, the slot index in the features column

Return type

categoricalSlotIndexes

getCategoricalSlotNames()[source]
Returns

List of categorical column slot names, the slot name in the features column

Return type

categoricalSlotNames

getCatl2()[source]
Returns

L2 regularization in categorical split

Return type

catl2

getChunkSize()[source]
Returns

Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

Return type

chunkSize

getDataRandomSeed()[source]
Returns

Random seed for sampling data to construct histogram bins.

Return type

dataRandomSeed

getDataTransferMode()[source]
Returns

Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

Return type

dataTransferMode

getDefaultListenPort()[source]
Returns

The default listen port on executors, used for testing

Return type

defaultListenPort

getDeterministic()[source]
Returns

Used only with the CPU device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true

Return type

deterministic

getDriverListenPort()[source]
Returns

The listen port on a driver. Default value is 0 (random)

Return type

driverListenPort

getDropRate()[source]
Returns

Dropout rate: a fraction of previous trees to drop during the dropout

Return type

dropRate

getDropSeed()[source]
Returns

Random seed to choose dropping models. Only used in dart.

Return type

dropSeed

getEarlyStoppingRound()[source]
Returns

Early stopping round

Return type

earlyStoppingRound

getExecutionMode()[source]
Returns

Deprecated. Please use dataTransferMode.

Return type

executionMode

getExtraSeed()[source]
Returns

Random seed for selecting threshold when extra_trees is true

Return type

extraSeed

getFeatureFraction()[source]
Returns

Feature fraction

Return type

featureFraction

getFeatureFractionByNode()[source]
Returns

Feature fraction by node

Return type

featureFractionByNode

getFeatureFractionSeed()[source]
Returns

Feature fraction seed

Return type

featureFractionSeed

getFeaturesCol()[source]
Returns

features column name

Return type

featuresCol

getFeaturesShapCol()[source]
Returns

Output SHAP vector column name after prediction containing the feature contribution values

Return type

featuresShapCol

getFobj()[source]
Returns

Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

Return type

fobj

getImprovementTolerance()[source]
Returns

Tolerance to consider improvement in metric

Return type

improvementTolerance

getInitScoreCol()[source]
Returns

The name of the initial score column, used for continued training

Return type

initScoreCol

getIsEnableSparse()[source]
Returns

Used to enable/disable sparse optimization

Return type

isEnableSparse

getIsProvideTrainingMetric()[source]
Returns

Whether to output metric results over the training dataset.

Return type

isProvideTrainingMetric

getIsUnbalance()[source]
Returns

Set to true if training data is unbalanced in binary classification scenario

Return type

isUnbalance

static getJavaPackage()[source]

Returns package name String.

getLabelCol()[source]
Returns

label column name

Return type

labelCol

getLambdaL1()[source]
Returns

L1 regularization

Return type

lambdaL1

getLambdaL2()[source]
Returns

L2 regularization

Return type

lambdaL2

getLeafPredictionCol()[source]
Returns

Column name for the predicted leaf indices

Return type

leafPredictionCol

getLearningRate()[source]
Returns

Learning rate or shrinkage rate

Return type

learningRate

getMatrixType()[source]
Returns

Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

Return type

matrixType

getMaxBin()[source]
Returns

Max bin

Return type

maxBin

getMaxBinByFeature()[source]
Returns

Max number of bins for each feature

Return type

maxBinByFeature

getMaxCatThreshold()[source]
Returns

Limits the number of split points considered for categorical features

Return type

maxCatThreshold

getMaxCatToOnehot()[source]
Returns

When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

Return type

maxCatToOnehot

getMaxDeltaStep()[source]
Returns

Used to limit the max output of tree leaves

Return type

maxDeltaStep

getMaxDepth()[source]
Returns

Max depth

Return type

maxDepth

getMaxDrop()[source]
Returns

Max number of dropped trees during one boosting iteration

Return type

maxDrop

getMaxNumClasses()[source]
Returns

Maximum number of classes allowed when inferring numClass in multi-class classification.

Return type

maxNumClasses

getMaxStreamingOMPThreads()[source]
Returns

Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

Return type

maxStreamingOMPThreads

getMetric()[source]
Returns

Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

Return type

metric

getMicroBatchSize()[source]
Returns

Specify how many elements are sent in a streaming micro-batch.

Return type

microBatchSize

getMinDataInLeaf()[source]
Returns

Minimal number of data in one leaf. Can be used to deal with over-fitting.

Return type

minDataInLeaf

getMinDataPerBin()[source]
Returns

Minimal number of data inside one bin

Return type

minDataPerBin

getMinDataPerGroup()[source]
Returns

Minimal number of data per categorical group

Return type

minDataPerGroup

getMinGainToSplit()[source]
Returns

The minimal gain to perform split

Return type

minGainToSplit

getMinSumHessianInLeaf()[source]
Returns

Minimal sum hessian in one leaf

Return type

minSumHessianInLeaf

getModelString()[source]
Returns

LightGBM model to retrain

Return type

modelString

getMonotoneConstraints()[source]
Returns

Constraints for monotonic features. 1 means increasing, -1 means decreasing, 0 means no constraint. Specify all features in order.

Return type

monotoneConstraints
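
Because the constraint list must cover all features in order, it can help to derive it from the feature names. The sketch below assumes a hypothetical helper, build_monotone_constraints, and hypothetical feature names; neither is part of SynapseML.

```python
def build_monotone_constraints(slot_names, directions):
    """Return one constraint per feature, in slot order (0 = no constraint).

    slot_names: feature names in the order they appear in the features column.
    directions: mapping of feature name to 1 (increasing) or -1 (decreasing).
    """
    unknown = set(directions) - set(slot_names)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    for name, direction in directions.items():
        if direction not in (-1, 0, 1):
            raise ValueError(f"invalid constraint for {name}: {direction}")
    # Features without an explicit direction are left unconstrained.
    return [directions.get(name, 0) for name in slot_names]


# Hypothetical features: "age" must be increasing, "price" decreasing.
constraints = build_monotone_constraints(
    ["age", "income", "price"], {"age": 1, "price": -1}
)
print(constraints)  # [1, 0, -1]
```

The resulting list could then be supplied as monotoneConstraints.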

getMonotoneConstraintsMethod()[source]
Returns

Monotone constraints method. basic, intermediate, or advanced.

Return type

monotoneConstraintsMethod

getMonotonePenalty()[source]
Returns

A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

Return type

monotonePenalty

getNegBaggingFraction()[source]
Returns

Negative Bagging fraction

Return type

negBaggingFraction

getNumBatches()[source]
Returns

If greater than 0, splits data into separate batches during training

Return type

numBatches

getNumIterations()[source]
Returns

Number of iterations, LightGBM constructs num_class * num_iterations trees

Return type

numIterations
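
The tree count stated above is simple arithmetic. A minimal sketch, assuming num_class is 1 for the binary objective (one tree per iteration) and equals the number of classes for multiclass objectives; total_trees is a hypothetical helper name:

```python
def total_trees(num_iterations, num_class=1):
    """Number of trees built: num_class * num_iterations."""
    if num_iterations < 0 or num_class < 1:
        raise ValueError("num_iterations must be >= 0 and num_class >= 1")
    return num_class * num_iterations


print(total_trees(100))     # binary objective, default numIterations: 100
print(total_trees(100, 3))  # three-class problem: 300
```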

getNumLeaves()[source]
Returns

Number of leaves

Return type

numLeaves

getNumTasks()[source]
Returns

Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

Return type

numTasks

getNumThreads()[source]
Returns

Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

Return type

numThreads

getObjective()[source]
Returns

The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

Return type

objective

getObjectiveSeed()[source]
Returns

Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

Return type

objectiveSeed

getOtherRate()[source]
Returns

The retain ratio of small gradient data. Only used in goss.

Return type

otherRate

getParallelism()[source]
Returns

Tree learner parallelism, can be set to data_parallel or voting_parallel

Return type

parallelism

getPassThroughArgs()[source]
Returns

Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

Return type

passThroughArgs
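
When several native parameters need to be passed through, the string can be assembled from a dictionary. This is a sketch only: build_pass_through_args is a hypothetical helper, and the space-separated key=value layout is an assumption based on LightGBM's native parameter-string format.

```python
def build_pass_through_args(params):
    """Join native LightGBM parameters into a single key=value string.

    Assumes entries are separated by spaces; booleans are written in
    lowercase to match the force_row_wise=true example in the description.
    """
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)


args = build_pass_through_args(
    {"force_row_wise": True, "min_sum_hessian_in_leaf": 0.002}
)
print(args)  # force_row_wise=true min_sum_hessian_in_leaf=0.002
```

Keep in mind that values passed this way override the same parameters set through explicit setters.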

getPosBaggingFraction()[source]
Returns

Positive Bagging fraction

Return type

posBaggingFraction

getPredictDisableShapeCheck()[source]
Returns

control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

Return type

predictDisableShapeCheck

getPredictionCol()[source]
Returns

prediction column name

Return type

predictionCol

getProbabilityCol()[source]
Returns

Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities

Return type

probabilityCol

getRawPredictionCol()[source]
Returns

raw prediction (a.k.a. confidence) column name

Return type

rawPredictionCol

getReferenceDataset()[source]
Returns

The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

Return type

referenceDataset

getRepartitionByGroupingColumn()[source]
Returns

Repartition training data according to grouping column, on by default.

Return type

repartitionByGroupingColumn

getSamplingMode()[source]
Returns

Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Default is subset.

Return type

samplingMode

getSamplingSubsetSize()[source]
Returns

Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

Return type

samplingSubsetSize
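
The ‘subset’ behavior described above can be pictured as a two-step selection: restrict to the first N rows, then draw binSampleCount of them. The sketch below mirrors that description in plain Python; it is not SynapseML's implementation, and sample_for_bins is a hypothetical name.

```python
import random


def sample_for_bins(num_rows, sampling_subset_size, bin_sample_count, seed=1):
    """Pick row indices for bin construction in the style of 'subset' mode."""
    # Step 1: only the first N rows are candidates.
    candidates = range(min(num_rows, sampling_subset_size))
    # Step 2: draw up to bin_sample_count distinct rows from the candidates.
    k = min(bin_sample_count, len(candidates))
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, k))


picked = sample_for_bins(num_rows=1000, sampling_subset_size=100, bin_sample_count=10)
print(len(picked), max(picked) < 100)
```

This also shows why the mode suits large datasets whose rows are already in random order: rows beyond the first N are never read for binning.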

getSeed()[source]
Returns

Main seed, used to generate other seeds

Return type

seed

getSkipDrop()[source]
Returns

Probability of skipping the dropout procedure during a boosting iteration

Return type

skipDrop

getSlotNames()[source]
Returns

List of slot names in the features column

Return type

slotNames

getThresholds()[source]
Returns

Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold

Return type

thresholds
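
The p/t rule described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the function name is hypothetical, and the handling of a zero threshold (that class wins whenever its probability is positive) is an assumption.

```python
import math

def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p/t ratio."""
    if len(probabilities) != len(thresholds):
        raise ValueError("thresholds must have one entry per class")
    if any(t < 0 for t in thresholds) or sum(1 for t in thresholds if t == 0) > 1:
        raise ValueError("thresholds must be > 0, with at most one equal to 0")

    best_index, best_ratio = 0, -math.inf
    for i, (p, t) in enumerate(zip(probabilities, thresholds)):
        if t == 0:
            # Assumption: a zero threshold wins whenever its class has p > 0.
            ratio = math.inf if p > 0 else -math.inf
        else:
            ratio = p / t
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    return best_index

probs = [0.5, 0.3, 0.2]
print(predict_with_thresholds(probs, [1.0, 1.0, 1.0]))  # equal thresholds: plain argmax
print(predict_with_thresholds(probs, [0.8, 0.3, 0.4]))  # a low threshold favours class 1
```

Lowering a class's threshold makes that class easier to predict, because its ratio p/t grows.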

getTimeout()[source]
Returns

Timeout in seconds

Return type

timeout

getTopK()[source]
Returns

The top_k value used in voting parallel. A larger value gives more accurate results but slows down training. It should be greater than 0

Return type

topK

getTopRate()[source]
Returns

The retain ratio of large gradient data. Only used in goss.

Return type

topRate

getUniformDrop()[source]
Returns

Set this to true to use uniform drop in dart mode

Return type

uniformDrop

getUseBarrierExecutionMode()[source]
Returns

Barrier execution mode which uses a barrier stage, off by default.

Return type

useBarrierExecutionMode

getUseMissing()[source]
Returns

Set this to false to disable the special handling of missing values

Return type

useMissing

getUseSingleDatasetMode()[source]
Returns

Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

Return type

useSingleDatasetMode

getValidationIndicatorCol()[source]
Returns

Indicates whether the row is for training or validation

Return type

validationIndicatorCol

getVerbosity()[source]
Returns

Verbosity level: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug

Return type

verbosity
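
The verbosity ranges above can be read as a simple mapping. The helper below is hypothetical and only restates the documented ranges; it is not part of the library.

```python
def verbosity_level(verbosity):
    """Map an integer verbosity to the level named in the parameter doc."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"

for value in (-1, 0, 1, 2):
    print(value, verbosity_level(value))
```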

getWeightCol()[source]
Returns

The name of the weight column

Return type

weightCol

getXGBoostDartMode()[source]
Returns

Set this to true to use xgboost dart mode

Return type

xGBoostDartMode

getZeroAsMissing()[source]
Returns

Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values

Return type

zeroAsMissing

improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
isEnableSparse = Param(parent='undefined', name='isEnableSparse', doc='Used to enable/disable sparse optimization')
isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether output metric result over training dataset.')
isUnbalance = Param(parent='undefined', name='isUnbalance', doc='Set to true if training data is unbalanced in binary classification scenario')
labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices's column name")
learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense.  Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
maxCatThreshold = Param(parent='undefined', name='maxCatThreshold', doc='limit number of split points considered for categorical features')
maxCatToOnehot = Param(parent='undefined', name='maxCatToOnehot', doc='when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used')
maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
maxNumClasses = Param(parent='undefined', name='maxNumClasses', doc='Number of max classes to infer numClass in multi-class classification.')
maxStreamingOMPThreads = Param(parent='undefined', name='maxStreamingOMPThreads', doc="Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it's best to set a fixed value.")
metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data.  Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
microBatchSize = Param(parent='undefined', name='microBatchSize', doc='Specify how many elements are sent in a streaming micro-batch.')
minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
minDataPerBin = Param(parent='undefined', name='minDataPerBin', doc='Minimal number of data inside one bin')
minDataPerGroup = Param(parent='undefined', name='minDataPerGroup', doc='minimal number of data per categorical group')
minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
monotoneConstraints = Param(parent='undefined', name='monotoneConstraints', doc='used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.')
monotoneConstraintsMethod = Param(parent='undefined', name='monotoneConstraintsMethod', doc='Monotone constraints method. basic, intermediate, or advanced.')
monotonePenalty = Param(parent='undefined', name='monotonePenalty', doc='A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.')
negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks.  SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.')
objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
objectiveSeed = Param(parent='undefined', name='objectiveSeed', doc='Random seed for objectives, if random process is needed.  Currently used only for rank_xendcg objective.')
otherRate = Param(parent='undefined', name='otherRate', doc='The retain ratio of small gradient data. Only used in goss.')
parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
passThroughArgs = Param(parent='undefined', name='passThroughArgs', doc='Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true')
posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
predictDisableShapeCheck = Param(parent='undefined', name='predictDisableShapeCheck', doc='control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data')
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities')
rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')
classmethod read()[source]

Returns an MLReader instance for this class.

referenceDataset = Param(parent='undefined', name='referenceDataset', doc='The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().')
repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
samplingMode = Param(parent='undefined', name='samplingMode', doc="Data sampling for streaming mode. Sampled data is used to define bins. 'global': sample from all data, 'subset': sample from first N rows, or 'fixed': Take first N rows as sample.Values can be global, subset, or fixed. Default is subset.")
samplingSubsetSize = Param(parent='undefined', name='samplingSubsetSize', doc="Specify subset size N for the sampling mode 'subset'. 'binSampleCount' rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.")
seed = Param(parent='undefined', name='seed', doc='Main seed, used to generate other seeds')
setBaggingFraction(value)[source]
Parameters

baggingFraction – Bagging fraction

setBaggingFreq(value)[source]
Parameters

baggingFreq – Bagging frequency

setBaggingSeed(value)[source]
Parameters

baggingSeed – Bagging seed

setBinSampleCount(value)[source]
Parameters

binSampleCount – Number of samples considered when computing histogram bins

setBoostFromAverage(value)[source]
Parameters

boostFromAverage – Adjusts initial score to the mean of labels for faster convergence

setBoostingType(value)[source]
Parameters

boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

setCatSmooth(value)[source]
Parameters

catSmooth – Reduces the effect of noise in categorical features, especially for categories with few data points

setCategoricalSlotIndexes(value)[source]
Parameters

categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column

setCategoricalSlotNames(value)[source]
Parameters

categoricalSlotNames – List of categorical column slot names, the slot name in the features column

setCatl2(value)[source]
Parameters

catl2 – L2 regularization in categorical split

setChunkSize(value)[source]
Parameters

chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.

setDataRandomSeed(value)[source]
Parameters

dataRandomSeed – Random seed for sampling data to construct histogram bins.

setDataTransferMode(value)[source]
Parameters

dataTransferMode – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

setDefaultListenPort(value)[source]
Parameters

defaultListenPort – The default listen port on executors, used for testing

setDeterministic(value)[source]
Parameters

deterministic – Used only with the CPU device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true

setDriverListenPort(value)[source]
Parameters

driverListenPort – The listen port on a driver. Default value is 0 (random)

setDropRate(value)[source]
Parameters

dropRate – Dropout rate: a fraction of previous trees to drop during the dropout

setDropSeed(value)[source]
Parameters

dropSeed – Random seed to choose dropping models. Only used in dart.

setEarlyStoppingRound(value)[source]
Parameters

earlyStoppingRound – Early stopping round

setExecutionMode(value)[source]
Parameters

executionMode – Deprecated. Please use dataTransferMode.

setExtraSeed(value)[source]
Parameters

extraSeed – Random seed for selecting threshold when extra_trees is true

setFeatureFraction(value)[source]
Parameters

featureFraction – Feature fraction

setFeatureFractionByNode(value)[source]
Parameters

featureFractionByNode – Feature fraction by node

setFeatureFractionSeed(value)[source]
Parameters

featureFractionSeed – Feature fraction seed

setFeaturesCol(value)[source]
Parameters

featuresCol – features column name

setFeaturesShapCol(value)[source]
Parameters

featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values

setFobj(value)[source]
Parameters

fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

setImprovementTolerance(value)[source]
Parameters

improvementTolerance – Tolerance to consider improvement in metric

setInitScoreCol(value)[source]
Parameters

initScoreCol – The name of the initial score column, used for continued training

setIsEnableSparse(value)[source]
Parameters

isEnableSparse – Used to enable/disable sparse optimization

setIsProvideTrainingMetric(value)[source]
Parameters

isProvideTrainingMetric – Whether to output metric results over the training dataset.

setIsUnbalance(value)[source]
Parameters

isUnbalance – Set to true if training data is unbalanced in binary classification scenario

setLabelCol(value)[source]
Parameters

labelCol – label column name

setLambdaL1(value)[source]
Parameters

lambdaL1 – L1 regularization

setLambdaL2(value)[source]
Parameters

lambdaL2 – L2 regularization

setLeafPredictionCol(value)[source]
Parameters

leafPredictionCol – Column name for the predicted leaf indices

setLearningRate(value)[source]
Parameters

learningRate – Learning rate or shrinkage rate

setMatrixType(value)[source]
Parameters

matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

setMaxBin(value)[source]
Parameters

maxBin – Max bin

setMaxBinByFeature(value)[source]
Parameters

maxBinByFeature – Max number of bins for each feature

setMaxCatThreshold(value)[source]
Parameters

maxCatThreshold – Limits the number of split points considered for categorical features

setMaxCatToOnehot(value)[source]
Parameters

maxCatToOnehot – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

setMaxDeltaStep(value)[source]
Parameters

maxDeltaStep – Used to limit the max output of tree leaves

setMaxDepth(value)[source]
Parameters

maxDepth – Max depth

setMaxDrop(value)[source]
Parameters

maxDrop – Max number of dropped trees during one boosting iteration

setMaxNumClasses(value)[source]
Parameters

maxNumClasses – Number of max classes to infer numClass in multi-class classification.

setMaxStreamingOMPThreads(value)[source]
Parameters

maxStreamingOMPThreads – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

setMetric(value)[source]
Parameters

metric – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

setMicroBatchSize(value)[source]
Parameters

microBatchSize – Specify how many elements are sent in a streaming micro-batch.
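
As an illustration of what a micro-batch size controls, the sketch below groups a stream of elements into batches of at most `micro_batch_size`. The generator is hypothetical and does not reflect how SynapseML streams data internally.

```python
from itertools import islice

def micro_batches(elements, micro_batch_size=100):
    """Yield consecutive lists holding at most micro_batch_size elements."""
    iterator = iter(elements)
    while True:
        batch = list(islice(iterator, micro_batch_size))
        if not batch:
            return
        yield batch

sizes = [len(batch) for batch in micro_batches(range(250), micro_batch_size=100)]
print(sizes)
```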

setMinDataInLeaf(value)[source]
Parameters

minDataInLeaf – Minimum number of data points in one leaf. Can be used to deal with over-fitting.

setMinDataPerBin(value)[source]
Parameters

minDataPerBin – Minimal number of data inside one bin

setMinDataPerGroup(value)[source]
Parameters

minDataPerGroup – minimal number of data per categorical group

setMinGainToSplit(value)[source]
Parameters

minGainToSplit – The minimal gain to perform split

setMinSumHessianInLeaf(value)[source]
Parameters

minSumHessianInLeaf – Minimal sum hessian in one leaf

setModelString(value)[source]
Parameters

modelString – LightGBM model to retrain

setMonotoneConstraints(value)[source]
Parameters

monotoneConstraints – Monotonic constraints for features. 1 means increasing, -1 means decreasing, 0 means no constraint. Specify all features in order.
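
Because the constraint list must cover every feature in order, it can help to derive it from the slot names. The helper below is a hypothetical sketch, and the feature names are invented for illustration.

```python
def build_monotone_constraints(slot_names, directions):
    """Return one constraint per feature slot, in slot order (hypothetical helper)."""
    unknown = set(directions) - set(slot_names)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    if any(d not in (-1, 0, 1) for d in directions.values()):
        raise ValueError("each constraint must be 1, -1, or 0")
    # Features without an explicit direction are left unconstrained (0).
    return [directions.get(name, 0) for name in slot_names]

slots = ["age", "income", "num_accounts"]  # illustrative feature names
constraints = build_monotone_constraints(slots, {"income": 1, "age": -1})
print(constraints)
```

The resulting list is what would be passed to setMonotoneConstraints, assuming the slot order matches the order of the features vector.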

setMonotoneConstraintsMethod(value)[source]
Parameters

monotoneConstraintsMethod – Monotone constraints method. basic, intermediate, or advanced.

setMonotonePenalty(value)[source]
Parameters

monotonePenalty – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

setNegBaggingFraction(value)[source]
Parameters

negBaggingFraction – Negative Bagging fraction

setNumBatches(value)[source]
Parameters

numBatches – If greater than 0, splits data into separate batches during training

setNumIterations(value)[source]
Parameters

numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
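
The tree count stated above is a simple product. A tiny sketch (the function name is hypothetical):

```python
def total_trees(num_class, num_iterations):
    """Number of trees built: one per class per boosting iteration."""
    return num_class * num_iterations

# Three-class problem with the default 100 iterations.
print(total_trees(num_class=3, num_iterations=100))
```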

setNumLeaves(value)[source]
Parameters

numLeaves – Number of leaves

setNumTasks(value)[source]
Parameters

numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

setNumThreads(value)[source]
Parameters

numThreads – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

setObjective(value)[source]
Parameters

objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

setObjectiveSeed(value)[source]
Parameters

objectiveSeed – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

setOtherRate(value)[source]
Parameters

otherRate – The retain ratio of small gradient data. Only used in goss.

setParallelism(value)[source]
Parameters

parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel

setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='bulk', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=-1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Set the (keyword only) parameters
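
To show what "keyword only" means in practice, the sketch below uses a small stand-in class. It is not the SynapseML estimator: it mirrors just two of the parameters listed above, and it relies on Python's bare `*` syntax, which may differ from the mechanism the library uses.

```python
class StandInEstimator:
    """Stand-in class for illustration; not the SynapseML estimator."""

    def __init__(self):
        self.params = {}

    def setParams(self, *, numLeaves=31, learningRate=0.1):
        # The bare * makes every parameter keyword-only.
        self.params.update(numLeaves=numLeaves, learningRate=learningRate)
        return self

est = StandInEstimator().setParams(numLeaves=63, learningRate=0.05)
print(est.params)

try:
    StandInEstimator().setParams(63, 0.05)  # positional arguments are rejected
except TypeError as err:
    print("rejected:", err)
```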

setPassThroughArgs(value)[source]
Parameters

passThroughArgs – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
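
A pass-through string with several parameters can be assembled from a dictionary. The helper below is hypothetical, and it assumes the parameters are written as space-separated key=value pairs with lowercase booleans, following the force_row_wise=true example above; check the SynapseML documentation for the exact format your version expects.

```python
def build_pass_through_args(params):
    """Render a dict as space-separated key=value pairs (hypothetical helper)."""
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Assumption: booleans are written in lowercase, as in force_row_wise=true.
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)

args = build_pass_through_args({"force_row_wise": True, "min_data_in_leaf": 5})
print(args)
```

The resulting string is the kind of value that would be handed to setPassThroughArgs. Since pass-through values override explicit setters, it is safer not to set the same parameter in both places.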

setPosBaggingFraction(value)[source]
Parameters

posBaggingFraction – Positive Bagging fraction

setPredictDisableShapeCheck(value)[source]
Parameters

predictDisableShapeCheck – Controls whether LightGBM raises an error when you predict on data with a different number of features than the training data

setPredictionCol(value)[source]
Parameters

predictionCol – prediction column name

setProbabilityCol(value)[source]
Parameters

probabilityCol – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities

setRawPredictionCol(value)[source]
Parameters

rawPredictionCol – raw prediction (a.k.a. confidence) column name

setReferenceDataset(value)[source]
Parameters

referenceDataset – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

setRepartitionByGroupingColumn(value)[source]
Parameters

repartitionByGroupingColumn – Repartition training data according to grouping column, on by default.

setSamplingMode(value)[source]
Parameters

samplingMode – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data; ‘subset’: sample from the first N rows; ‘fixed’: take the first N rows as the sample. Values can be global, subset, or fixed. Default is subset.

setSamplingSubsetSize(value)[source]
Parameters

samplingSubsetSize – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
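
The three sampling modes can be contrasted with a small sketch that picks the row indices used to define bins. This is an interpretation of the descriptions above, not the library's code: the function is hypothetical, and for ‘fixed’ it assumes that N means binSampleCount.

```python
import random

def choose_bin_sample(num_rows, mode="subset", bin_sample_count=200000,
                      subset_size=1000000, seed=1):
    """Return the row indices used to define bins under each sampling mode."""
    rng = random.Random(seed)
    k = min(bin_sample_count, num_rows)
    if mode == "global":
        # Sample from every row in the dataset.
        return sorted(rng.sample(range(num_rows), k))
    if mode == "subset":
        # Sample only from the first subset_size rows.
        window = min(subset_size, num_rows)
        return sorted(rng.sample(range(window), min(k, window)))
    if mode == "fixed":
        # Assumption: take the first bin_sample_count rows with no randomness.
        return list(range(k))
    raise ValueError(f"unknown sampling mode: {mode}")

print(choose_bin_sample(100, "fixed", bin_sample_count=5))
print(choose_bin_sample(100, "subset", bin_sample_count=5, subset_size=20))
```

Under this reading, ‘subset’ avoids scanning the whole dataset while still randomizing which of the first N rows define the bins.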

setSeed(value)[source]
Parameters

seed – Main seed, used to generate other seeds

setSkipDrop(value)[source]
Parameters

skipDrop – Probability of skipping the dropout procedure during a boosting iteration

setSlotNames(value)[source]
Parameters

slotNames – List of slot names in the features column

setThresholds(value)[source]
Parameters

thresholds – Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold

setTimeout(value)[source]
Parameters

timeout – Timeout in seconds

setTopK(value)[source]
Parameters

topK – The top_k value used in voting parallel. A larger value gives more accurate results but slows down training. It should be greater than 0

setTopRate(value)[source]
Parameters

topRate – The retain ratio of large gradient data. Only used in goss.
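
topRate and otherRate together describe Gradient-based One-Side Sampling (goss): keep the rows with the largest gradients, randomly sample from the rest, and up-weight the sampled rows. The following is a minimal sketch of that idea using the standard library; the function name and the toy gradients are illustrative, and it is not LightGBM's implementation.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (row_index, weight) pairs chosen by a GOSS-style sampler."""
    if not (0 < other_rate and 0 <= top_rate and top_rate + other_rate <= 1):
        raise ValueError("need other_rate > 0 and top_rate + other_rate <= 1")
    n = len(gradients)
    top_n = int(top_rate * n)
    other_n = int(other_rate * n)

    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_rows, rest = order[:top_n], order[top_n:]

    # Randomly keep a share of the small-gradient rows.
    rng = random.Random(seed)
    sampled_rest = rng.sample(rest, min(other_n, len(rest)))

    # Up-weight the sampled small-gradient rows to compensate for the rows dropped.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top_rows] + [(i, amplify) for i in sampled_rest]

grads = [0.9, -0.1, 0.05, -0.8, 0.2, 0.01, -0.3, 0.02, 0.4, -0.05]
selected = goss_sample(grads)
print(selected)
```

With the default rates, 20% of rows are kept for their large gradients and another 10% are sampled from the remainder.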

setUniformDrop(value)[source]
Parameters

uniformDrop – Set this to true to use uniform drop in dart mode
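
The dart parameters (dropRate, maxDrop, skipDrop, uniformDrop) interact within each boosting iteration. The sketch below is a simplified reading of their descriptions with uniform dropping, not LightGBM's actual algorithm; the function name is hypothetical and the truncation used to enforce maxDrop is an assumption.

```python
import random

def choose_trees_to_drop(num_trees, drop_rate=0.1, max_drop=50, skip_drop=0.5, seed=0):
    """Pick indices of existing trees to drop for one boosting iteration."""
    rng = random.Random(seed)
    # With probability skip_drop, dropout is skipped for this iteration.
    if rng.random() < skip_drop:
        return []
    # Uniform drop: every existing tree is dropped with probability drop_rate.
    dropped = [i for i in range(num_trees) if rng.random() < drop_rate]
    # Simplification: cap the number of dropped trees at max_drop by truncating.
    return dropped[:max_drop] if max_drop > 0 else dropped

print(choose_trees_to_drop(10, skip_drop=1.0))                 # dropout always skipped
print(choose_trees_to_drop(10, drop_rate=1.0, skip_drop=0.0))  # every tree dropped
```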

setUseBarrierExecutionMode(value)[source]
Parameters

useBarrierExecutionMode – Barrier execution mode which uses a barrier stage, off by default.

setUseMissing(value)[source]
Parameters

useMissing – Set this to false to disable the special handling of missing values

setUseSingleDatasetMode(value)[source]
Parameters

useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

setValidationIndicatorCol(value)[source]
Parameters

validationIndicatorCol – Indicates whether the row is for training or validation
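
As a rough picture of how an indicator column separates rows, the sketch below splits plain dictionaries on a boolean field. It assumes that a true value marks a validation row; the column name `isValidation` and the helper are hypothetical.

```python
def split_by_indicator(rows, indicator_col="isValidation"):
    """Split rows into (training, validation) using a boolean column."""
    # Assumption: True marks a validation row, False a training row.
    training = [row for row in rows if not row[indicator_col]]
    validation = [row for row in rows if row[indicator_col]]
    return training, validation

rows = [
    {"label": 1, "isValidation": False},
    {"label": 0, "isValidation": True},
    {"label": 1, "isValidation": False},
]
train_rows, valid_rows = split_by_indicator(rows)
print(len(train_rows), len(valid_rows))
```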

setVerbosity(value)[source]
Parameters

verbosity – Verbosity level: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug

setWeightCol(value)[source]
Parameters

weightCol – The name of the weight column

setXGBoostDartMode(value)[source]
Parameters

xGBoostDartMode – Set this to true to use xgboost dart mode

setZeroAsMissing(value)[source]
Parameters

zeroAsMissing – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values

skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold")
timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0')
topRate = Param(parent='undefined', name='topRate', doc='The retain ratio of large gradient data. Only used in goss.')
uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
useMissing = Param(parent='undefined', name='useMissing', doc='Set this to false to disable the special handle of missing value')
useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.')
validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
xGBoostDartMode = Param(parent='undefined', name='xGBoostDartMode', doc='Set this to true to use xgboost dart mode')
zeroAsMissing = Param(parent='undefined', name='zeroAsMissing', doc='Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values')

synapse.ml.lightgbm.LightGBMRanker module

class synapse.ml.lightgbm.LightGBMRanker.LightGBMRanker(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='bulk', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxNumClasses=100, maxPosition=20, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=-1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters
  • baggingFraction (float) – Bagging fraction

  • baggingFreq (int) – Bagging frequency

  • baggingSeed (int) – Bagging seed

  • binSampleCount (int) – Number of samples considered at computing histogram bins

  • boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence

  • boostingType (str) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

  • catSmooth (float) – this can reduce the effect of noises in categorical features, especially for categories with few data

  • categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column

  • categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column

  • catl2 (float) – L2 regularization in categorical split

  • chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

  • dataRandomSeed (int) – Random seed for sampling data to construct histogram bins.

  • dataTransferMode (str) – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

  • defaultListenPort (int) – The default listen port on executors, used for testing

  • deterministic (bool) – Used only with the CPU device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true.

  • driverListenPort (int) – The listen port on a driver. Default value is 0 (random)

  • dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout

  • dropSeed (int) – Random seed to choose dropping models. Only used in dart.

  • earlyStoppingRound (int) – Early stopping round

  • evalAt (list) – NDCG and MAP evaluation positions, separated by comma

  • executionMode (str) – Deprecated. Please use dataTransferMode.

  • extraSeed (int) – Random seed for selecting threshold when extra_trees is true

  • featureFraction (float) – Feature fraction

  • featureFractionByNode (float) – Feature fraction by node

  • featureFractionSeed (int) – Feature fraction seed

  • featuresCol (str) – features column name

  • featuresShapCol (str) – Output SHAP vector column name after prediction containing the feature contribution values

  • fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

  • groupCol (str) – The name of the group column

  • improvementTolerance (float) – Tolerance to consider improvement in metric

  • initScoreCol (str) – The name of the initial score column, used for continued training

  • isEnableSparse (bool) – Used to enable/disable sparse optimization

  • isProvideTrainingMetric (bool) – Whether to output the metric result over the training dataset.

  • labelCol (str) – label column name

  • labelGain (list) – graded relevance for each label in NDCG

  • lambdaL1 (float) – L1 regularization

  • lambdaL2 (float) – L2 regularization

  • leafPredictionCol (str) – Column name for the predicted leaf indices

  • learningRate (float) – Learning rate or shrinkage rate

  • matrixType (str) – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

  • maxBin (int) – Max bin

  • maxBinByFeature (list) – Max number of bins for each feature

  • maxCatThreshold (int) – limit number of split points considered for categorical features

  • maxCatToOnehot (int) – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

  • maxDeltaStep (float) – Used to limit the max output of tree leaves

  • maxDepth (int) – Max depth

  • maxDrop (int) – Max number of dropped trees during one boosting iteration

  • maxNumClasses (int) – Number of max classes to infer numClass in multi-class classification.

  • maxPosition (int) – optimized NDCG at this position

  • maxStreamingOMPThreads (int) – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

  • metric (str) – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

  • microBatchSize (int) – Specify how many elements are sent in a streaming micro-batch.

  • minDataInLeaf (int) – Minimal number of data in one leaf. Can be used to deal with over-fitting.

  • minDataPerBin (int) – Minimal number of data inside one bin

  • minDataPerGroup (int) – minimal number of data per categorical group

  • minGainToSplit (float) – The minimal gain to perform split

  • minSumHessianInLeaf (float) – Minimal sum hessian in one leaf

  • modelString (str) – LightGBM model to retrain

  • monotoneConstraints (list) – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.

  • monotoneConstraintsMethod (str) – Monotone constraints method. basic, intermediate, or advanced.

  • monotonePenalty (float) – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

  • negBaggingFraction (float) – Negative Bagging fraction

  • numBatches (int) – If greater than 0, splits data into separate batches during training

  • numIterations (int) – Number of iterations, LightGBM constructs num_class * num_iterations trees

  • numLeaves (int) – Number of leaves

  • numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

  • numThreads (int) – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

  • objective (str) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

  • objectiveSeed (int) – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

  • otherRate (float) – The retain ratio of small gradient data. Only used in goss.

  • parallelism (str) – Tree learner parallelism, can be set to data_parallel or voting_parallel

  • passThroughArgs (str) – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

  • posBaggingFraction (float) – Positive Bagging fraction

  • predictDisableShapeCheck (bool) – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

  • predictionCol (str) – prediction column name

  • referenceDataset (list) – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

  • repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.

  • samplingMode (str) – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data; ‘subset’: sample from the first N rows; ‘fixed’: take the first N rows as the sample. Values can be global, subset, or fixed. Default is subset.

  • samplingSubsetSize (int) – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

  • seed (int) – Main seed, used to generate other seeds

  • skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration

  • slotNames (list) – List of slot names in the features column

  • timeout (float) – Timeout in seconds

  • topK (int) – The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0

  • topRate (float) – The retain ratio of large gradient data. Only used in goss.

  • uniformDrop (bool) – Set this to true to use uniform drop in dart mode

  • useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.

  • useMissing (bool) – Set this to false to disable the special handling of missing values

  • useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

  • validationIndicatorCol (str) – Indicates whether the row is for training or validation

  • verbosity (int) – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug

  • weightCol (str) – The name of the weight column

  • xGBoostDartMode (bool) – Set this to true to use xgboost dart mode

  • zeroAsMissing (bool) – Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
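
A minimal sketch of how the parameters above might be passed to the constructor. The column names `query_id`, `relevance`, and `features` are hypothetical placeholders, and the values are illustrative rather than recommended settings. The import is deferred into a helper (also hypothetical) so the parameter dictionary can be inspected without a Spark session.

```python
# Keyword arguments drawn from the parameter list above.
# Column names are hypothetical placeholders for this sketch.
ranker_params = {
    "objective": "lambdarank",   # default objective in the class signature
    "groupCol": "query_id",      # rows sharing a value form one ranking group
    "labelCol": "relevance",     # graded relevance label
    "featuresCol": "features",   # assembled feature vector column
    "evalAt": [1, 3, 5],         # NDCG/MAP evaluation positions
    "numIterations": 100,
    "numLeaves": 31,
    "learningRate": 0.1,
}


def build_ranker(params):
    """Construct the estimator; needs SynapseML installed on a Spark cluster."""
    from synapse.ml.lightgbm import LightGBMRanker

    return LightGBMRanker(**params)
```

With SynapseML available, `build_ranker(ranker_params).fit(train_df)` would train on a DataFrame `train_df` (assumed here) that contains the three columns named above.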

baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
catSmooth = Param(parent='undefined', name='catSmooth', doc='this can reduce the effect of noises in categorical features, especially for categories with few data')
categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
catl2 = Param(parent='undefined', name='catl2', doc='L2 regularization in categorical split')
chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native.  If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.')
dataRandomSeed = Param(parent='undefined', name='dataRandomSeed', doc='Random seed for sampling data to construct histogram bins.')
dataTransferMode = Param(parent='undefined', name='dataTransferMode', doc='Specify how SynapseML transfers data from Spark to LightGBM.  Values can be streaming, bulk. Default is bulk, which is the legacy mode.')
defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
deterministic = Param(parent='undefined', name='deterministic', doc='Used only with the CPU device type. Setting this to true should ensure stable results when using the same data and the same parameters.  Note: setting this to true may slow down training.  To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true.')
driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
dropSeed = Param(parent='undefined', name='dropSeed', doc='Random seed to choose dropping models. Only used in dart.')
earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
evalAt = Param(parent='undefined', name='evalAt', doc='NDCG and MAP evaluation positions, separated by comma')
executionMode = Param(parent='undefined', name='executionMode', doc='Deprecated. Please use dataTransferMode.')
extraSeed = Param(parent='undefined', name='extraSeed', doc='Random seed for selecting threshold when extra_trees is true')
featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
featureFractionByNode = Param(parent='undefined', name='featureFractionByNode', doc='Feature fraction by node')
featureFractionSeed = Param(parent='undefined', name='featureFractionSeed', doc='Feature fraction seed')
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
getBaggingFraction()[source]
Returns

Bagging fraction

Return type

baggingFraction

getBaggingFreq()[source]
Returns

Bagging frequency

Return type

baggingFreq

getBaggingSeed()[source]
Returns

Bagging seed

Return type

baggingSeed

getBinSampleCount()[source]
Returns

Number of samples considered at computing histogram bins

Return type

binSampleCount

getBoostFromAverage()[source]
Returns

Adjusts initial score to the mean of labels for faster convergence

Return type

boostFromAverage

getBoostingType()[source]
Returns

Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

Return type

boostingType

getCatSmooth()[source]
Returns

this can reduce the effect of noises in categorical features, especially for categories with few data

Return type

catSmooth

getCategoricalSlotIndexes()[source]
Returns

List of categorical column indexes, the slot index in the features column

Return type

categoricalSlotIndexes

getCategoricalSlotNames()[source]
Returns

List of categorical column slot names, the slot name in the features column

Return type

categoricalSlotNames

getCatl2()[source]
Returns

L2 regularization in categorical split

Return type

catl2

getChunkSize()[source]
Returns

Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

Return type

chunkSize

getDataRandomSeed()[source]
Returns

Random seed for sampling data to construct histogram bins.

Return type

dataRandomSeed

getDataTransferMode()[source]
Returns

Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

Return type

dataTransferMode

getDefaultListenPort()[source]
Returns

The default listen port on executors, used for testing

Return type

defaultListenPort

getDeterministic()[source]
Returns

Used only with the CPU device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true.

Return type

deterministic

getDriverListenPort()[source]
Returns

The listen port on a driver. Default value is 0 (random)

Return type

driverListenPort

getDropRate()[source]
Returns

Dropout rate: a fraction of previous trees to drop during the dropout

Return type

dropRate

getDropSeed()[source]
Returns

Random seed to choose dropping models. Only used in dart.

Return type

dropSeed

getEarlyStoppingRound()[source]
Returns

Early stopping round

Return type

earlyStoppingRound

getEvalAt()[source]
Returns

NDCG and MAP evaluation positions, separated by comma

Return type

evalAt
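
To make the evaluation positions concrete, here is a small worked sketch of NDCG at several cutoffs. It is not the library's implementation. It assumes the common gain of 2^label − 1 when no explicit gain table is given and a log2 position discount. The function names and the sample labels are made up for illustration.

```python
import math


def _gains(labels, label_gain=None):
    """Map integer relevance labels to gains (2**label - 1 if no table is given)."""
    if label_gain:
        return [label_gain[label] for label in labels]
    return [2 ** label - 1 for label in labels]


def _dcg(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_labels, k, label_gain=None):
    """NDCG@k for labels listed in the order the model ranked them."""
    gains = _gains(ranked_labels, label_gain)
    ideal = _dcg(sorted(gains, reverse=True), k)
    return _dcg(gains, k) / ideal if ideal > 0 else 0.0


# Relevance labels of one query's documents, in predicted rank order.
ranked_labels = [3, 2, 3, 0, 1, 2]
scores = {k: ndcg_at_k(ranked_labels, k) for k in [1, 2, 3, 4, 5]}
```

Each key of `scores` corresponds to one entry of an `evalAt` list such as the default `[1, 2, 3, 4, 5]`.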

getExecutionMode()[source]
Returns

Deprecated. Please use dataTransferMode.

Return type

executionMode

getExtraSeed()[source]
Returns

Random seed for selecting threshold when extra_trees is true

Return type

extraSeed

getFeatureFraction()[source]
Returns

Feature fraction

Return type

featureFraction

getFeatureFractionByNode()[source]
Returns

Feature fraction by node

Return type

featureFractionByNode

getFeatureFractionSeed()[source]
Returns

Feature fraction seed

Return type

featureFractionSeed

getFeaturesCol()[source]
Returns

features column name

Return type

featuresCol

getFeaturesShapCol()[source]
Returns

Output SHAP vector column name after prediction containing the feature contribution values

Return type

featuresShapCol

getFobj()[source]
Returns

Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

Return type

fobj

getGroupCol()[source]
Returns

The name of the group column

Return type

groupCol

getImprovementTolerance()[source]
Returns

Tolerance to consider improvement in metric

Return type

improvementTolerance

getInitScoreCol()[source]
Returns

The name of the initial score column, used for continued training

Return type

initScoreCol

getIsEnableSparse()[source]
Returns

Used to enable/disable sparse optimization

Return type

isEnableSparse

getIsProvideTrainingMetric()[source]
Returns

Whether to output the metric result over the training dataset.

Return type

isProvideTrainingMetric

static getJavaPackage()[source]

Returns package name String.

getLabelCol()[source]
Returns

label column name

Return type

labelCol

getLabelGain()[source]
Returns

graded relevance for each label in NDCG

Return type

labelGain

getLambdaL1()[source]
Returns

L1 regularization

Return type

lambdaL1

getLambdaL2()[source]
Returns

L2 regularization

Return type

lambdaL2

getLeafPredictionCol()[source]
Returns

Column name for the predicted leaf indices

Return type

leafPredictionCol

getLearningRate()[source]
Returns

Learning rate or shrinkage rate

Return type

learningRate

getMatrixType()[source]
Returns

Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

Return type

matrixType

getMaxBin()[source]
Returns

Max bin

Return type

maxBin

getMaxBinByFeature()[source]
Returns

Max number of bins for each feature

Return type

maxBinByFeature

getMaxCatThreshold()[source]
Returns

limit number of split points considered for categorical features

Return type

maxCatThreshold

getMaxCatToOnehot()[source]
Returns

When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

Return type

maxCatToOnehot

getMaxDeltaStep()[source]
Returns

Used to limit the max output of tree leaves

Return type

maxDeltaStep

getMaxDepth()[source]
Returns

Max depth

Return type

maxDepth

getMaxDrop()[source]
Returns

Max number of dropped trees during one boosting iteration

Return type

maxDrop

getMaxNumClasses()[source]
Returns

Number of max classes to infer numClass in multi-class classification.

Return type

maxNumClasses

getMaxPosition()[source]
Returns

optimized NDCG at this position

Return type

maxPosition

getMaxStreamingOMPThreads()[source]
Returns

Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

Return type

maxStreamingOMPThreads

getMetric()[source]
Returns

Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

Return type

metric

getMicroBatchSize()[source]
Returns

Specify how many elements are sent in a streaming micro-batch.

Return type

microBatchSize

getMinDataInLeaf()[source]
Returns

Minimal number of data in one leaf. Can be used to deal with over-fitting.

Return type

minDataInLeaf

getMinDataPerBin()[source]
Returns

Minimal number of data inside one bin

Return type

minDataPerBin

getMinDataPerGroup()[source]
Returns

minimal number of data per categorical group

Return type

minDataPerGroup

getMinGainToSplit()[source]
Returns

The minimal gain to perform split

Return type

minGainToSplit

getMinSumHessianInLeaf()[source]
Returns

Minimal sum hessian in one leaf

Return type

minSumHessianInLeaf

getModelString()[source]
Returns

LightGBM model to retrain

Return type

modelString

getMonotoneConstraints()[source]
Returns

used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.

Return type

monotoneConstraints
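
Because the constraint list must cover all features in order, it can help to derive it from the feature names instead of writing it by hand. The helper and the feature names below are hypothetical and only illustrate the 1 / -1 / 0 encoding described above.

```python
def monotone_constraints(feature_names, increasing=(), decreasing=()):
    """Return one constraint per feature, in feature order.

    1 = prediction must not decrease with the feature,
    -1 = prediction must not increase with the feature,
    0 = unconstrained.
    """
    constraints = []
    for name in feature_names:
        if name in increasing:
            constraints.append(1)
        elif name in decreasing:
            constraints.append(-1)
        else:
            constraints.append(0)
    return constraints


# Hypothetical feature order of the assembled features column.
features = ["price", "rating", "distance_km"]
constraints = monotone_constraints(
    features, increasing={"rating"}, decreasing={"price"}
)
```

The resulting list could then be supplied as the `monotoneConstraints` argument, provided `features` matches the slot order of the features column.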

getMonotoneConstraintsMethod()[source]
Returns

Monotone constraints method. basic, intermediate, or advanced.

Return type

monotoneConstraintsMethod

getMonotonePenalty()[source]
Returns

A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

Return type

monotonePenalty

getNegBaggingFraction()[source]
Returns

Negative Bagging fraction

Return type

negBaggingFraction

getNumBatches()[source]
Returns

If greater than 0, splits data into separate batches during training

Return type

numBatches

getNumIterations()[source]
Returns

Number of iterations, LightGBM constructs num_class * num_iterations trees

Return type

numIterations

getNumLeaves()[source]
Returns

Number of leaves

Return type

numLeaves

getNumTasks()[source]
Returns

Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

Return type

numTasks

getNumThreads()[source]
Returns

Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

Return type

numThreads

getObjective()[source]
Returns

The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

Return type

objective

getObjectiveSeed()[source]
Returns

Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

Return type

objectiveSeed

getOtherRate()[source]
Returns

The retain ratio of small gradient data. Only used in goss.

Return type

otherRate

getParallelism()[source]
Returns

Tree learner parallelism, can be set to data_parallel or voting_parallel

Return type

parallelism

getPassThroughArgs()[source]
Returns

Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

Return type

passThroughArgs
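
A small sketch of assembling a pass-through string from a dictionary of native LightGBM parameters. The helper name is hypothetical, and joining the `key=value` pairs with spaces is an assumption based on the usual LightGBM parameter-string form; the description above only shows a single `force_row_wise=true` pair.

```python
def format_pass_through_args(native_params):
    """Render native parameters as 'key=value' pairs joined by spaces.

    Assumption: multiple parameters are separated by single spaces.
    Booleans are written in lowercase, matching 'force_row_wise=true'.
    """
    parts = []
    for key, value in native_params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)


args = format_pass_through_args({"force_row_wise": True, "min_data_in_bin": 5})
```

Since pass-through values override parameters given with explicit setters, it is probably safest to avoid setting the same parameter in both places.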

getPosBaggingFraction()[source]
Returns

Positive Bagging fraction

Return type

posBaggingFraction

getPredictDisableShapeCheck()[source]
Returns

control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

Return type

predictDisableShapeCheck

getPredictionCol()[source]
Returns

prediction column name

Return type

predictionCol

getReferenceDataset()[source]
Returns

The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

Return type

referenceDataset

getRepartitionByGroupingColumn()[source]
Returns

Repartition training data according to grouping column, on by default.

Return type

repartitionByGroupingColumn

getSamplingMode()[source]
Returns

Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data; ‘subset’: sample from the first N rows; ‘fixed’: take the first N rows as the sample. Values can be global, subset, or fixed. Default is subset.

Return type

samplingMode

getSamplingSubsetSize()[source]
Returns

Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

Return type

samplingSubsetSize
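
The 'subset' mode described above can be pictured with the following sketch, which draws `bin_sample_count` rows from the first `subset_size` rows of an in-memory list. It is only an illustration of the idea, not how SynapseML samples Spark partitions; the function name and the use of Python's `random` module are assumptions.

```python
import random


def sample_subset(rows, subset_size, bin_sample_count, seed=1):
    """Pick up to bin_sample_count rows from the first subset_size rows."""
    head = rows[:subset_size]
    if len(head) <= bin_sample_count:
        # Not enough rows to sample from: use the whole head.
        return list(head)
    rng = random.Random(seed)  # seeded for repeatable draws
    return rng.sample(head, bin_sample_count)


rows = list(range(1000))
sample = sample_subset(rows, subset_size=100, bin_sample_count=10)
```

Only the first 100 rows are ever candidates here, which is why this mode suits data whose row order is already random.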

getSeed()[source]
Returns

Main seed, used to generate other seeds

Return type

seed

getSkipDrop()[source]
Returns

Probability of skipping the dropout procedure during a boosting iteration

Return type

skipDrop

getSlotNames()[source]
Returns

List of slot names in the features column

Return type

slotNames

getTimeout()[source]
Returns

Timeout in seconds

Return type

timeout

getTopK()[source]
Returns

The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0

Return type

topK

getTopRate()[source]
Returns

The retain ratio of large gradient data. Only used in goss.

Return type

topRate
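
The roles of `topRate` and `otherRate` can be illustrated with a rough sketch of Gradient-based One-Side Sampling as it is commonly described: keep the rows with the largest absolute gradients, randomly sample from the rest, and up-weight the sampled rows to compensate. This is a simplified illustration with hypothetical names, not the code LightGBM runs.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by a GOSS-style sample."""
    n = len(gradients)
    top_n = int(n * top_rate)      # rows kept for their large gradients
    other_n = int(n * other_rate)  # rows sampled from the remainder

    # Row indices ordered by descending absolute gradient.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx, rest = order[:top_n], order[top_n:]

    weights = {i: 1.0 for i in top_idx}
    if other_n > 0 and rest:
        rng = random.Random(seed)
        sampled = rng.sample(rest, min(other_n, len(rest)))
        # Up-weight small-gradient rows so they stand in for the whole remainder.
        amplify = (1.0 - top_rate) / other_rate
        weights.update({i: amplify for i in sampled})
    return weights


gradients = [i / 100 for i in range(100)]
kept = goss_sample(gradients)
```

With the default rates, 30% of the rows are used per iteration: the top 20% at weight 1 and a random 10% at the amplified weight.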

getUniformDrop()[source]
Returns

Set this to true to use uniform drop in dart mode

Return type

uniformDrop

getUseBarrierExecutionMode()[source]
Returns

Barrier execution mode which uses a barrier stage, off by default.

Return type

useBarrierExecutionMode

getUseMissing()[source]
Returns

Set this to false to disable the special handling of missing values

Return type

useMissing

getUseSingleDatasetMode()[source]
Returns

Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

Return type

useSingleDatasetMode

getValidationIndicatorCol()[source]
Returns

Indicates whether the row is for training or validation

Return type

validationIndicatorCol

getVerbosity()[source]
Returns

Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug

Return type

verbosity
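
The verbosity thresholds above translate directly into a lookup. The helper below is hypothetical and only restates the documented mapping in code form.

```python
def verbosity_level(verbosity):
    """Map the integer verbosity setting to the log level it selects."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


default_level = verbosity_level(-1)  # -1 is the default in the class signature
```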

getWeightCol()[source]
Returns

The name of the weight column

Return type

weightCol

getXGBoostDartMode()[source]
Returns

Set this to true to use xgboost dart mode

Return type

xGBoostDartMode

getZeroAsMissing()[source]
Returns

Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values

Return type

zeroAsMissing

groupCol = Param(parent='undefined', name='groupCol', doc='The name of the group column')
improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
isEnableSparse = Param(parent='undefined', name='isEnableSparse', doc='Used to enable/disable sparse optimization')
isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether to output the metric result over the training dataset.')
labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
labelGain = Param(parent='undefined', name='labelGain', doc='graded relevance for each label in NDCG')
lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc='Column name for the predicted leaf indices')
learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense.  Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
maxCatThreshold = Param(parent='undefined', name='maxCatThreshold', doc='limit number of split points considered for categorical features')
maxCatToOnehot = Param(parent='undefined', name='maxCatToOnehot', doc='When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used')
maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
maxNumClasses = Param(parent='undefined', name='maxNumClasses', doc='Number of max classes to infer numClass in multi-class classification.')
maxPosition = Param(parent='undefined', name='maxPosition', doc='optimized NDCG at this position')
maxStreamingOMPThreads = Param(parent='undefined', name='maxStreamingOMPThreads', doc="Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it's best to set a fixed value.")
metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data.  Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
microBatchSize = Param(parent='undefined', name='microBatchSize', doc='Specify how many elements are sent in a streaming micro-batch.')
minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
minDataPerBin = Param(parent='undefined', name='minDataPerBin', doc='Minimal number of data inside one bin')
minDataPerGroup = Param(parent='undefined', name='minDataPerGroup', doc='minimal number of data per categorical group')
minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
monotoneConstraints = Param(parent='undefined', name='monotoneConstraints', doc='used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.')
monotoneConstraintsMethod = Param(parent='undefined', name='monotoneConstraintsMethod', doc='Monotone constraints method. basic, intermediate, or advanced.')
monotonePenalty = Param(parent='undefined', name='monotonePenalty', doc='A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.')
negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks.  SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.')
objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
objectiveSeed = Param(parent='undefined', name='objectiveSeed', doc='Random seed for objectives, if random process is needed.  Currently used only for rank_xendcg objective.')
otherRate = Param(parent='undefined', name='otherRate', doc='The retain ratio of small gradient data. Only used in goss.')
parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
passThroughArgs = Param(parent='undefined', name='passThroughArgs', doc='Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true')
posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
predictDisableShapeCheck = Param(parent='undefined', name='predictDisableShapeCheck', doc='control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data')
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
classmethod read()[source]

Returns an MLReader instance for this class.

referenceDataset = Param(parent='undefined', name='referenceDataset', doc='The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().')
repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
samplingMode = Param(parent='undefined', name='samplingMode', doc="Data sampling for streaming mode. Sampled data is used to define bins. 'global': sample from all data, 'subset': sample from first N rows, or 'fixed': Take first N rows as sample. Values can be global, subset, or fixed. Default is subset.")
samplingSubsetSize = Param(parent='undefined', name='samplingSubsetSize', doc="Specify subset size N for the sampling mode 'subset'. 'binSampleCount' rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.")
seed = Param(parent='undefined', name='seed', doc='Main seed, used to generate other seeds')
setBaggingFraction(value)[source]
Parameters

baggingFraction – Bagging fraction

setBaggingFreq(value)[source]
Parameters

baggingFreq – Bagging frequency
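
As a rough illustration of how the two bagging parameters interact: in LightGBM, bagging is active only when the frequency is greater than 0 and the fraction is below 1.0, and a fresh row subset is drawn every baggingFreq iterations. The sketch below mimics that schedule in plain Python. `bagging_schedule` is a hypothetical helper, not part of SynapseML, and the sampling itself is simplified.

```python
import random


def bagging_schedule(num_rows, num_iterations, bagging_fraction=1.0,
                     bagging_freq=0, bagging_seed=3):
    """Return, per iteration, the sorted row indices used for training.

    Hypothetical helper: mimics the schedule only, not LightGBM's sampler.
    """
    rng = random.Random(bagging_seed)
    all_rows = list(range(num_rows))
    enabled = bagging_freq > 0 and bagging_fraction < 1.0
    current = all_rows
    schedule = []
    for it in range(num_iterations):
        # Re-draw the subset only every `bagging_freq` iterations.
        if enabled and it % bagging_freq == 0:
            k = max(1, int(num_rows * bagging_fraction))
            current = sorted(rng.sample(all_rows, k))
        schedule.append(current)
    return schedule


bags = bagging_schedule(num_rows=10, num_iterations=4,
                        bagging_fraction=0.5, bagging_freq=2)
```

With the defaults (baggingFraction=1.0, baggingFreq=0) every iteration sees all rows.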

setBaggingSeed(value)[source]
Parameters

baggingSeed – Bagging seed

setBinSampleCount(value)[source]
Parameters

binSampleCount – Number of samples considered at computing histogram bins

setBoostFromAverage(value)[source]
Parameters

boostFromAverage – Adjusts initial score to the mean of labels for faster convergence

setBoostingType(value)[source]
Parameters

boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

setCatSmooth(value)[source]
Parameters

catSmooth – Reduces the effect of noise in categorical features, especially for categories with few data points

setCategoricalSlotIndexes(value)[source]
Parameters

categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column

setCategoricalSlotNames(value)[source]
Parameters

categoricalSlotNames – List of categorical column slot names, the slot name in the features column

setCatl2(value)[source]
Parameters

catl2 – L2 regularization in categorical split

setChunkSize(value)[source]
Parameters

chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.

setDataRandomSeed(value)[source]
Parameters

dataRandomSeed – Random seed for sampling data to construct histogram bins.

setDataTransferMode(value)[source]
Parameters

dataTransferMode – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

setDefaultListenPort(value)[source]
Parameters

defaultListenPort – The default listen port on executors, used for testing

setDeterministic(value)[source]
Parameters

deterministic – Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true.

setDriverListenPort(value)[source]
Parameters

driverListenPort – The listen port on a driver. Default value is 0 (random)

setDropRate(value)[source]
Parameters

dropRate – Dropout rate: a fraction of previous trees to drop during the dropout

setDropSeed(value)[source]
Parameters

dropSeed – Random seed to choose dropping models. Only used in dart.

setEarlyStoppingRound(value)[source]
Parameters

earlyStoppingRound – Early stopping round

setEvalAt(value)[source]
Parameters

evalAt – NDCG and MAP evaluation positions, separated by comma

setExecutionMode(value)[source]
Parameters

executionMode – Deprecated. Please use dataTransferMode.

setExtraSeed(value)[source]
Parameters

extraSeed – Random seed for selecting threshold when extra_trees is true

setFeatureFraction(value)[source]
Parameters

featureFraction – Feature fraction

setFeatureFractionByNode(value)[source]
Parameters

featureFractionByNode – Feature fraction by node

setFeatureFractionSeed(value)[source]
Parameters

featureFractionSeed – Feature fraction seed

setFeaturesCol(value)[source]
Parameters

featuresCol – features column name

setFeaturesShapCol(value)[source]
Parameters

featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values

setFobj(value)[source]
Parameters

fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

setGroupCol(value)[source]
Parameters

groupCol – The name of the group column

setImprovementTolerance(value)[source]
Parameters

improvementTolerance – Tolerance to consider improvement in metric
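
One plausible reading of how improvementTolerance combines with earlyStoppingRound is sketched below: a round counts as an improvement only if it beats the best score so far by more than the tolerance, and training stops after earlyStoppingRound consecutive rounds without one. This rule is an assumption for illustration, and `stopping_iteration` is a hypothetical helper rather than SynapseML code.

```python
def stopping_iteration(scores, early_stopping_round, improvement_tolerance=0.0,
                       higher_is_better=True):
    """Return the index at which training would stop, or None if it never does.

    Assumed rule: a round is an improvement only if it beats the best score
    so far by more than `improvement_tolerance`.
    """
    if early_stopping_round <= 0:
        return None  # early stopping disabled
    best = None
    rounds_without_improvement = 0
    for i, score in enumerate(scores):
        value = score if higher_is_better else -score
        if best is None or value - best > improvement_tolerance:
            best = value
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= early_stopping_round:
                return i
    return None


ndcg_per_round = [0.60, 0.65, 0.66, 0.66, 0.661, 0.70]
```

Under this reading, a tolerance of 0.005 makes the small gain at index 4 count as no improvement, so a patience of 2 stops training there, while a tolerance of 0.0 lets training continue.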

setInitScoreCol(value)[source]
Parameters

initScoreCol – The name of the initial score column, used for continued training

setIsEnableSparse(value)[source]
Parameters

isEnableSparse – Used to enable/disable sparse optimization

setIsProvideTrainingMetric(value)[source]
Parameters

isProvideTrainingMetric – Whether to output the metric result over the training dataset.

setLabelCol(value)[source]
Parameters

labelCol – label column name

setLabelGain(value)[source]
Parameters

labelGain – graded relevance for each label in NDCG
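
To make the role of labelGain concrete, the sketch below computes NDCG at a cutoff position for one query, looking up the gain of each integer relevance label in the list. When no gains are supplied it falls back to 2^label - 1, which is LightGBM's documented default. `ndcg_at_k` is a hypothetical helper written for this illustration.

```python
import math


def ndcg_at_k(labels_in_ranked_order, k, label_gain=None):
    """NDCG@k for one query; labels are integer relevance grades."""
    def gain(label):
        # Fall back to 2^label - 1 when no explicit gains are supplied.
        return label_gain[label] if label_gain else 2 ** label - 1

    def dcg(labels):
        return sum(gain(lab) / math.log2(pos + 2)
                   for pos, lab in enumerate(labels[:k]))

    ideal = dcg(sorted(labels_in_ranked_order, reverse=True))
    return dcg(labels_in_ranked_order) / ideal if ideal > 0 else 0.0


# Labels of three documents in the order the model ranked them.
ranked = [2, 0, 1]
scores = {k: ndcg_at_k(ranked, k) for k in [1, 2, 3]}
```

The keys of `scores` play the role of the evalAt positions: the same ranking can be perfect at position 1 and imperfect at position 2.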

setLambdaL1(value)[source]
Parameters

lambdaL1 – L1 regularization

setLambdaL2(value)[source]
Parameters

lambdaL2 – L2 regularization
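
A sketch of how L1 and L2 regularization typically enter a gradient-boosted leaf value: the gradient sum is soft-thresholded by the L1 term and the Hessian sum is padded by the L2 term. This is the textbook formula rather than a transcription of LightGBM's code, and `leaf_output` is a hypothetical name.

```python
def leaf_output(sum_gradients, sum_hessians, lambda_l1=0.0, lambda_l2=0.0):
    """Regularized leaf value: -soft_threshold(G, l1) / (H + l2)."""
    # L1 shrinks the gradient sum toward zero and can zero it out entirely.
    magnitude = max(abs(sum_gradients) - lambda_l1, 0.0)
    sign = 1.0 if sum_gradients > 0 else -1.0
    # L2 enlarges the denominator, pulling the leaf value toward zero.
    return -sign * magnitude / (sum_hessians + lambda_l2)
```

For example, with G = -4 and H = 2 the unregularized value is 2.0; adding lambda_l2 = 2 halves it, and lambda_l1 = 1 reduces it to 1.5.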

setLeafPredictionCol(value)[source]
Parameters

leafPredictionCol – Name of the column for predicted leaf indices

setLearningRate(value)[source]
Parameters

learningRate – Learning rate or shrinkage rate

setMatrixType(value)[source]
Parameters

matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

setMaxBin(value)[source]
Parameters

maxBin – Max bin

setMaxBinByFeature(value)[source]
Parameters

maxBinByFeature – Max number of bins for each feature

setMaxCatThreshold(value)[source]
Parameters

maxCatThreshold – limit number of split points considered for categorical features

setMaxCatToOnehot(value)[source]
Parameters

maxCatToOnehot – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

setMaxDeltaStep(value)[source]
Parameters

maxDeltaStep – Used to limit the max output of tree leaves

setMaxDepth(value)[source]
Parameters

maxDepth – Max depth

setMaxDrop(value)[source]
Parameters

maxDrop – Max number of dropped trees during one boosting iteration

setMaxNumClasses(value)[source]
Parameters

maxNumClasses – Number of max classes to infer numClass in multi-class classification.

setMaxPosition(value)[source]
Parameters

maxPosition – optimized NDCG at this position

setMaxStreamingOMPThreads(value)[source]
Parameters

maxStreamingOMPThreads – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

setMetric(value)[source]
Parameters

metric – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
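
The alias lists in the description above can be read as a lookup table. The sketch below encodes them and maps an alias to its primary metric name. LightGBM resolves aliases itself, so `canonical_metric` is a hypothetical helper that is only useful for normalizing names in user code such as logging or config validation.

```python
# Primary metric name -> aliases, copied from the parameter description.
METRIC_ALIASES = {
    "None": ["na", "null", "custom"],
    "l1": ["mean_absolute_error", "mae", "regression_l1"],
    "l2": ["mean_squared_error", "mse", "regression_l2", "regression"],
    "rmse": ["root_mean_squared_error", "l2_root"],
    "mape": ["mean_absolute_percentage_error"],
    "ndcg": ["lambdarank"],
    "map": ["mean_average_precision"],
    "binary_logloss": ["binary"],
    "multi_logloss": ["multiclass", "softmax", "multiclassova",
                      "multiclass_ova", "ova", "ovr"],
    "cross_entropy": ["xentropy"],
    "cross_entropy_lambda": ["xentlambda"],
    "kullback_leibler": ["kldiv"],
}

ALIAS_TO_METRIC = {alias: name
                   for name, aliases in METRIC_ALIASES.items()
                   for alias in aliases}


def canonical_metric(name):
    """Map an alias to its primary metric name; other names pass through."""
    return ALIAS_TO_METRIC.get(name, name)
```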

setMicroBatchSize(value)[source]
Parameters

microBatchSize – Specify how many elements are sent in a streaming micro-batch.

setMinDataInLeaf(value)[source]
Parameters

minDataInLeaf – Minimal number of data in one leaf. Can be used to deal with over-fitting.

setMinDataPerBin(value)[source]
Parameters

minDataPerBin – Minimal number of data inside one bin

setMinDataPerGroup(value)[source]
Parameters

minDataPerGroup – minimal number of data per categorical group

setMinGainToSplit(value)[source]
Parameters

minGainToSplit – The minimal gain to perform split

setMinSumHessianInLeaf(value)[source]
Parameters

minSumHessianInLeaf – Minimal sum hessian in one leaf

setModelString(value)[source]
Parameters

modelString – LightGBM model to retrain

setMonotoneConstraints(value)[source]
Parameters

monotoneConstraints – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
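
Because the list must cover all features in order, a small pre-flight check can catch mistakes before training. The sketch below validates a constraint list and tests whether a sequence of outputs obeys one constraint. Both helpers are hypothetical and independent of SynapseML.

```python
def check_monotone_constraints(constraints, num_features):
    """Validate a monotoneConstraints list against the feature count."""
    if len(constraints) != num_features:
        raise ValueError(
            f"expected {num_features} constraints, got {len(constraints)}")
    bad = [c for c in constraints if c not in (-1, 0, 1)]
    if bad:
        raise ValueError(f"constraints must be -1, 0 or 1, got {bad}")
    return list(constraints)


def respects(constraint, outputs):
    """True if outputs, listed for increasing feature values, obey the constraint."""
    pairs = list(zip(outputs, outputs[1:]))
    if constraint == 1:
        return all(a <= b for a, b in pairs)
    if constraint == -1:
        return all(a >= b for a, b in pairs)
    return True  # 0 means unconstrained


constraints = check_monotone_constraints([1, 0, -1], num_features=3)
```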

setMonotoneConstraintsMethod(value)[source]
Parameters

monotoneConstraintsMethod – Monotone constraints method. basic, intermediate, or advanced.

setMonotonePenalty(value)[source]
Parameters

monotonePenalty – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

setNegBaggingFraction(value)[source]
Parameters

negBaggingFraction – Negative Bagging fraction

setNumBatches(value)[source]
Parameters

numBatches – If greater than 0, splits data into separate batches during training

setNumIterations(value)[source]
Parameters

numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
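
The tree count stated above is a simple product. As a quick arithmetic sketch (the helper name is hypothetical), a ranking or regression model has a single output, so num_class is 1 and the number of trees equals the number of iterations:

```python
def total_trees(num_iterations, num_class=1):
    """Number of trees built: one per class per boosting iteration."""
    return num_class * num_iterations
```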

setNumLeaves(value)[source]
Parameters

numLeaves – Number of leaves

setNumTasks(value)[source]
Parameters

numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

setNumThreads(value)[source]
Parameters

numThreads – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

setObjective(value)[source]
Parameters

objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

setObjectiveSeed(value)[source]
Parameters

objectiveSeed – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

setOtherRate(value)[source]
Parameters

otherRate – The retain ratio of small gradient data. Only used in goss.

setParallelism(value)[source]
Parameters

parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel

setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='bulk', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxPosition=20, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Set the (keyword only) parameters

setPassThroughArgs(value)[source]
Parameters

passThroughArgs – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
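
The override behaviour described above can be pictured as a dictionary merge in which the pass-through string wins. The sketch below assumes the string holds whitespace-separated key=value tokens, as in LightGBM's native parameter strings. That separator is an assumption, and `effective_params` is a hypothetical helper, not a SynapseML function.

```python
def effective_params(explicit_params, pass_through_args):
    """Merge explicit params with a pass-through string; the string wins.

    Assumes whitespace-separated key=value tokens.
    """
    merged = dict(explicit_params)
    for token in pass_through_args.split():
        key, _, value = token.partition("=")
        merged[key] = value  # overrides any explicitly set value
    return merged


params = effective_params({"num_leaves": "31", "force_row_wise": "false"},
                          "force_row_wise=true num_leaves=63")
```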

setPosBaggingFraction(value)[source]
Parameters

posBaggingFraction – Positive Bagging fraction

setPredictDisableShapeCheck(value)[source]
Parameters

predictDisableShapeCheck – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

setPredictionCol(value)[source]
Parameters

predictionCol – prediction column name

setReferenceDataset(value)[source]
Parameters

referenceDataset – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

setRepartitionByGroupingColumn(value)[source]
Parameters

repartitionByGroupingColumn – Repartition training data according to grouping column, on by default.

setSamplingMode(value)[source]
Parameters

samplingMode – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: Take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

setSamplingSubsetSize(value)[source]
Parameters

samplingSubsetSize – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
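
The three sampling modes can be contrasted with a small in-memory sketch. It covers only the modes listed above and assumes that 'fixed' takes the first binSampleCount rows without randomness, which the description does not spell out. `sample_for_binning` is a hypothetical helper operating on a Python list, not on a Spark DataFrame.

```python
import random


def sample_for_binning(rows, sampling_mode="subset", bin_sample_count=200000,
                       sampling_subset_size=1000000, seed=1):
    """Pick the rows used to define histogram bins."""
    rng = random.Random(seed)
    if sampling_mode == "global":
        pool = rows  # sample from all data
    elif sampling_mode == "subset":
        pool = rows[:sampling_subset_size]  # sample from the first N rows
    elif sampling_mode == "fixed":
        # Assumption: take the first bin_sample_count rows as-is.
        return list(rows[:bin_sample_count])
    else:
        raise ValueError(f"unknown samplingMode: {sampling_mode}")
    k = min(bin_sample_count, len(pool))
    return rng.sample(pool, k)


rows = list(range(100))
subset_sample = sample_for_binning(rows, "subset", bin_sample_count=5,
                                   sampling_subset_size=20)
```

The 'subset' mode only represents the whole dataset if the first N rows are themselves in random order, which is why the description recommends it when rows are expected to be random.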

setSeed(value)[source]
Parameters

seed – Main seed, used to generate other seeds

setSkipDrop(value)[source]
Parameters

skipDrop – Probability of skipping the dropout procedure during a boosting iteration
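
The DART parameters (skipDrop, dropRate, maxDrop, dropSeed) can be pictured with the simplified sketch below: with probability skipDrop the iteration drops nothing, otherwise each existing tree is dropped independently with probability dropRate, and the result is capped at maxDrop. This is not LightGBM's exact procedure (it ignores tree weighting and the xgboost dart mode), and `select_dropped_trees` is a hypothetical helper.

```python
import random


def select_dropped_trees(num_trees, drop_rate=0.1, max_drop=50,
                         skip_drop=0.5, drop_seed=4):
    """Return indices of trees dropped in one boosting iteration (simplified)."""
    rng = random.Random(drop_seed)
    if rng.random() < skip_drop:
        return []  # dropout skipped for this iteration
    dropped = [t for t in range(num_trees) if rng.random() < drop_rate]
    if max_drop > 0:
        # A non-positive max_drop is treated as "no limit".
        dropped = dropped[:max_drop]
    return dropped
```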

setSlotNames(value)[source]
Parameters

slotNames – List of slot names in the features column

setTimeout(value)[source]
Parameters

timeout – Timeout in seconds

setTopK(value)[source]
Parameters

topK – The top_k value used in voting parallel. A larger value gives more accurate results but slows down training. Must be greater than 0.

setTopRate(value)[source]
Parameters

topRate – The retain ratio of large gradient data. Only used in goss.
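
How topRate and otherRate work together in goss can be sketched as follows: rows with the largest absolute gradients are always kept, a random fraction of the remaining rows is kept as well, and those sampled rows are up-weighted by (1 - topRate) / otherRate to compensate for the rows that were discarded. The code is an illustrative, list-based version of that idea. `goss_sample` is a hypothetical helper, not the library's implementation.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by a GOSS-style sampler."""
    n = len(gradients)
    top_k = int(n * top_rate)
    other_k = int(n * other_rate)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    kept = {i: 1.0 for i in order[:top_k]}
    rest = order[top_k:]
    if other_k > 0:
        rng = random.Random(seed)
        # Up-weight sampled small-gradient rows to compensate for the dropped ones.
        amplify = (1.0 - top_rate) / other_rate
        for i in rng.sample(rest, min(other_k, len(rest))):
            kept[i] = amplify
    return kept


grads = [0.1 * i for i in range(20)]
sample = goss_sample(grads)
```

With 20 rows and the default rates, 4 large-gradient rows are kept at weight 1 and 2 of the other 16 rows are kept at a weight of about 8.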

setUniformDrop(value)[source]
Parameters

uniformDrop – Set this to true to use uniform drop in dart mode

setUseBarrierExecutionMode(value)[source]
Parameters

useBarrierExecutionMode – Barrier execution mode which uses a barrier stage, off by default.

setUseMissing(value)[source]
Parameters

useMissing – Set this to false to disable the special handling of missing values

setUseSingleDatasetMode(value)[source]
Parameters

useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

setValidationIndicatorCol(value)[source]
Parameters

validationIndicatorCol – Indicates whether the row is for training or validation

setVerbosity(value)[source]
Parameters

verbosity – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
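
The mapping from verbosity values to log levels described above can be written out as a small function. `log_level` is a hypothetical helper for illustration only.

```python
def log_level(verbosity):
    """Translate a verbosity value into the log level named in the docs."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"
```

The default of -1 therefore corresponds to Fatal, the quietest setting.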

setWeightCol(value)[source]
Parameters

weightCol – The name of the weight column

setXGBoostDartMode(value)[source]
Parameters

xGBoostDartMode – Set this to true to use xgboost dart mode

setZeroAsMissing(value)[source]
Parameters

zeroAsMissing – Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values

skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
topK = Param(parent='undefined', name='topK', doc='The top_k value used in voting parallel. A larger value gives more accurate results but slows down training. Must be greater than 0.')
topRate = Param(parent='undefined', name='topRate', doc='The retain ratio of large gradient data. Only used in goss.')
uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
useMissing = Param(parent='undefined', name='useMissing', doc='Set this to false to disable the special handling of missing values')
useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.')
validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
xGBoostDartMode = Param(parent='undefined', name='xGBoostDartMode', doc='Set this to true to use xgboost dart mode')
zeroAsMissing = Param(parent='undefined', name='zeroAsMissing', doc='Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values')

synapse.ml.lightgbm.LightGBMRankerModel module

class synapse.ml.lightgbm.LightGBMRankerModel.LightGBMRankerModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=- 1, predictDisableShapeCheck=False, predictionCol='prediction', startIteration=0)[source]

Bases: synapse.ml.lightgbm.mixin.LightGBMModelMixin, synapse.ml.lightgbm._LightGBMRankerModel._LightGBMRankerModel

getBoosterNumClasses()[source]

Get the number of classes from the booster.

Returns

The number of classes.

static loadNativeModelFromFile(filename)[source]

Load the model from a native LightGBM text file.

static loadNativeModelFromString(model)[source]

Load the model from a native LightGBM model string.

synapse.ml.lightgbm.LightGBMRegressionModel module

class synapse.ml.lightgbm.LightGBMRegressionModel.LightGBMRegressionModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=- 1, predictDisableShapeCheck=False, predictionCol='prediction', startIteration=0)[source]

Bases: synapse.ml.lightgbm.mixin.LightGBMModelMixin, synapse.ml.lightgbm._LightGBMRegressionModel._LightGBMRegressionModel

static loadNativeModelFromFile(filename)[source]

Load the model from a native LightGBM text file.

static loadNativeModelFromString(model)[source]

Load the model from a native LightGBM model string.

synapse.ml.lightgbm.LightGBMRegressor module

class synapse.ml.lightgbm.LightGBMRegressor.LightGBMRegressor(java_obj=None, alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='bulk', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters
  • alpha (float) – parameter for Huber loss and Quantile regression

  • baggingFraction (float) – Bagging fraction

  • baggingFreq (int) – Bagging frequency

  • baggingSeed (int) – Bagging seed

  • binSampleCount (int) – Number of samples considered at computing histogram bins

  • boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence

  • boostingType (str) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

  • catSmooth (float) – Reduces the effect of noise in categorical features, especially for categories with few data points

  • categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column

  • categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column

  • catl2 (float) – L2 regularization in categorical split

  • chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.

  • dataRandomSeed (int) – Random seed for sampling data to construct histogram bins.

  • dataTransferMode (str) – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

  • defaultListenPort (int) – The default listen port on executors, used for testing

  • deterministic (bool) – Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true.

  • driverListenPort (int) – The listen port on a driver. Default value is 0 (random)

  • dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout

  • dropSeed (int) – Random seed to choose dropping models. Only used in dart.

  • earlyStoppingRound (int) – Early stopping round

  • executionMode (str) – Deprecated. Please use dataTransferMode.

  • extraSeed (int) – Random seed for selecting threshold when extra_trees is true

  • featureFraction (float) – Feature fraction

  • featureFractionByNode (float) – Feature fraction by node

  • featureFractionSeed (int) – Feature fraction seed

  • featuresCol (str) – features column name

  • featuresShapCol (str) – Output SHAP vector column name after prediction containing the feature contribution values

  • fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

  • improvementTolerance (float) – Tolerance to consider improvement in metric

  • initScoreCol (str) – The name of the initial score column, used for continued training

  • isEnableSparse (bool) – Used to enable/disable sparse optimization

  • isProvideTrainingMetric (bool) – Whether to output the metric result over the training dataset.

  • labelCol (str) – label column name

  • lambdaL1 (float) – L1 regularization

  • lambdaL2 (float) – L2 regularization

  • leafPredictionCol (str) – Column name for the predicted leaf indices

  • learningRate (float) – Learning rate or shrinkage rate

  • matrixType (str) – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

  • maxBin (int) – Max bin

  • maxBinByFeature (list) – Max number of bins for each feature

  • maxCatThreshold (int) – limit number of split points considered for categorical features

  • maxCatToOnehot (int) – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

  • maxDeltaStep (float) – Used to limit the max output of tree leaves

  • maxDepth (int) – Max depth

  • maxDrop (int) – Max number of dropped trees during one boosting iteration

  • maxNumClasses (int) – Number of max classes to infer numClass in multi-class classification.

  • maxStreamingOMPThreads (int) – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

  • metric (str) – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

  • microBatchSize (int) – Specify how many elements are sent in a streaming micro-batch.

  • minDataInLeaf (int) – Minimal number of data in one leaf. Can be used to deal with over-fitting.

  • minDataPerBin (int) – Minimal number of data inside one bin

  • minDataPerGroup (int) – minimal number of data per categorical group

  • minGainToSplit (float) – The minimal gain to perform split

  • minSumHessianInLeaf (float) – Minimal sum hessian in one leaf

  • modelString (str) – LightGBM model to retrain

  • monotoneConstraints (list) – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.

  • monotoneConstraintsMethod (str) – Monotone constraints method. basic, intermediate, or advanced.

  • monotonePenalty (float) – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

  • negBaggingFraction (float) – Negative Bagging fraction

  • numBatches (int) – If greater than 0, splits data into separate batches during training

  • numIterations (int) – Number of iterations, LightGBM constructs num_class * num_iterations trees

  • numLeaves (int) – Number of leaves

  • numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

  • numThreads (int) – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

  • objective (str) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

  • objectiveSeed (int) – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

  • otherRate (float) – The retain ratio of small gradient data. Only used in goss.

  • parallelism (str) – Tree learner parallelism, can be set to data_parallel or voting_parallel

  • passThroughArgs (str) – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

  • posBaggingFraction (float) – Positive Bagging fraction

  • predictDisableShapeCheck (bool) – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

  • predictionCol (str) – prediction column name

  • referenceDataset (list) – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

  • repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.

  • samplingMode (str) – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

  • samplingSubsetSize (int) – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

  • seed (int) – Main seed, used to generate other seeds

  • skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration

  • slotNames (list) – List of slot names in the features column

  • timeout (float) – Timeout in seconds

  • topK (int) – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0.

  • topRate (float) – The retain ratio of large gradient data. Only used in goss.

  • tweedieVariancePower (float) – control the variance of tweedie distribution, must be between 1 and 2

  • uniformDrop (bool) – Set this to true to use uniform drop in dart mode

  • useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.

  • useMissing (bool) – Set this to false to disable the special handling of missing values

  • useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

  • validationIndicatorCol (str) – Indicates whether the row is for training or validation

  • verbosity (int) – Verbosity level: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug

  • weightCol (str) – The name of the weight column

  • xGBoostDartMode (bool) – Set this to true to use xgboost dart mode

  • zeroAsMissing (bool) – Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
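
As a rough illustration of how the parameters listed above fit together, the sketch below collects a few of them into a plain dictionary. The keys are taken from the constructor signature; the values are arbitrary placeholders chosen for illustration, not recommendations. Unpacking the dictionary as LightGBMClassifier(**params) is an assumption based on each key being a documented keyword argument, and the import is left out so that the sketch runs without Spark.

```python
# Illustrative only: the values are placeholders, not tuned settings.
params = {
    "objective": "binary",
    "featuresCol": "features",
    "labelCol": "label",
    "numIterations": 100,
    "numLeaves": 31,
    "learningRate": 0.1,
    "dataTransferMode": "streaming",  # documented values: streaming, bulk
    "verbosity": -1,
}

# Assumption: with SynapseML installed, this dictionary could be unpacked as
# LightGBMClassifier(**params). Here the settings are only printed.
for name, value in sorted(params.items()):
    print(f"{name} = {value!r}")
```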

alpha = Param(parent='undefined', name='alpha', doc='parameter for Huber loss and Quantile regression')
baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
catSmooth = Param(parent='undefined', name='catSmooth', doc='this can reduce the effect of noises in categorical features, especially for categories with few data')
categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
catl2 = Param(parent='undefined', name='catl2', doc='L2 regularization in categorical split')
chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native.  If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.')
dataRandomSeed = Param(parent='undefined', name='dataRandomSeed', doc='Random seed for sampling data to construct histogram bins.')
dataTransferMode = Param(parent='undefined', name='dataTransferMode', doc='Specify how SynapseML transfers data from Spark to LightGBM.  Values can be streaming, bulk. Default is bulk, which is the legacy mode.')
defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
deterministic = Param(parent='undefined', name='deterministic', doc='Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters.  Note: setting this to true may slow down training.  To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true')
driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
dropSeed = Param(parent='undefined', name='dropSeed', doc='Random seed to choose dropping models. Only used in dart.')
earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
executionMode = Param(parent='undefined', name='executionMode', doc='Deprecated. Please use dataTransferMode.')
extraSeed = Param(parent='undefined', name='extraSeed', doc='Random seed for selecting threshold when extra_trees is true')
featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
featureFractionByNode = Param(parent='undefined', name='featureFractionByNode', doc='Feature fraction by node')
featureFractionSeed = Param(parent='undefined', name='featureFractionSeed', doc='Feature fraction seed')
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
getAlpha()[source]
Returns

parameter for Huber loss and Quantile regression

Return type

alpha

getBaggingFraction()[source]
Returns

Bagging fraction

Return type

baggingFraction

getBaggingFreq()[source]
Returns

Bagging frequency

Return type

baggingFreq

getBaggingSeed()[source]
Returns

Bagging seed

Return type

baggingSeed
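
The three bagging getters above are terse, so here is a minimal sketch of how the settings are commonly understood to interact in LightGBM: a fresh random subset of rows is drawn every baggingFreq iterations, and a frequency of 0 disables bagging. The function name and the pure-Python sampling are hypothetical stand-ins for illustration, not the library's implementation.

```python
import random


def bagged_rows(num_rows, num_iterations, bagging_fraction, bagging_freq, bagging_seed):
    """Return the row indices used at each iteration (simplified sketch)."""
    rng = random.Random(bagging_seed)
    all_rows = list(range(num_rows))
    current = all_rows
    schedule = []
    for iteration in range(num_iterations):
        bagging_on = bagging_freq > 0 and bagging_fraction < 1.0
        # Re-draw the subset only on iterations that are multiples of bagging_freq.
        if bagging_on and iteration % bagging_freq == 0:
            k = max(1, int(num_rows * bagging_fraction))
            current = sorted(rng.sample(all_rows, k))
        schedule.append(current)
    return schedule


schedule = bagged_rows(num_rows=10, num_iterations=4, bagging_fraction=0.5,
                       bagging_freq=2, bagging_seed=3)
for iteration, rows in enumerate(schedule):
    print(iteration, rows)
```

With the defaults shown in the constructor signature (baggingFraction=1.0, baggingFreq=0), this sketch uses every row at every iteration.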

getBinSampleCount()[source]
Returns

Number of samples considered at computing histogram bins

Return type

binSampleCount

getBoostFromAverage()[source]
Returns

Adjusts initial score to the mean of labels for faster convergence

Return type

boostFromAverage

getBoostingType()[source]
Returns

Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

Return type

boostingType

getCatSmooth()[source]
Returns

this can reduce the effect of noises in categorical features, especially for categories with few data

Return type

catSmooth

getCategoricalSlotIndexes()[source]
Returns

List of categorical column indexes, the slot index in the features column

Return type

categoricalSlotIndexes

getCategoricalSlotNames()[source]
Returns

List of categorical column slot names, the slot name in the features column

Return type

categoricalSlotNames

getCatl2()[source]
Returns

L2 regularization in categorical split

Return type

catl2

getChunkSize()[source]
Returns

Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.

Return type

chunkSize

getDataRandomSeed()[source]
Returns

Random seed for sampling data to construct histogram bins.

Return type

dataRandomSeed

getDataTransferMode()[source]
Returns

Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

Return type

dataTransferMode

getDefaultListenPort()[source]
Returns

The default listen port on executors, used for testing

Return type

defaultListenPort

getDeterministic()[source]
Returns

Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true.

Return type

deterministic

getDriverListenPort()[source]
Returns

The listen port on a driver. Default value is 0 (random)

Return type

driverListenPort

getDropRate()[source]
Returns

Dropout rate: a fraction of previous trees to drop during the dropout

Return type

dropRate

getDropSeed()[source]
Returns

Random seed to choose dropping models. Only used in dart.

Return type

dropSeed

getEarlyStoppingRound()[source]
Returns

Early stopping round

Return type

earlyStoppingRound

getExecutionMode()[source]
Returns

Deprecated. Please use dataTransferMode.

Return type

executionMode

getExtraSeed()[source]
Returns

Random seed for selecting threshold when extra_trees is true

Return type

extraSeed

getFeatureFraction()[source]
Returns

Feature fraction

Return type

featureFraction

getFeatureFractionByNode()[source]
Returns

Feature fraction by node

Return type

featureFractionByNode

getFeatureFractionSeed()[source]
Returns

Feature fraction seed

Return type

featureFractionSeed

getFeaturesCol()[source]
Returns

features column name

Return type

featuresCol

getFeaturesShapCol()[source]
Returns

Output SHAP vector column name after prediction containing the feature contribution values

Return type

featuresShapCol

getFobj()[source]
Returns

Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

Return type

fobj
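
The fobj description above states only the contract: accept preds and train_data, return (grad, hess). A minimal sketch of a function with that shape, using binary log loss as the example objective, might look as follows. The LabeledData class and its get_label method are hypothetical stand-ins for whatever object the library passes as train_data, and preds is assumed to hold raw scores before the sigmoid.

```python
import math


class LabeledData:
    """Hypothetical stand-in for train_data; it only exposes the labels."""

    def __init__(self, labels):
        self._labels = list(labels)

    def get_label(self):
        return self._labels


def binary_logloss_objective(preds, train_data):
    """Return per-row gradient and hessian of log loss w.r.t. the raw score."""
    labels = train_data.get_label()
    grad, hess = [], []
    for raw_score, label in zip(preds, labels):
        p = 1.0 / (1.0 + math.exp(-raw_score))  # sigmoid of the raw score
        grad.append(p - label)                  # first derivative
        hess.append(p * (1.0 - p))              # second derivative
    return grad, hess


grad, hess = binary_logloss_objective([0.0, 0.0], LabeledData([1, 0]))
print(grad, hess)
```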

getImprovementTolerance()[source]
Returns

Tolerance to consider improvement in metric

Return type

improvementTolerance
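
To make the relationship between earlyStoppingRound and improvementTolerance concrete, the sketch below stops once the metric has failed to improve by more than the tolerance for the given number of rounds. This is one plausible reading of the two descriptions, not a confirmed account of the library's logic; the function name and the higher_is_better flag are hypothetical.

```python
def stopping_iteration(metric_values, early_stopping_round,
                       improvement_tolerance=0.0, higher_is_better=False):
    """Return the iteration at which training would stop, or None."""
    if early_stopping_round <= 0:
        return None  # early stopping disabled
    best = None
    best_iteration = 0
    for iteration, value in enumerate(metric_values):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = value - best > improvement_tolerance
        else:
            improved = best - value > improvement_tolerance
        if improved:
            best, best_iteration = value, iteration
        elif iteration - best_iteration >= early_stopping_round:
            return iteration
    return None


losses = [0.9, 0.8, 0.79, 0.795, 0.789, 0.788]
print(stopping_iteration(losses, early_stopping_round=2))
print(stopping_iteration(losses, early_stopping_round=2, improvement_tolerance=0.05))
```

In this sketch a larger tolerance makes small gains count as no improvement, so training stops earlier.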

getInitScoreCol()[source]
Returns

The name of the initial score column, used for continued training

Return type

initScoreCol

getIsEnableSparse()[source]
Returns

Used to enable/disable sparse optimization

Return type

isEnableSparse

getIsProvideTrainingMetric()[source]
Returns

Whether to output the metric result over the training dataset.

Return type

isProvideTrainingMetric

static getJavaPackage()[source]

Returns package name String.

getLabelCol()[source]
Returns

label column name

Return type

labelCol

getLambdaL1()[source]
Returns

L1 regularization

Return type

lambdaL1

getLambdaL2()[source]
Returns

L2 regularization

Return type

lambdaL2

getLeafPredictionCol()[source]
Returns

Column name for the predicted leaf indices

Return type

leafPredictionCol

getLearningRate()[source]
Returns

Learning rate or shrinkage rate

Return type

learningRate

getMatrixType()[source]
Returns

Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

Return type

matrixType

getMaxBin()[source]
Returns

Max bin

Return type

maxBin

getMaxBinByFeature()[source]
Returns

Max number of bins for each feature

Return type

maxBinByFeature

getMaxCatThreshold()[source]
Returns

limit number of split points considered for categorical features

Return type

maxCatThreshold

getMaxCatToOnehot()[source]
Returns

When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used

Return type

maxCatToOnehot

getMaxDeltaStep()[source]
Returns

Used to limit the max output of tree leaves

Return type

maxDeltaStep

getMaxDepth()[source]
Returns

Max depth

Return type

maxDepth

getMaxDrop()[source]
Returns

Max number of dropped trees during one boosting iteration

Return type

maxDrop

getMaxNumClasses()[source]
Returns

Number of max classes to infer numClass in multi-class classification.

Return type

maxNumClasses

getMaxStreamingOMPThreads()[source]
Returns

Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

Return type

maxStreamingOMPThreads

getMetric()[source]
Returns

Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.

Return type

metric

getMicroBatchSize()[source]
Returns

Specify how many elements are sent in a streaming micro-batch.

Return type

microBatchSize

getMinDataInLeaf()[source]
Returns

Minimal number of data in one leaf. Can be used to deal with over-fitting.

Return type

minDataInLeaf

getMinDataPerBin()[source]
Returns

Minimal number of data inside one bin

Return type

minDataPerBin

getMinDataPerGroup()[source]
Returns

minimal number of data per categorical group

Return type

minDataPerGroup

getMinGainToSplit()[source]
Returns

The minimal gain to perform split

Return type

minGainToSplit

getMinSumHessianInLeaf()[source]
Returns

Minimal sum hessian in one leaf

Return type

minSumHessianInLeaf

getModelString()[source]
Returns

LightGBM model to retrain

Return type

modelString

getMonotoneConstraints()[source]
Returns

used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.

Return type

monotoneConstraints
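
Because the constraint list must cover all features in order, it can be convenient to derive it from the feature names. The helper below is a small sketch of that idea; the function name and the example feature names are hypothetical, and unlisted features default to 0 (no constraint).

```python
def build_monotone_constraints(feature_names, directions):
    """Return one constraint per feature, in feature order.

    directions maps a feature name to 1 (increasing) or -1 (decreasing);
    features that are not listed get 0 (no constraint).
    """
    allowed = {-1, 0, 1}
    constraints = []
    for name in feature_names:
        direction = directions.get(name, 0)
        if direction not in allowed:
            raise ValueError(f"invalid constraint {direction!r} for {name!r}")
        constraints.append(direction)
    return constraints


constraints = build_monotone_constraints(
    ["age", "income", "zip"], {"age": 1, "zip": -1}
)
print(constraints)
```

The resulting list is the kind of value that the monotoneConstraints parameter expects.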

getMonotoneConstraintsMethod()[source]
Returns

Monotone constraints method. basic, intermediate, or advanced.

Return type

monotoneConstraintsMethod

getMonotonePenalty()[source]
Returns

A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.

Return type

monotonePenalty

getNegBaggingFraction()[source]
Returns

Negative Bagging fraction

Return type

negBaggingFraction

getNumBatches()[source]
Returns

If greater than 0, splits data into separate batches during training

Return type

numBatches

getNumIterations()[source]
Returns

Number of iterations, LightGBM constructs num_class * num_iterations trees

Return type

numIterations
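
The tree count stated above is a simple product, which the following sketch spells out for a multiclass model. The function name is hypothetical.

```python
def total_trees(num_class, num_iterations):
    """Number of trees built: one tree per class per boosting iteration."""
    return num_class * num_iterations


# A 3-class model trained for the default 100 iterations.
print(total_trees(num_class=3, num_iterations=100))
```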

getNumLeaves()[source]
Returns

Number of leaves

Return type

numLeaves

getNumTasks()[source]
Returns

Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

Return type

numTasks

getNumThreads()[source]
Returns

Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

Return type

numThreads

getObjective()[source]
Returns

The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

Return type

objective

getObjectiveSeed()[source]
Returns

Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

Return type

objectiveSeed

getOtherRate()[source]
Returns

The retain ratio of small gradient data. Only used in goss.

Return type

otherRate

getParallelism()[source]
Returns

Tree learner parallelism, can be set to data_parallel or voting_parallel

Return type

parallelism

getPassThroughArgs()[source]
Returns

Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true

Return type

passThroughArgs
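
Since passThroughArgs is a single string that can hold several parameters, a small helper can assemble it from a dictionary. The sketch below assumes that parameters are written as key=value pairs separated by spaces, which matches the force_row_wise=true example but is not spelled out in the description; the helper name is hypothetical.

```python
def build_pass_through_args(options):
    """Join options into a 'key=value key=value' string (assumed format)."""
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Use lowercase true/false, as in the force_row_wise=true example.
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)


args = build_pass_through_args({"deterministic": True, "force_row_wise": True})
print(args)
```

Keep in mind that, per the description, values passed this way override parameters given with explicit setters.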

getPosBaggingFraction()[source]
Returns

Positive Bagging fraction

Return type

posBaggingFraction

getPredictDisableShapeCheck()[source]
Returns

control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

Return type

predictDisableShapeCheck

getPredictionCol()[source]
Returns

prediction column name

Return type

predictionCol

getReferenceDataset()[source]
Returns

The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

Return type

referenceDataset

getRepartitionByGroupingColumn()[source]
Returns

Repartition training data according to grouping column, on by default.

Return type

repartitionByGroupingColumn

getSamplingMode()[source]
Returns

Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

Return type

samplingMode

getSamplingSubsetSize()[source]
Returns

Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.

Return type

samplingSubsetSize
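
The ‘subset’ behaviour described above, choosing binSampleCount rows from the first N rows, can be sketched in a few lines. The function name and the use of Python's random module are illustrative assumptions; the library's actual sampling routine may differ.

```python
import random


def subset_sample_rows(total_rows, sampling_subset_size, bin_sample_count, seed):
    """Pick bin_sample_count row indices from the first N rows of the dataset."""
    window = min(total_rows, sampling_subset_size)  # the first N rows
    k = min(bin_sample_count, window)               # cannot sample more than N
    rng = random.Random(seed)
    return sorted(rng.sample(range(window), k))


rows = subset_sample_rows(total_rows=10_000, sampling_subset_size=1_000,
                          bin_sample_count=200, seed=1)
print(len(rows), rows[:5])
```

No sampled index falls outside the first N rows, which is why the description recommends this mode only when row order is expected to be random.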

getSeed()[source]
Returns

Main seed, used to generate other seeds

Return type

seed

getSkipDrop()[source]
Returns

Probability of skipping the dropout procedure during a boosting iteration

Return type

skipDrop
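
The dart-related settings (dropRate, maxDrop, skipDrop) are easier to follow with a toy model of one boosting iteration. The sketch below is a deliberately simplified illustration: it drops each tree independently and then truncates to maxDrop, which is an assumption made for clarity and not necessarily how LightGBM applies the cap.

```python
import random


def select_dropped_trees(num_trees, drop_rate, max_drop, skip_drop, rng):
    """Return the indices of previous trees dropped in this iteration."""
    # With probability skip_drop the dropout step is skipped entirely.
    if num_trees == 0 or rng.random() < skip_drop:
        return []
    # Each previous tree is dropped with probability drop_rate.
    dropped = [tree for tree in range(num_trees) if rng.random() < drop_rate]
    # Cap the number of dropped trees (non-positive max_drop means no cap here).
    if max_drop > 0 and len(dropped) > max_drop:
        dropped = sorted(rng.sample(dropped, max_drop))
    return dropped


rng = random.Random(4)
print(select_dropped_trees(num_trees=20, drop_rate=0.1, max_drop=50,
                           skip_drop=0.5, rng=rng))
```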

getSlotNames()[source]
Returns

List of slot names in the features column

Return type

slotNames

getTimeout()[source]
Returns

Timeout in seconds

Return type

timeout

getTopK()[source]
Returns

The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0.

Return type

topK

getTopRate()[source]
Returns

The retain ratio of large gradient data. Only used in goss.

Return type

topRate
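
topRate and otherRate both apply to goss (Gradient-based One-Side Sampling). The sketch below illustrates the general idea as it is usually described: keep the rows with the largest gradients, sample a fraction of the remaining rows, and up-weight the sampled rows by (1 - topRate) / otherRate to compensate. The function is a hypothetical pure-Python illustration, not the library's code.

```python
import random


def goss_sample(gradients, top_rate, other_rate, seed):
    """Return {row_index: weight} for the rows kept by a GOSS-style sample."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(n * top_rate)
    other_k = int(n * other_rate)
    top, rest = order[:top_k], order[top_k:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(other_k, len(rest)))
    # Small-gradient rows are up-weighted to keep the gradient sum unbiased.
    amplify = (1.0 - top_rate) / other_rate if other_rate > 0 else 0.0
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


gradients = [0.1 * i for i in range(10)]
print(goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0))
```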

getTweedieVariancePower()[source]
Returns

control the variance of tweedie distribution, must be between 1 and 2

Return type

tweedieVariancePower

getUniformDrop()[source]
Returns

Set this to true to use uniform drop in dart mode

Return type

uniformDrop

getUseBarrierExecutionMode()[source]
Returns

Barrier execution mode which uses a barrier stage, off by default.

Return type

useBarrierExecutionMode

getUseMissing()[source]
Returns

Set this to false to disable the special handling of missing values

Return type

useMissing

getUseSingleDatasetMode()[source]
Returns

Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

Return type

useSingleDatasetMode

getValidationIndicatorCol()[source]
Returns

Indicates whether the row is for training or validation

Return type

validationIndicatorCol

getVerbosity()[source]
Returns

Verbosity level: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug

Return type

verbosity
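
The verbosity thresholds above map directly onto a small lookup function. The function name is hypothetical; the level names are the ones given in the description.

```python
def verbosity_level(verbosity):
    """Translate the integer verbosity setting into its log level name."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


for value in (-1, 0, 1, 2):
    print(value, verbosity_level(value))
```

The default in the constructor signature is -1, which corresponds to Fatal in this mapping.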

getWeightCol()[source]
Returns

The name of the weight column

Return type

weightCol

getXGBoostDartMode()[source]
Returns

Set this to true to use xgboost dart mode

Return type

xGBoostDartMode

getZeroAsMissing()[source]
Returns

Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values

Return type

zeroAsMissing

improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
isEnableSparse = Param(parent='undefined', name='isEnableSparse', doc='Used to enable/disable sparse optimization')
isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether output metric result over training dataset.')
labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices's column name")
learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense.  Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
maxCatThreshold = Param(parent='undefined', name='maxCatThreshold', doc='limit number of split points considered for categorical features')
maxCatToOnehot = Param(parent='undefined', name='maxCatToOnehot', doc='when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used')
maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
maxNumClasses = Param(parent='undefined', name='maxNumClasses', doc='Number of max classes to infer numClass in multi-class classification.')
maxStreamingOMPThreads = Param(parent='undefined', name='maxStreamingOMPThreads', doc="Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it's best to set a fixed value.")
metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data.  Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
microBatchSize = Param(parent='undefined', name='microBatchSize', doc='Specify how many elements are sent in a streaming micro-batch.')
minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
minDataPerBin = Param(parent='undefined', name='minDataPerBin', doc='Minimal number of data inside one bin')
minDataPerGroup = Param(parent='undefined', name='minDataPerGroup', doc='minimal number of data per categorical group')
minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
monotoneConstraints = Param(parent='undefined', name='monotoneConstraints', doc='used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.')
monotoneConstraintsMethod = Param(parent='undefined', name='monotoneConstraintsMethod', doc='Monotone constraints method. basic, intermediate, or advanced.')
monotonePenalty = Param(parent='undefined', name='monotonePenalty', doc='A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.')
negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks.  SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.')
objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
objectiveSeed = Param(parent='undefined', name='objectiveSeed', doc='Random seed for objectives, if random process is needed.  Currently used only for rank_xendcg objective.')
otherRate = Param(parent='undefined', name='otherRate', doc='The retain ratio of small gradient data. Only used in goss.')
parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
passThroughArgs = Param(parent='undefined', name='passThroughArgs', doc='Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true')
posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
predictDisableShapeCheck = Param(parent='undefined', name='predictDisableShapeCheck', doc='control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data')
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
classmethod read()[source]

Returns an MLReader instance for this class.

referenceDataset = Param(parent='undefined', name='referenceDataset', doc='The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().')
repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
samplingMode = Param(parent='undefined', name='samplingMode', doc="Data sampling for streaming mode. Sampled data is used to define bins. 'global': sample from all data, 'subset': sample from first N rows, or 'fixed': Take first N rows as sample. Values can be global, subset, or fixed. Default is subset.")
samplingSubsetSize = Param(parent='undefined', name='samplingSubsetSize', doc="Specify subset size N for the sampling mode 'subset'. 'binSampleCount' rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.")
seed = Param(parent='undefined', name='seed', doc='Main seed, used to generate other seeds')
setAlpha(value)[source]
Parameters

alpha – parameter for Huber loss and Quantile regression

setBaggingFraction(value)[source]
Parameters

baggingFraction – Bagging fraction

setBaggingFreq(value)[source]
Parameters

baggingFreq – Bagging frequency

setBaggingSeed(value)[source]
Parameters

baggingSeed – Bagging seed

setBinSampleCount(value)[source]
Parameters

binSampleCount – Number of samples considered when computing histogram bins

setBoostFromAverage(value)[source]
Parameters

boostFromAverage – Adjusts initial score to the mean of labels for faster convergence

setBoostingType(value)[source]
Parameters

boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).

setCatSmooth(value)[source]
Parameters

catSmooth – This can reduce the effect of noise in categorical features, especially for categories with few data

setCategoricalSlotIndexes(value)[source]
Parameters

categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column

setCategoricalSlotNames(value)[source]
Parameters

categoricalSlotNames – List of categorical column slot names, the slot name in the features column

setCatl2(value)[source]
Parameters

catl2 – L2 regularization in categorical split

setChunkSize(value)[source]
Parameters

chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.

setDataRandomSeed(value)[source]
Parameters

dataRandomSeed – Random seed for sampling data to construct histogram bins.

setDataTransferMode(value)[source]
Parameters

dataTransferMode – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.

setDefaultListenPort(value)[source]
Parameters

defaultListenPort – The default listen port on executors, used for testing

setDeterministic(value)[source]
Parameters

deterministic – Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true.

setDriverListenPort(value)[source]
Parameters

driverListenPort – The listen port on a driver. Default value is 0 (random)

setDropRate(value)[source]
Parameters

dropRate – Dropout rate: a fraction of previous trees to drop during the dropout

setDropSeed(value)[source]
Parameters

dropSeed – Random seed to choose dropping models. Only used in dart.

setEarlyStoppingRound(value)[source]
Parameters

earlyStoppingRound – Early stopping round

setExecutionMode(value)[source]
Parameters

executionMode – Deprecated. Please use dataTransferMode.

setExtraSeed(value)[source]
Parameters

extraSeed – Random seed for selecting threshold when extra_trees is true

setFeatureFraction(value)[source]
Parameters

featureFraction – Feature fraction

setFeatureFractionByNode(value)[source]
Parameters

featureFractionByNode – Feature fraction by node

setFeatureFractionSeed(value)[source]
Parameters

featureFractionSeed – Feature fraction seed

setFeaturesCol(value)[source]
Parameters

featuresCol – features column name

setFeaturesShapCol(value)[source]
Parameters

featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values

setFobj(value)[source]
Parameters

fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
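
A minimal sketch of a custom objective in the (grad, hess) form described above, using squared error. LabeledData is a hypothetical stand-in for the train_data argument (only a get_label() accessor is assumed), and how SynapseML invokes the callable is not shown here.

```python
from typing import List, Sequence, Tuple


class LabeledData:
    """Hypothetical stand-in for train_data; it only exposes the labels."""

    def __init__(self, labels: Sequence[float]) -> None:
        self._labels = list(labels)

    def get_label(self) -> List[float]:
        return self._labels


def squared_error_objective(
    preds: Sequence[float], train_data: LabeledData
) -> Tuple[List[float], List[float]]:
    """Per-row gradient and Hessian of 0.5 * (pred - label) ** 2."""
    labels = train_data.get_label()
    grad = [p - y for p, y in zip(preds, labels)]  # first derivative w.r.t. pred
    hess = [1.0 for _ in preds]                    # second derivative is constant
    return grad, hess


grad, hess = squared_error_objective([0.5, 2.0], LabeledData([1.0, 1.0]))
```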

setImprovementTolerance(value)[source]
Parameters

improvementTolerance – Tolerance to consider improvement in metric

setInitScoreCol(value)[source]
Parameters

initScoreCol – The name of the initial score column, used for continued training

setIsEnableSparse(value)[source]
Parameters

isEnableSparse – Used to enable/disable sparse optimization

setIsProvideTrainingMetric(value)[source]
Parameters

isProvideTrainingMetric – Whether to output metric results over the training dataset.

setLabelCol(value)[source]
Parameters

labelCol – label column name

setLambdaL1(value)[source]
Parameters

lambdaL1 – L1 regularization

setLambdaL2(value)[source]
Parameters

lambdaL2 – L2 regularization

setLeafPredictionCol(value)[source]
Parameters

leafPredictionCol – Column name for the predicted leaf indices

setLearningRate(value)[source]
Parameters

learningRate – Learning rate or shrinkage rate

setMatrixType(value)[source]
Parameters

matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.

setMaxBin(value)[source]
Parameters

maxBin – Max bin

setMaxBinByFeature(value)[source]
Parameters

maxBinByFeature – Max number of bins for each feature

setMaxCatThreshold(value)[source]
Parameters

maxCatThreshold – limit number of split points considered for categorical features

setMaxCatToOnehot(value)[source]
Parameters

maxCatToOnehot – When the number of categories of one feature is smaller than or equal to this value, the one-vs-other split algorithm will be used

setMaxDeltaStep(value)[source]
Parameters

maxDeltaStep – Used to limit the max output of tree leaves

setMaxDepth(value)[source]
Parameters

maxDepth – Max depth

setMaxDrop(value)[source]
Parameters

maxDrop – Max number of dropped trees during one boosting iteration

setMaxNumClasses(value)[source]
Parameters

maxNumClasses – Maximum number of classes to consider when inferring numClass in multi-class classification.

setMaxStreamingOMPThreads(value)[source]
Parameters

maxStreamingOMPThreads – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.

setMetric(value)[source]
Parameters

metric – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
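
The alias lists above can be read as a lookup from any accepted name to one canonical metric name. The sketch below covers only a subset of the metrics listed; canonical_metric is a hypothetical helper for illustration and is not part of the SynapseML API.

```python
from typing import Dict, List

# Canonical metric name -> aliases, copied from the description above (subset).
METRIC_ALIASES: Dict[str, List[str]] = {
    "l1": ["mean_absolute_error", "mae", "regression_l1"],
    "l2": ["mean_squared_error", "mse", "regression_l2", "regression"],
    "rmse": ["root_mean_squared_error", "l2_root"],
    "mape": ["mean_absolute_percentage_error"],
    "ndcg": ["lambdarank"],
    "map": ["mean_average_precision"],
    "binary_logloss": ["binary"],
    "cross_entropy": ["xentropy"],
    "kullback_leibler": ["kldiv"],
}

# Invert the table once so each alias points at its canonical name.
_ALIAS_TO_CANONICAL: Dict[str, str] = {
    alias: canonical
    for canonical, aliases in METRIC_ALIASES.items()
    for alias in aliases
}


def canonical_metric(name: str) -> str:
    """Return the canonical name for an alias; unknown names pass through."""
    return _ALIAS_TO_CANONICAL.get(name, name)
```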

setMicroBatchSize(value)[source]
Parameters

microBatchSize – Specify how many elements are sent in a streaming micro-batch.

setMinDataInLeaf(value)[source]
Parameters

minDataInLeaf – Minimal number of data in one leaf. Can be used to deal with over-fitting.

setMinDataPerBin(value)[source]
Parameters

minDataPerBin – Minimal number of data inside one bin

setMinDataPerGroup(value)[source]
Parameters

minDataPerGroup – minimal number of data per categorical group

setMinGainToSplit(value)[source]
Parameters

minGainToSplit – The minimal gain to perform split

setMinSumHessianInLeaf(value)[source]
Parameters

minSumHessianInLeaf – Minimal sum hessian in one leaf

setModelString(value)[source]
Parameters

modelString – LightGBM model to retrain

setMonotoneConstraints(value)[source]
Parameters

monotoneConstraints – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
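
Because every feature must be listed in order, it can help to build the constraint list from the feature names. A minimal sketch follows; build_monotone_constraints and the feature names are hypothetical and only illustrate the 1 / -1 / 0 encoding described above.

```python
from typing import Dict, List, Sequence


def build_monotone_constraints(
    feature_names: Sequence[str], directions: Dict[str, int]
) -> List[int]:
    """One entry per feature, in feature order; unlisted features get 0."""
    constraints = []
    for name in feature_names:
        value = directions.get(name, 0)
        if value not in (-1, 0, 1):
            raise ValueError(f"constraint for {name!r} must be -1, 0 or 1")
        constraints.append(value)
    return constraints


features = ["age", "income", "zip_code"]  # hypothetical feature order
constraints = build_monotone_constraints(features, {"age": 1, "income": -1})
```

The resulting list could then be passed to setMonotoneConstraints.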

setMonotoneConstraintsMethod(value)[source]
Parameters

monotoneConstraintsMethod – Monotone constraints method. basic, intermediate, or advanced.

setMonotonePenalty(value)[source]
Parameters

monotonePenalty – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
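
The "rounded down" rule can be illustrated with a small sketch. It assumes the root of the tree is level 0, which the description above does not state, and it covers only the forbidden-levels behaviour, not any other effect the penalty may have on split gain.

```python
import math


def forbidden_levels(monotone_penalty: float) -> int:
    """Number of top tree levels on which monotone splits are forbidden."""
    return max(0, math.floor(monotone_penalty))


def monotone_split_allowed(depth: int, monotone_penalty: float) -> bool:
    """True if a monotone split may occur at this depth (root assumed to be 0)."""
    return depth >= forbidden_levels(monotone_penalty)
```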

setNegBaggingFraction(value)[source]
Parameters

negBaggingFraction – Negative Bagging fraction

setNumBatches(value)[source]
Parameters

numBatches – If greater than 0, splits data into separate batches during training

setNumIterations(value)[source]
Parameters

numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
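
As a quick worked example of the num_class * num_iterations relationship (total_trees is a hypothetical helper, and num_class=1 is assumed for single-output regression):

```python
def total_trees(num_iterations: int, num_class: int = 1) -> int:
    """Trees built in total: one tree per class per boosting iteration."""
    return num_class * num_iterations


regression_trees = total_trees(100)               # single output
multiclass_trees = total_trees(100, num_class=3)  # three classes
```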

setNumLeaves(value)[source]
Parameters

numLeaves – Number of leaves

setNumTasks(value)[source]
Parameters

numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.

setNumThreads(value)[source]
Parameters

numThreads – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.

setObjective(value)[source]
Parameters

objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.

setObjectiveSeed(value)[source]
Parameters

objectiveSeed – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.

setOtherRate(value)[source]
Parameters

otherRate – The retain ratio of small gradient data. Only used in goss.

setParallelism(value)[source]
Parameters

parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel

setParams(alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='bulk', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=-1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]

Set the (keyword only) parameters
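
Since setParams takes keyword arguments only, one convenient pattern is to collect the settings in a dictionary and unpack it. The sketch below builds such a dictionary in plain Python. compact_params is a hypothetical helper, and the commented-out call assumes the estimator documented here is LightGBMRegressor, which is inferred from the alpha and tweedieVariancePower keywords rather than stated in this section.

```python
from typing import Any, Dict


def compact_params(**kwargs: Any) -> Dict[str, Any]:
    """Drop entries left as None so only explicitly chosen values are passed."""
    return {key: value for key, value in kwargs.items() if value is not None}


params = compact_params(
    objective="quantile",
    alpha=0.2,
    learningRate=0.05,
    numLeaves=31,
    weightCol=None,  # not chosen, so it is dropped
)

# Assumed usage (requires a Spark session with SynapseML installed):
#   from synapse.ml.lightgbm import LightGBMRegressor
#   estimator = LightGBMRegressor().setParams(**params)
```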

setPassThroughArgs(value)[source]
Parameters

passThroughArgs – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
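
The override behaviour can be pictured as a merge in which pass-through entries win. This is a sketch only: it assumes multiple parameters are separated by spaces (as in native LightGBM parameter strings), and merge_pass_through and to_param_string are hypothetical helpers, not SynapseML functions.

```python
from typing import Dict


def merge_pass_through(explicit: Dict[str, str], pass_through: str) -> Dict[str, str]:
    """Merge key=value tokens over explicitly set params; pass-through wins."""
    merged = dict(explicit)
    for token in pass_through.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        merged[key] = value
    return merged


def to_param_string(params: Dict[str, str]) -> str:
    """Render params as a space-separated key=value string."""
    return " ".join(f"{key}={value}" for key, value in params.items())


explicit = {"num_leaves": "31", "force_row_wise": "false"}
merged = merge_pass_through(explicit, "force_row_wise=true min_data_in_leaf=5")
param_string = to_param_string(merged)
```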

setPosBaggingFraction(value)[source]
Parameters

posBaggingFraction – Positive Bagging fraction

setPredictDisableShapeCheck(value)[source]
Parameters

predictDisableShapeCheck – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data

setPredictionCol(value)[source]
Parameters

predictionCol – prediction column name

setReferenceDataset(value)[source]
Parameters

referenceDataset – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().

setRepartitionByGroupingColumn(value)[source]
Parameters

repartitionByGroupingColumn – Repartition training data according to grouping column, on by default.

setSamplingMode(value)[source]
Parameters

samplingMode – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: Take first N rows as sample. Values can be global, subset, or fixed. Default is subset.

setSamplingSubsetSize(value)[source]
Parameters

samplingSubsetSize – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
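
The 'subset' mode described above (choose binSampleCount rows from the first N rows) can be sketched in plain Python. sample_subset is a hypothetical helper that works on an in-memory list; the real implementation samples distributed Spark data and may differ in detail.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_subset(
    rows: Sequence[T], subset_size: int, bin_sample_count: int, seed: int = 1
) -> List[T]:
    """Pick bin_sample_count rows at random from the first subset_size rows."""
    head = list(rows[:subset_size])
    if len(head) <= bin_sample_count:
        return head  # fewer rows than requested: use all of them
    return random.Random(seed).sample(head, bin_sample_count)


sample = sample_subset(list(range(1000)), subset_size=100, bin_sample_count=10)
```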

setSeed(value)[source]
Parameters

seed – Main seed, used to generate other seeds

setSkipDrop(value)[source]
Parameters

skipDrop – Probability of skipping the dropout procedure during a boosting iteration
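
A simplified sketch of how skipDrop, dropRate and maxDrop could interact in one dart iteration. It assumes every tree is dropped independently with probability dropRate (a uniform drop), which is a simplification rather than LightGBM's exact procedure; choose_trees_to_drop is a hypothetical helper.

```python
import random
from typing import List


def choose_trees_to_drop(
    num_trees: int,
    drop_rate: float,
    skip_drop: float,
    max_drop: int,
    rng: random.Random,
) -> List[int]:
    """Return the indexes of the trees dropped in this boosting iteration."""
    if rng.random() < skip_drop:
        return []  # dropout is skipped entirely for this iteration
    selected = [i for i in range(num_trees) if rng.random() < drop_rate]
    return selected[:max_drop]  # never drop more than max_drop trees


dropped = choose_trees_to_drop(
    10, drop_rate=0.1, skip_drop=0.5, max_drop=50, rng=random.Random(4)
)
```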

setSlotNames(value)[source]
Parameters

slotNames – List of slot names in the features column

setTimeout(value)[source]
Parameters

timeout – Timeout in seconds

setTopK(value)[source]
Parameters

topK – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0

setTopRate(value)[source]
Parameters

topRate – The retain ratio of large gradient data. Only used in goss.
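
A sketch of Gradient-based One-Side Sampling, showing how topRate and otherRate (documented above) work together: the rows with the largest absolute gradients are always kept, and a random share of the remaining rows is kept and re-weighted by (1 - topRate) / otherRate. goss_sample is a hypothetical helper on plain lists, not the library's implementation.

```python
import random
from typing import Dict, Sequence


def goss_sample(
    gradients: Sequence[float], top_rate: float, other_rate: float, seed: int = 0
) -> Dict[int, float]:
    """Return {row index: weight} for the rows retained by GOSS."""
    if other_rate <= 0.0:
        raise ValueError("other_rate must be positive")
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(n * top_rate)
    other_n = int(n * other_rate)
    top_idx, rest = order[:top_n], order[top_n:]
    sampled = random.Random(seed).sample(rest, min(other_n, len(rest)))
    # Small-gradient rows are amplified to compensate for the down-sampling.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top_idx}
    weights.update({i: amplify for i in sampled})
    return weights


weights = goss_sample([0.1 * i for i in range(10)], top_rate=0.2, other_rate=0.1)
```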

setTweedieVariancePower(value)[source]
Parameters

tweedieVariancePower – control the variance of tweedie distribution, must be between 1 and 2

setUniformDrop(value)[source]
Parameters

uniformDrop – Set this to true to use uniform drop in dart mode

setUseBarrierExecutionMode(value)[source]
Parameters

useBarrierExecutionMode – Barrier execution mode which uses a barrier stage, off by default.

setUseMissing(value)[source]
Parameters

useMissing – Set this to false to disable the special handling of missing values

setUseSingleDatasetMode(value)[source]
Parameters

useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.

setValidationIndicatorCol(value)[source]
Parameters

validationIndicatorCol – Indicates whether the row is for training or validation

setVerbosity(value)[source]
Parameters

verbosity – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
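
The verbosity thresholds above map to log levels as in this small sketch (verbosity_level is a hypothetical helper for illustration):

```python
def verbosity_level(verbosity: int) -> str:
    """Translate a verbosity value into the log level named in the description."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


levels = [verbosity_level(v) for v in (-1, 0, 1, 2)]
```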

setWeightCol(value)[source]
Parameters

weightCol – The name of the weight column

setXGBoostDartMode(value)[source]
Parameters

xGBoostDartMode – Set this to true to use xgboost dart mode

setZeroAsMissing(value)[source]
Parameters

zeroAsMissing – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values

skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
topK = Param(parent='undefined', name='topK', doc='The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0')
topRate = Param(parent='undefined', name='topRate', doc='The retain ratio of large gradient data. Only used in goss.')
tweedieVariancePower = Param(parent='undefined', name='tweedieVariancePower', doc='control the variance of tweedie distribution, must be between 1 and 2')
uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
useMissing = Param(parent='undefined', name='useMissing', doc='Set this to false to disable the special handling of missing values')
useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.')
validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
xGBoostDartMode = Param(parent='undefined', name='xGBoostDartMode', doc='Set this to true to use xgboost dart mode')
zeroAsMissing = Param(parent='undefined', name='zeroAsMissing', doc='Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values')

synapse.ml.lightgbm.mixin module

class synapse.ml.lightgbm.mixin.LightGBMModelMixin[source]

Bases: object

getBoosterBestIteration()[source]

Get the best iteration from the booster.

Returns

The best iteration, if early stopping was triggered.

getBoosterNumFeatures()[source]

Get the number of features from the booster.

Returns

The number of features.

getBoosterNumTotalIterations()[source]

Get the total number of iterations trained.

Returns

The total number of iterations trained.

getBoosterNumTotalModel()[source]

Get the total number of models trained.

Returns

The total number of models.

getFeatureImportances(importance_type='split')[source]

Get the feature importances as a list. The importance_type can be “split” or “gain”.
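
To illustrate the difference between the two importance types, the sketch below computes both from a list of (feature index, gain) split records: "split" counts how often a feature is used to split, and "gain" sums the gain of those splits. The records and the feature_importances helper are hypothetical; the real method reads these values from the trained booster.

```python
from typing import List, Sequence, Tuple


def feature_importances(
    splits: Sequence[Tuple[int, float]],
    num_features: int,
    importance_type: str = "split",
) -> List[float]:
    """Aggregate split records into one importance value per feature."""
    if importance_type not in ("split", "gain"):
        raise ValueError("importance_type must be 'split' or 'gain'")
    totals = [0.0] * num_features
    for feature_index, gain in splits:
        # "split" counts uses of the feature; "gain" accumulates split gain.
        totals[feature_index] += 1.0 if importance_type == "split" else gain
    return totals


splits = [(0, 4.0), (2, 1.5), (0, 0.5)]  # hypothetical split records
by_split = feature_importances(splits, num_features=3, importance_type="split")
by_gain = feature_importances(splits, num_features=3, importance_type="gain")
```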

getFeatureShaps(vector)[source]

Get the local shap feature importances.

getNativeModel()[source]

Get the native model serialized representation as a string.

saveNativeModel(filename, overwrite=True)[source]

Save the booster in string format to a local or remote WASB location.

setPredictDisableShapeCheck(value=None)[source]

Set whether to disable the shape check when predicting.

Module contents

SynapseML is an ecosystem of tools aimed towards expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM and OpenCV. These tools enable powerful and highly-scalable predictive and analytical models for a variety of datasources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.