synapse.ml.lightgbm package
Submodules
synapse.ml.lightgbm.LightGBMClassificationModel module
- class synapse.ml.lightgbm.LightGBMClassificationModel.LightGBMClassificationModel(java_obj=None, actualNumClasses=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', startIteration=0, thresholds=None)[source]
Bases: synapse.ml.lightgbm.mixin.LightGBMModelMixin, synapse.ml.lightgbm._LightGBMClassificationModel._LightGBMClassificationModel
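As a usage illustration (a minimal sketch, not part of the generated reference): a fitted LightGBMClassificationModel is a standard Spark ML transformer, and the mixin also exposes native-model helpers such as saveNativeModel and loadNativeModelFromFile. The DataFrame test_df and the model path below are assumptions.

    # Minimal scoring sketch; `test_df` (with a vector column "features")
    # and the model file path are assumptions for illustration.
    from synapse.ml.lightgbm import LightGBMClassificationModel

    model = LightGBMClassificationModel.loadNativeModelFromFile("/tmp/lgbm_model.txt")
    scored = model.transform(test_df)  # adds rawPrediction, probability, prediction
    scored.select("prediction", "probability").show(5)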
synapse.ml.lightgbm.LightGBMClassifier module
- class synapse.ml.lightgbm.LightGBMClassifier.LightGBMClassifier(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator
- Parameters
baggingFraction (float) – Bagging fraction
baggingFreq (int) – Bagging frequency
baggingSeed (int) – Bagging seed
binSampleCount (int) – Number of samples considered at computing histogram bins
boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence
boostingType (object) – Boosting type; the default, gbdt, is the traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column
categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column
chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
defaultListenPort (int) – The default listen port on executors, used for testing
driverListenPort (int) – The listen port on a driver. Default value is 0 (random)
dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout
earlyStoppingRound (int) – Early stopping round
featureFraction (float) – Feature fraction
featuresCol (object) – features column name
featuresShapCol (object) – Output SHAP vector column name after prediction containing the feature contribution values
fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
improvementTolerance (float) – Tolerance to consider improvement in metric
initScoreCol (object) – The name of the initial score column, used for continued training
isProvideTrainingMetric (bool) – Whether to output the metric result over the training dataset.
isUnbalance (bool) – Set to true if training data is unbalanced in a binary classification scenario
labelCol (object) – label column name
lambdaL1 (float) – L1 regularization
lambdaL2 (float) – L2 regularization
leafPredictionCol (object) – Predicted leaf indices' column name
learningRate (float) – Learning rate or shrinkage rate
matrixType (object) – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. The default value is auto, which samples the first ten rows to determine the type.
maxBin (int) – Max bin
maxBinByFeature (list) – Max number of bins for each feature
maxDeltaStep (float) – Used to limit the max output of tree leaves
maxDepth (int) – Max depth
maxDrop (int) – Max number of dropped trees during one boosting iteration
metric (object) – Metrics to be evaluated on the evaluation data. Options are:
empty string or not specified – the metric corresponding to the specified objective is used (possible only for pre-defined objective functions; otherwise no evaluation metric is added)
None (the string, not a None value) – no metric is registered; aliases: na, null, custom
l1 – absolute loss; aliases: mean_absolute_error, mae, regression_l1
l2 – square loss; aliases: mean_squared_error, mse, regression_l2, regression
rmse – root square loss; aliases: root_mean_squared_error, l2_root
quantile – Quantile regression
mape – MAPE loss; aliases: mean_absolute_percentage_error
huber – Huber loss
fair – Fair loss
poisson – negative log-likelihood for Poisson regression
gamma – negative log-likelihood for Gamma regression
gamma_deviance – residual deviance for Gamma regression
tweedie – negative log-likelihood for Tweedie regression
ndcg – NDCG; aliases: lambdarank
map – MAP; aliases: mean_average_precision
auc – AUC
binary_logloss – log loss; aliases: binary
binary_error – for one sample: 0 for correct classification, 1 for incorrect classification
multi_logloss – log loss for multi-class classification; aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr
multi_error – error rate for multi-class classification
cross_entropy – cross-entropy (with optional linear weights); aliases: xentropy
cross_entropy_lambda – intensity-weighted cross-entropy; aliases: xentlambda
kullback_leibler – Kullback-Leibler divergence; aliases: kldiv
minDataInLeaf (int) – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
minGainToSplit (float) – The minimal gain to perform split
minSumHessianInLeaf (float) – Minimal sum of the Hessian in one leaf
modelString (object) – LightGBM model to retrain
negBaggingFraction (float) – Negative Bagging fraction
numBatches (int) – If greater than 0, splits data into separate batches during training
numIterations (int) – Number of iterations; LightGBM constructs num_class * num_iterations trees
numLeaves (int) – Number of leaves
numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
objective (object) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
parallelism (object) – Tree learner parallelism, can be set to data_parallel or voting_parallel
posBaggingFraction (float) – Positive Bagging fraction
predictionCol (object) – prediction column name
probabilityCol (object) – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
rawPredictionCol (object) – raw prediction (a.k.a. confidence) column name
repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.
skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration
slotNames (list) – List of slot names in the features column
thresholds (list) – Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
timeout (float) – Timeout in seconds
topK (int) – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down training. It should be greater than 0
uniformDrop (bool) – Set this to true to use uniform drop in dart mode
useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.
useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running Spark in local mode.
validationIndicatorCol (object) – Indicates whether the row is for training or validation
verbosity (int) – Verbosity, where < 0 is Fatal, 0 is Error, 1 is Info, and > 1 is Debug
weightCol (object) – The name of the weight column
xgboostDartMode (bool) – Set this to true to use xgboost dart mode
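For orientation, a minimal training sketch using the setters documented below; the DataFrame train_df and its column names are assumptions, not part of this reference:

    # Hedged example: fit a binary classifier on a Spark DataFrame that has
    # a vector column "features" and a numeric column "label".
    from synapse.ml.lightgbm import LightGBMClassifier

    classifier = (LightGBMClassifier()
                  .setObjective("binary")
                  .setFeaturesCol("features")
                  .setLabelCol("label")
                  .setNumIterations(100)
                  .setLearningRate(0.1))
    model = classifier.fit(train_df)  # returns a LightGBMClassificationModel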
- baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
- baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
- baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
- binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
- boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
- boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
- categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
- categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
- chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.')
- defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
- driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
- dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
- earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
- featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
- featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
- fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
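The fobj contract above (accept preds and train_data, return (grad, hess)) can be illustrated with a sketch. The binary-logloss gradient below and the train_data.get_label() accessor follow LightGBM's usual custom-objective convention; they are assumptions for illustration, not part of this reference.

    # Hypothetical custom objective matching the documented
    # (preds, train_data) -> (grad, hess) contract, shown for binary log loss.
    import numpy as np

    def binary_logloss_obj(preds, train_data):
        labels = train_data.get_label()        # assumed LightGBM Dataset accessor
        probs = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw scores
        grad = probs - labels                  # first derivative of the loss
        hess = probs * (1.0 - probs)           # second derivative of the loss
        return grad, hess

    # classifier.setFobj(binary_logloss_obj)   # wired in via the setter below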
- getBinSampleCount()[source]
- Returns
Number of samples considered at computing histogram bins
- Return type
binSampleCount
- getBoostFromAverage()[source]
- Returns
Adjusts initial score to the mean of labels for faster convergence
- Return type
boostFromAverage
- getBoostingType()[source]
- Returns
Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- Return type
boostingType
- getCategoricalSlotIndexes()[source]
- Returns
List of categorical column indexes, the slot index in the features column
- Return type
categoricalSlotIndexes
- getCategoricalSlotNames()[source]
- Returns
List of categorical column slot names, the slot name in the features column
- Return type
categoricalSlotNames
- getChunkSize()[source]
- Returns
Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- Return type
chunkSize
- getDefaultListenPort()[source]
- Returns
The default listen port on executors, used for testing
- Return type
defaultListenPort
- getDriverListenPort()[source]
- Returns
The listen port on a driver. Default value is 0 (random)
- Return type
driverListenPort
- getDropRate()[source]
- Returns
Dropout rate: a fraction of previous trees to drop during the dropout
- Return type
dropRate
- getFeaturesShapCol()[source]
- Returns
Output SHAP vector column name after prediction containing the feature contribution values
- Return type
featuresShapCol
- getFobj()[source]
- Returns
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- Return type
fobj
- getImprovementTolerance()[source]
- Returns
Tolerance to consider improvement in metric
- Return type
improvementTolerance
- getInitScoreCol()[source]
- Returns
The name of the initial score column, used for continued training
- Return type
initScoreCol
- getIsProvideTrainingMetric()[source]
- Returns
Whether to output the metric result over the training dataset.
- Return type
isProvideTrainingMetric
- getIsUnbalance()[source]
- Returns
Set to true if training data is unbalanced in binary classification scenario
- Return type
isUnbalance
- getLeafPredictionCol()[source]
- Returns
Predicted leaf indices' column name
- Return type
leafPredictionCol
- getMatrixType()[source]
- Returns
Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- Return type
matrixType
- getMaxBinByFeature()[source]
- Returns
Max number of bins for each feature
- Return type
maxBinByFeature
- getMaxDeltaStep()[source]
- Returns
Used to limit the max output of tree leaves
- Return type
maxDeltaStep
- getMaxDrop()[source]
- Returns
Max number of dropped trees during one boosting iteration
- Return type
maxDrop
- getMetric()[source]
- Returns
Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- Return type
metric
- getMinDataInLeaf()[source]
- Returns
Minimal number of data in one leaf. Can be used to deal with over-fitting.
- Return type
minDataInLeaf
- getMinSumHessianInLeaf()[source]
- Returns
Minimal sum hessian in one leaf
- Return type
minSumHessianInLeaf
- getNumBatches()[source]
- Returns
If greater than 0, splits data into separate batches during training
- Return type
numBatches
- getNumIterations()[source]
- Returns
Number of iterations, LightGBM constructs num_class * num_iterations trees
- Return type
numIterations
- getNumTasks()[source]
- Returns
Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- Return type
numTasks
- getNumThreads()[source]
- Returns
Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- Return type
numThreads
- getObjective()[source]
- Returns
The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- Return type
objective
- getParallelism()[source]
- Returns
Tree learner parallelism, can be set to data_parallel or voting_parallel
- Return type
parallelism
- getProbabilityCol()[source]
- Returns
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- Return type
probabilityCol
- getRawPredictionCol()[source]
- Returns
raw prediction (a.k.a. confidence) column name
- Return type
rawPredictionCol
- getRepartitionByGroupingColumn()[source]
- Returns
Repartition training data according to grouping column, on by default.
- Return type
repartitionByGroupingColumn
- getSkipDrop()[source]
- Returns
Probability of skipping the dropout procedure during a boosting iteration
- Return type
skipDrop
- getThresholds()[source]
- Returns
Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
- Return type
thresholds
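To make the p/t rule above concrete, a small worked example (all numbers invented):

    # With thresholds t = [1.0, 0.5, 0.5] and probabilities p = [0.5, 0.3, 0.2],
    # the predicted class maximizes p/t, so class 1 wins despite a lower
    # raw probability than class 0.
    probs = [0.5, 0.3, 0.2]
    thresholds = [1.0, 0.5, 0.5]
    ratios = [p / t for p, t in zip(probs, thresholds)]  # [0.5, 0.6, 0.4]
    prediction = ratios.index(max(ratios))               # 1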
- getTopK()[source]
- Returns
The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0
- Return type
topK
- getUniformDrop()[source]
- Returns
Set this to true to use uniform drop in dart mode
- Return type
uniformDrop
- getUseBarrierExecutionMode()[source]
- Returns
Barrier execution mode which uses a barrier stage, off by default.
- Return type
useBarrierExecutionMode
- getUseSingleDatasetMode()[source]
- Returns
Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running spark in local mode.
- Return type
useSingleDatasetMode
- getValidationIndicatorCol()[source]
- Returns
Indicates whether the row is for training or validation
- Return type
validationIndicatorCol
- getVerbosity()[source]
- Returns
Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
- Return type
verbosity
- getXgboostDartMode()[source]
- Returns
Set this to true to use xgboost dart mode
- Return type
xgboostDartMode
- improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
- initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
- isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether to output the metric result over the training dataset.')
- isUnbalance = Param(parent='undefined', name='isUnbalance', doc='Set to true if training data is unbalanced in binary classification scenario')
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
- lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
- lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
- leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices' column name")
- learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
- matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
- maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
- maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
- maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
- maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
- metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
- minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
- minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
- minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
- modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
- negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
- numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
- numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
- numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
- numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
- numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.')
- objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
- parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
- posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
- predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
- probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities')
- rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')
- repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
- setBinSampleCount(value)[source]
- Parameters
binSampleCount – Number of samples considered at computing histogram bins
- setBoostFromAverage(value)[source]
- Parameters
boostFromAverage – Adjusts initial score to the mean of labels for faster convergence
- setBoostingType(value)[source]
- Parameters
boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- setCategoricalSlotIndexes(value)[source]
- Parameters
categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column
- setCategoricalSlotNames(value)[source]
- Parameters
categoricalSlotNames – List of categorical column slot names, the slot name in the features column
- setChunkSize(value)[source]
- Parameters
chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- setDefaultListenPort(value)[source]
- Parameters
defaultListenPort – The default listen port on executors, used for testing
- setDriverListenPort(value)[source]
- Parameters
driverListenPort – The listen port on a driver. Default value is 0 (random)
- setDropRate(value)[source]
- Parameters
dropRate – Dropout rate: a fraction of previous trees to drop during the dropout
- setFeaturesShapCol(value)[source]
- Parameters
featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values
- setFobj(value)[source]
- Parameters
fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- setImprovementTolerance(value)[source]
- Parameters
improvementTolerance – Tolerance to consider improvement in metric
- setInitScoreCol(value)[source]
- Parameters
initScoreCol – The name of the initial score column, used for continued training
- setIsProvideTrainingMetric(value)[source]
- Parameters
isProvideTrainingMetric – Whether to output the metric result over the training dataset.
- setIsUnbalance(value)[source]
- Parameters
isUnbalance – Set to true if training data is unbalanced in binary classification scenario
- setLeafPredictionCol(value)[source]
- Parameters
leafPredictionCol – Predicted leaf indices' column name
- setMatrixType(value)[source]
- Parameters
matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- setMaxDeltaStep(value)[source]
- Parameters
maxDeltaStep – Used to limit the max output of tree leaves
- setMaxDrop(value)[source]
- Parameters
maxDrop – Max number of dropped trees during one boosting iteration
- setMetric(value)[source]
- Parameters
metric – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- setMinDataInLeaf(value)[source]
- Parameters
minDataInLeaf – Minimal number of data in one leaf. Can be used to deal with over-fitting.
- setMinSumHessianInLeaf(value)[source]
- Parameters
minSumHessianInLeaf – Minimal sum hessian in one leaf
- setNumBatches(value)[source]
- Parameters
numBatches – If greater than 0, splits data into separate batches during training
- setNumIterations(value)[source]
- Parameters
numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
- setNumTasks(value)[source]
- Parameters
numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- setNumThreads(value)[source]
- Parameters
numThreads – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- setObjective(value)[source]
- Parameters
objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- setParallelism(value)[source]
- Parameters
parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel
- setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Set the (keyword-only) parameters
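Since setParams is keyword-only, re-configuring an existing estimator looks like this (the parameter values are illustrative, not recommendations):

    # Hedged sketch: tune a few of the keyword-only parameters in one call
    # on the `classifier` instance from the earlier example.
    classifier.setParams(numIterations=200, learningRate=0.05, earlyStoppingRound=10)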
- setProbabilityCol(value)[source]
- Parameters
probabilityCol – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- setRawPredictionCol(value)[source]
- Parameters
rawPredictionCol – raw prediction (a.k.a. confidence) column name
- setRepartitionByGroupingColumn(value)[source]
- Parameters
repartitionByGroupingColumn – Repartition training data according to grouping column, on by default.
- setSkipDrop(value)[source]
- Parameters
skipDrop – Probability of skipping the dropout procedure during a boosting iteration
- setThresholds(value)[source]
- Parameters
thresholds – Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
- setTopK(value)[source]
- Parameters
topK – The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0
- setUniformDrop(value)[source]
- Parameters
uniformDrop – Set this to true to use uniform drop in dart mode
- setUseBarrierExecutionMode(value)[source]
- Parameters
useBarrierExecutionMode – Barrier execution mode which uses a barrier stage, off by default.
- setUseSingleDatasetMode(value)[source]
- Parameters
useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running spark in local mode.
- setValidationIndicatorCol(value)[source]
- Parameters
validationIndicatorCol – Indicates whether the row is for training or validation
- setVerbosity(value)[source]
- Parameters
verbosity – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
- setXgboostDartMode(value)[source]
- Parameters
xgboostDartMode – Set this to true to use xgboost dart mode
- skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
- slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
- thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold")
- timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
- topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0')
- uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
- useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
- useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running spark in local mode.')
- validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
- verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
- xgboostDartMode = Param(parent='undefined', name='xgboostDartMode', doc='Set this to true to use xgboost dart mode')
synapse.ml.lightgbm.LightGBMRanker module
- class synapse.ml.lightgbm.LightGBMRanker.LightGBMRanker(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxPosition=20, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator
- Parameters
baggingFraction (float) – Bagging fraction
baggingFreq (int) – Bagging frequency
baggingSeed (int) – Bagging seed
binSampleCount (int) – Number of samples considered at computing histogram bins
boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence
boostingType (object) – Boosting type; the default, gbdt, is the traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column
categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column
chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
defaultListenPort (int) – The default listen port on executors, used for testing
driverListenPort (int) – The listen port on a driver. Default value is 0 (random)
dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout
earlyStoppingRound (int) – Early stopping round
evalAt (list) – NDCG and MAP evaluation positions, separated by commas
featureFraction (float) – Feature fraction
featuresCol (object) – features column name
featuresShapCol (object) – Output SHAP vector column name after prediction containing the feature contribution values
fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
groupCol (object) – The name of the group column
improvementTolerance (float) – Tolerance to consider improvement in metric
initScoreCol (object) – The name of the initial score column, used for continued training
isProvideTrainingMetric (bool) – Whether to output the metric result over the training dataset.
labelCol (object) – label column name
labelGain (list) – graded relevance for each label in NDCG
lambdaL1 (float) – L1 regularization
lambdaL2 (float) – L2 regularization
leafPredictionCol (object) – Predicted leaf indices' column name
learningRate (float) – Learning rate or shrinkage rate
matrixType (object) – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. The default value is auto, which samples the first ten rows to determine the type.
maxBin (int) – Max bin
maxBinByFeature (list) – Max number of bins for each feature
maxDeltaStep (float) – Used to limit the max output of tree leaves
maxDepth (int) – Max depth
maxDrop (int) – Max number of dropped trees during one boosting iteration
maxPosition (int) – optimizes NDCG at this position
metric (object) – Metrics to be evaluated on the evaluation data. Options are:
empty string or not specified – the metric corresponding to the specified objective is used (possible only for pre-defined objective functions; otherwise no evaluation metric is added)
None (the string, not a None value) – no metric is registered; aliases: na, null, custom
l1 – absolute loss; aliases: mean_absolute_error, mae, regression_l1
l2 – square loss; aliases: mean_squared_error, mse, regression_l2, regression
rmse – root square loss; aliases: root_mean_squared_error, l2_root
quantile – Quantile regression
mape – MAPE loss; aliases: mean_absolute_percentage_error
huber – Huber loss
fair – Fair loss
poisson – negative log-likelihood for Poisson regression
gamma – negative log-likelihood for Gamma regression
gamma_deviance – residual deviance for Gamma regression
tweedie – negative log-likelihood for Tweedie regression
ndcg – NDCG; aliases: lambdarank
map – MAP; aliases: mean_average_precision
auc – AUC
binary_logloss – log loss; aliases: binary
binary_error – for one sample: 0 for correct classification, 1 for incorrect classification
multi_logloss – log loss for multi-class classification; aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr
multi_error – error rate for multi-class classification
cross_entropy – cross-entropy (with optional linear weights); aliases: xentropy
cross_entropy_lambda – intensity-weighted cross-entropy; aliases: xentlambda
kullback_leibler – Kullback-Leibler divergence; aliases: kldiv
minDataInLeaf (int) – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
minGainToSplit (float) – The minimal gain to perform split
minSumHessianInLeaf (float) – Minimal sum of the Hessian in one leaf
modelString (object) – LightGBM model to retrain
negBaggingFraction (float) – Negative Bagging fraction
numBatches (int) – If greater than 0, splits data into separate batches during training
numIterations (int) – Number of iterations; LightGBM constructs num_class * num_iterations trees
numLeaves (int) – Number of leaves
numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
objective (object) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
parallelism (object) – Tree learner parallelism, can be set to data_parallel or voting_parallel
posBaggingFraction (float) – Positive Bagging fraction
predictionCol (object) – prediction column name
repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.
skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration
slotNames (list) – List of slot names in the features column
timeout (float) – Timeout in seconds
topK (int) – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down training. It should be greater than 0
uniformDrop (bool) – Set this to true to use uniform drop in dart mode
useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.
useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running Spark in local mode.
validationIndicatorCol (object) – Indicates whether the row is for training or validation
verbosity (int) – Verbosity, where < 0 is Fatal, 0 is Error, 1 is Info, and > 1 is Debug
weightCol (object) – The name of the weight column
xgboostDartMode (bool) – Set this to true to use xgboost dart mode
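A minimal ranking sketch using the parameters above; train_df, its query_id grouping column, and the evalAt cutoffs are assumptions for illustration:

    # Hedged example: LambdaRank training where rows sharing "query_id"
    # form one ranking group and "label" holds graded relevance.
    from synapse.ml.lightgbm import LightGBMRanker

    ranker = (LightGBMRanker()
              .setObjective("lambdarank")
              .setGroupCol("query_id")
              .setLabelCol("label")
              .setFeaturesCol("features")
              .setEvalAt([1, 3, 5]))   # NDCG/MAP evaluation positions
    model = ranker.fit(train_df)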
- baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
- baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
- baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
- binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
- boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
- boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
- categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
- categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
- chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.')
- defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
- driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
- dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
- earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
- evalAt = Param(parent='undefined', name='evalAt', doc='NDCG and MAP evaluation positions, separated by comma')
- featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
- featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
- fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
- getBinSampleCount()[source]
- Returns
Number of samples considered at computing histogram bins
- Return type
binSampleCount
- getBoostFromAverage()[source]
- Returns
Adjusts initial score to the mean of labels for faster convergence
- Return type
boostFromAverage
- getBoostingType()[source]
- Returns
Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- Return type
boostingType
- getCategoricalSlotIndexes()[source]
- Returns
List of categorical column indexes, the slot index in the features column
- Return type
categoricalSlotIndexes
- getCategoricalSlotNames()[source]
- Returns
List of categorical column slot names, the slot name in the features column
- Return type
categoricalSlotNames
- getChunkSize()[source]
- Returns
Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- Return type
chunkSize
- getDefaultListenPort()[source]
- Returns
The default listen port on executors, used for testing
- Return type
defaultListenPort
- getDriverListenPort()[source]
- Returns
The listen port on a driver. Default value is 0 (random)
- Return type
driverListenPort
- getDropRate()[source]
- Returns
Dropout rate: a fraction of previous trees to drop during the dropout
- Return type
dropRate
- getEvalAt()[source]
- Returns
NDCG and MAP evaluation positions, separated by commas
- Return type
evalAt
- getFeaturesShapCol()[source]
- Returns
Output SHAP vector column name after prediction containing the feature contribution values
- Return type
featuresShapCol
- getFobj()[source]
- Returns
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- Return type
fobj
- getImprovementTolerance()[source]
- Returns
Tolerance to consider improvement in metric
- Return type
improvementTolerance
- getInitScoreCol()[source]
- Returns
The name of the initial score column, used for continued training
- Return type
initScoreCol
- getIsProvideTrainingMetric()[source]
- Returns
Whether to output the metric result over the training dataset.
- Return type
isProvideTrainingMetric
- getLeafPredictionCol()[source]
- Returns
Predicted leaf indices' column name
- Return type
leafPredictionCol
- getMatrixType()[source]
- Returns
Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- Return type
matrixType
- getMaxBinByFeature()[source]
- Returns
Max number of bins for each feature
- Return type
maxBinByFeature
- getMaxDeltaStep()[source]
- Returns
Used to limit the max output of tree leaves
- Return type
maxDeltaStep
- getMaxDrop()[source]
- Returns
Max number of dropped trees during one boosting iteration
- Return type
maxDrop
- getMetric()[source]
- Returns
Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- Return type
metric
- getMinDataInLeaf()[source]
- Returns
Minimal number of data in one leaf. Can be used to deal with over-fitting.
- Return type
minDataInLeaf
- getMinSumHessianInLeaf()[source]
- Returns
Minimal sum hessian in one leaf
- Return type
minSumHessianInLeaf
- getNumBatches()[source]
- Returns
If greater than 0, splits data into separate batches during training
- Return type
numBatches
- getNumIterations()[source]
- Returns
Number of iterations, LightGBM constructs num_class * num_iterations trees
- Return type
numIterations
- getNumTasks()[source]
- Returns
Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- Return type
numTasks
- getNumThreads()[source]
- Returns
Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- Return type
numThreads
- getObjective()[source]
- Returns
The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- Return type
objective
- getParallelism()[source]
- Returns
Tree learner parallelism, can be set to data_parallel or voting_parallel
- Return type
parallelism
- getRepartitionByGroupingColumn()[source]
- Returns
Repartition training data according to grouping column, on by default.
- Return type
repartitionByGroupingColumn
- getSkipDrop()[source]
- Returns
Probability of skipping the dropout procedure during a boosting iteration
- Return type
skipDrop
- getTopK()[source]
- Returns
The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0
- Return type
topK
- getUniformDrop()[source]
- Returns
Set this to true to use uniform drop in dart mode
- Return type
uniformDrop
- getUseBarrierExecutionMode()[source]
- Returns
Barrier execution mode which uses a barrier stage, off by default.
- Return type
useBarrierExecutionMode
- getUseSingleDatasetMode()[source]
- Returns
Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running spark in local mode.
- Return type
useSingleDatasetMode
- getValidationIndicatorCol()[source]
- Returns
Indicates whether the row is for training or validation
- Return type
validationIndicatorCol
- getVerbosity()[source]
- Returns
Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
- Return type
verbosity
- getXgboostDartMode()[source]
- Returns
Set this to true to use xgboost dart mode
- Return type
xgboostDartMode
- groupCol = Param(parent='undefined', name='groupCol', doc='The name of the group column')
- improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
- initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
- isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether to output the metric result over the training dataset.')
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
- labelGain = Param(parent='undefined', name='labelGain', doc='graded relevance for each label in NDCG')
- lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
- lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
- leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices' column name")
- learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
- matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.')
- maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
- maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
- maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
- maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
- maxPosition = Param(parent='undefined', name='maxPosition', doc='optimized NDCG at this position')
- metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.')
- minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data points in one leaf. Can be used to deal with over-fitting.')
- minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
- minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum of the Hessian in one leaf')
- modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
- negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
- numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
- numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations; LightGBM constructs num_class * num_iterations trees')
- numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
- numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.')
- numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.')
- objective = Param(parent='undefined', name='objective', doc='The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.')
- parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
- posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
- predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
- repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to the grouping column; on by default.')
- setBinSampleCount(value)[source]
- Parameters
binSampleCount – Number of samples considered at computing histogram bins
- setBoostFromAverage(value)[source]
- Parameters
boostFromAverage – Adjusts initial score to the mean of labels for faster convergence
- setBoostingType(value)[source]
- Parameters
boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- setCategoricalSlotIndexes(value)[source]
- Parameters
categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column
- setCategoricalSlotNames(value)[source]
- Parameters
categoricalSlotNames – List of categorical column slot names, the slot name in the features column
- setChunkSize(value)[source]
- Parameters
chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- setDefaultListenPort(value)[source]
- Parameters
defaultListenPort – The default listen port on executors, used for testing
- setDriverListenPort(value)[source]
- Parameters
driverListenPort – The listen port on a driver. Default value is 0 (random)
- setDropRate(value)[source]
- Parameters
dropRate – Dropout rate: a fraction of previous trees to drop during the dropout
- setFeaturesShapCol(value)[source]
- Parameters
featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values
- setFobj(value)[source]
- Parameters
fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- setImprovementTolerance(value)[source]
- Parameters
improvementTolerance – Tolerance to consider improvement in metric
- setInitScoreCol(value)[source]
- Parameters
initScoreCol – The name of the initial score column, used for continued training
- setIsProvideTrainingMetric(value)[source]
- Parameters
isProvideTrainingMetric – Whether to output metric results over the training dataset.
- setLeafPredictionCol(value)[source]
- Parameters
leafPredictionCol – Predicted leaf indices' column name
- setMatrixType(value)[source]
- Parameters
matrixType – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.
- setMaxDeltaStep(value)[source]
- Parameters
maxDeltaStep – Used to limit the max output of tree leaves
- setMaxDrop(value)[source]
- Parameters
maxDrop – Max number of dropped trees during one boosting iteration
- setMetric(value)[source]
- Parameters
metric – Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- setMinDataInLeaf(value)[source]
- Parameters
minDataInLeaf – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
- setMinSumHessianInLeaf(value)[source]
- Parameters
minSumHessianInLeaf – Minimal sum of the Hessian in one leaf
- setNumBatches(value)[source]
- Parameters
numBatches – If greater than 0, splits data into separate batches during training
- setNumIterations(value)[source]
- Parameters
numIterations – Number of iterations; LightGBM constructs num_class * num_iterations trees
- setNumTasks(value)[source]
- Parameters
numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.
- setNumThreads(value)[source]
- Parameters
numThreads – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- setObjective(value)[source]
- Parameters
objective – The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- setParallelism(value)[source]
- Parameters
parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel
- setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxPosition=20, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Set the (keyword-only) parameters.
- setRepartitionByGroupingColumn(value)[source]
- Parameters
repartitionByGroupingColumn – Repartition training data according to the grouping column; on by default.
- setSkipDrop(value)[source]
- Parameters
skipDrop – Probability of skipping the dropout procedure during a boosting iteration
- setTopK(value)[source]
- Parameters
topK – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.
- setUniformDrop(value)[source]
- Parameters
uniformDrop – Set this to true to use uniform drop in dart mode
- setUseBarrierExecutionMode(value)[source]
- Parameters
useBarrierExecutionMode – Barrier execution mode, which uses a barrier stage; off by default.
- setUseSingleDatasetMode(value)[source]
- Parameters
useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.
- setValidationIndicatorCol(value)[source]
- Parameters
validationIndicatorCol – Indicates whether the row is for training or validation
- setVerbosity(value)[source]
- Parameters
verbosity – Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug
- setXgboostDartMode(value)[source]
- Parameters
xgboostDartMode – Set this to true to use xgboost dart mode
- skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
- slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
- timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
- topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.')
- uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
- useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode, which uses a barrier stage; off by default.')
- useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.')
- validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
- verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
- xgboostDartMode = Param(parent='undefined', name='xgboostDartMode', doc='Set this to true to use xgboost dart mode')
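For orientation, the sketch below trains and applies a ranker end to end. It is a minimal example under stated assumptions: train_df and test_df are hypothetical DataFrames whose features column is a Spark ML vector, label holds integer relevance grades, and query identifies the group each row belongs to; the setters follow the uniform get/set pattern documented above.

    from synapse.ml.lightgbm import LightGBMRanker

    # Hypothetical input: 'features' (vector), 'label' (relevance), 'query' (group id).
    ranker = (LightGBMRanker()
              .setLabelCol("label")
              .setFeaturesCol("features")
              .setGroupCol("query")
              .setObjective("lambdarank")
              .setEvalAt([1, 3, 5])       # NDCG evaluation positions
              .setNumIterations(100)
              .setNumLeaves(31))

    ranker_model = ranker.fit(train_df)       # distributed training
    scored = ranker_model.transform(test_df)  # adds a 'prediction' column

Because repartitionByGroupingColumn is on by default, rows that share a query value are kept together during distributed training.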
synapse.ml.lightgbm.LightGBMRankerModel module
- class synapse.ml.lightgbm.LightGBMRankerModel.LightGBMRankerModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictionCol='prediction', startIteration=0)[source]
Bases:
synapse.ml.lightgbm.mixin.LightGBMModelMixin
,synapse.ml.lightgbm._LightGBMRankerModel._LightGBMRankerModel
synapse.ml.lightgbm.LightGBMRegressionModel module
- class synapse.ml.lightgbm.LightGBMRegressionModel.LightGBMRegressionModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictionCol='prediction', startIteration=0)[source]
Bases:
synapse.ml.lightgbm.mixin.LightGBMModelMixin
,synapse.ml.lightgbm._LightGBMRegressionModel._LightGBMRegressionModel
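A minimal save/load sketch for the model classes above, assuming the saveNativeModel and loadNativeModelFromFile helpers that SynapseML's LightGBM models expose (they are not listed in this excerpt); the path and DataFrames are placeholders:

    from synapse.ml.lightgbm import LightGBMRegressionModel

    # 'model' is assumed to be an already-trained LightGBMRegressionModel.
    model.saveNativeModel("/tmp/lgbm_regressor")  # hypothetical output path

    reloaded = LightGBMRegressionModel.loadNativeModelFromFile("/tmp/lgbm_regressor")
    predictions = reloaded.transform(test_df)     # test_df is a placeholder DataFrame

The saved representation is a plain LightGBM model, so a model exported this way can also be read by non-Spark LightGBM tooling.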
synapse.ml.lightgbm.LightGBMRegressor module
- class synapse.ml.lightgbm.LightGBMRegressor.LightGBMRegressor(java_obj=None, alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Bases:
synapse.ml.core.schema.Utils.ComplexParamsMixin
,pyspark.ml.util.JavaMLReadable
,pyspark.ml.util.JavaMLWritable
,pyspark.ml.wrapper.JavaEstimator
- Parameters
alpha (float) – parameter for Huber loss and Quantile regression
baggingFraction (float) – Bagging fraction
baggingFreq (int) – Bagging frequency
baggingSeed (int) – Bagging seed
binSampleCount (int) – Number of samples considered at computing histogram bins
boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence
boostingType (object) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column
categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column
chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
defaultListenPort (int) – The default listen port on executors, used for testing
driverListenPort (int) – The listen port on a driver. Default value is 0 (random)
dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout
earlyStoppingRound (int) – Early stopping round
featureFraction (float) – Feature fraction
featuresCol (object) – features column name
featuresShapCol (object) – Output SHAP vector column name after prediction containing the feature contribution values
fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
improvementTolerance (float) – Tolerance to consider improvement in metric
initScoreCol (object) – The name of the initial score column, used for continued training
isProvideTrainingMetric (bool) – Whether to output metric results over the training dataset.
labelCol (object) – label column name
lambdaL1 (float) – L1 regularization
lambdaL2 (float) – L2 regularization
leafPredictionCol (object) – Predicted leaf indices' column name
learningRate (float) – Learning rate or shrinkage rate
matrixType (object) – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.
maxBin (int) – Max bin
maxBinByFeature (list) – Max number of bins for each feature
maxDeltaStep (float) – Used to limit the max output of tree leaves
maxDepth (int) – Max depth
maxDrop (int) – Max number of dropped trees during one boosting iteration
metric (object) – Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
minDataInLeaf (int) – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
minGainToSplit (float) – The minimal gain to perform split
minSumHessianInLeaf (float) – Minimal sum of the Hessian in one leaf
modelString (object) – LightGBM model to retrain
negBaggingFraction (float) – Negative Bagging fraction
numBatches (int) – If greater than 0, splits data into separate batches during training
numIterations (int) – Number of iterations; LightGBM constructs num_class * num_iterations trees
numLeaves (int) – Number of leaves
numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.
numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
objective (object) – The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
parallelism (object) – Tree learner parallelism, can be set to data_parallel or voting_parallel
posBaggingFraction (float) – Positive Bagging fraction
predictionCol (object) – prediction column name
repartitionByGroupingColumn (bool) – Repartition training data according to the grouping column; on by default.
skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration
slotNames (list) – List of slot names in the features column
timeout (float) – Timeout in seconds
topK (int) – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.
tweedieVariancePower (float) – Controls the variance of the Tweedie distribution; must be between 1 and 2
uniformDrop (bool) – Set this to true to use uniform drop in dart mode
useBarrierExecutionMode (bool) – Barrier execution mode, which uses a barrier stage; off by default.
useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.
validationIndicatorCol (object) – Indicates whether the row is for training or validation
verbosity (int) – Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug
weightCol (object) – The name of the weight column
xgboostDartMode (bool) – Set this to true to use xgboost dart mode
- alpha = Param(parent='undefined', name='alpha', doc='parameter for Huber loss and Quantile regression')
- baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
- baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
- baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
- binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
- boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
- boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
- categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
- categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
- chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.')
- defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
- driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
- dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
- earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
- featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
- featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
- fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
- getBinSampleCount()[source]
- Returns
Number of samples considered at computing histogram bins
- Return type
binSampleCount
- getBoostFromAverage()[source]
- Returns
Adjusts initial score to the mean of labels for faster convergence
- Return type
boostFromAverage
- getBoostingType()[source]
- Returns
Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- Return type
boostingType
- getCategoricalSlotIndexes()[source]
- Returns
List of categorical column indexes, the slot index in the features column
- Return type
categoricalSlotIndexes
- getCategoricalSlotNames()[source]
- Returns
List of categorical column slot names, the slot name in the features column
- Return type
categoricalSlotNames
- getChunkSize()[source]
- Returns
Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- Return type
chunkSize
- getDefaultListenPort()[source]
- Returns
The default listen port on executors, used for testing
- Return type
defaultListenPort
- getDriverListenPort()[source]
- Returns
The listen port on a driver. Default value is 0 (random)
- Return type
driverListenPort
- getDropRate()[source]
- Returns
Dropout rate: a fraction of previous trees to drop during the dropout
- Return type
dropRate
- getFeaturesShapCol()[source]
- Returns
Output SHAP vector column name after prediction containing the feature contribution values
- Return type
featuresShapCol
- getFobj()[source]
- Returns
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- Return type
fobj
- getImprovementTolerance()[source]
- Returns
Tolerance to consider improvement in metric
- Return type
improvementTolerance
- getInitScoreCol()[source]
- Returns
The name of the initial score column, used for continued training
- Return type
initScoreCol
- getIsProvideTrainingMetric()[source]
- Returns
Whether to output metric results over the training dataset.
- Return type
isProvideTrainingMetric
- getLeafPredictionCol()[source]
- Returns
Predicted leaf indices' column name
- Return type
leafPredictionCol
- getMatrixType()[source]
- Returns
Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.
- Return type
matrixType
- getMaxBinByFeature()[source]
- Returns
Max number of bins for each feature
- Return type
maxBinByFeature
- getMaxDeltaStep()[source]
- Returns
Used to limit the max output of tree leaves
- Return type
maxDeltaStep
- getMaxDrop()[source]
- Returns
Max number of dropped trees during one boosting iteration
- Return type
maxDrop
- getMetric()[source]
- Returns
Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- Return type
metric
- getMinDataInLeaf()[source]
- Returns
Minimal number of data points in one leaf. Can be used to deal with over-fitting.
- Return type
minDataInLeaf
- getMinSumHessianInLeaf()[source]
- Returns
Minimal sum of the Hessian in one leaf
- Return type
minSumHessianInLeaf
- getNumBatches()[source]
- Returns
If greater than 0, splits data into separate batches during training
- Return type
numBatches
- getNumIterations()[source]
- Returns
Number of iterations; LightGBM constructs num_class * num_iterations trees
- Return type
numIterations
- getNumTasks()[source]
- Returns
Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.
- Return type
numTasks
- getNumThreads()[source]
- Returns
Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- Return type
numThreads
- getObjective()[source]
- Returns
The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- Return type
objective
- getParallelism()[source]
- Returns
Tree learner parallelism, can be set to data_parallel or voting_parallel
- Return type
parallelism
- getRepartitionByGroupingColumn()[source]
- Returns
Repartition training data according to the grouping column; on by default.
- Return type
repartitionByGroupingColumn
- getSkipDrop()[source]
- Returns
Probability of skipping the dropout procedure during a boosting iteration
- Return type
skipDrop
- getTopK()[source]
- Returns
The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.
- Return type
topK
- getTweedieVariancePower()[source]
- Returns
Controls the variance of the Tweedie distribution; must be between 1 and 2
- Return type
tweedieVariancePower
- getUniformDrop()[source]
- Returns
Set this to true to use uniform drop in dart mode
- Return type
uniformDrop
- getUseBarrierExecutionMode()[source]
- Returns
Barrier execution mode, which uses a barrier stage; off by default.
- Return type
useBarrierExecutionMode
- getUseSingleDatasetMode()[source]
- Returns
Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.
- Return type
useSingleDatasetMode
- getValidationIndicatorCol()[source]
- Returns
Indicates whether the row is for training or validation
- Return type
validationIndicatorCol
- getVerbosity()[source]
- Returns
Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug
- Return type
verbosity
- getXgboostDartMode()[source]
- Returns
Set this to true to use xgboost dart mode
- Return type
xgboostDartMode
- improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
- initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
- isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether to output metric results over the training dataset.')
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
- lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
- lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
- leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices' column name")
- learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
- matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.')
- maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
- maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
- maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
- maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
- metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.')
- minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data points in one leaf. Can be used to deal with over-fitting.')
- minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
- minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum of the Hessian in one leaf')
- modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
- negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
- numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
- numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations; LightGBM constructs num_class * num_iterations trees')
- numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
- numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.')
- numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.')
- objective = Param(parent='undefined', name='objective', doc='The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.')
- parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
- posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
- predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
- repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to the grouping column; on by default.')
- setBinSampleCount(value)[source]
- Parameters
binSampleCount – Number of samples considered at computing histogram bins
- setBoostFromAverage(value)[source]
- Parameters
boostFromAverage – Adjusts initial score to the mean of labels for faster convergence
- setBoostingType(value)[source]
- Parameters
boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- setCategoricalSlotIndexes(value)[source]
- Parameters
categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column
- setCategoricalSlotNames(value)[source]
- Parameters
categoricalSlotNames – List of categorical column slot names, the slot name in the features column
- setChunkSize(value)[source]
- Parameters
chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- setDefaultListenPort(value)[source]
- Parameters
defaultListenPort – The default listen port on executors, used for testing
- setDriverListenPort(value)[source]
- Parameters
driverListenPort – The listen port on a driver. Default value is 0 (random)
- setDropRate(value)[source]
- Parameters
dropRate – Dropout rate: a fraction of previous trees to drop during the dropout
- setFeaturesShapCol(value)[source]
- Parameters
featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values
- setFobj(value)[source]
- Parameters
fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- setImprovementTolerance(value)[source]
- Parameters
improvementTolerance – Tolerance to consider improvement in metric
- setInitScoreCol(value)[source]
- Parameters
initScoreCol – The name of the initial score column, used for continued training
- setIsProvideTrainingMetric(value)[source]
- Parameters
isProvideTrainingMetric – Whether to output metric results over the training dataset.
- setLeafPredictionCol(value)[source]
- Parameters
leafPredictionCol – Predicted leaf indices' column name
- setMatrixType(value)[source]
- Parameters
matrixType – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.
- setMaxDeltaStep(value)[source]
- Parameters
maxDeltaStep – Used to limit the max output of tree leaves
- setMaxDrop(value)[source]
- Parameters
maxDrop – Max number of dropped trees during one boosting iteration
- setMetric(value)[source]
- Parameters
metric – Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- setMinDataInLeaf(value)[source]
- Parameters
minDataInLeaf – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
- setMinSumHessianInLeaf(value)[source]
- Parameters
minSumHessianInLeaf – Minimal sum of the Hessian in one leaf
- setNumBatches(value)[source]
- Parameters
numBatches – If greater than 0, splits data into separate batches during training
- setNumIterations(value)[source]
- Parameters
numIterations – Number of iterations; LightGBM constructs num_class * num_iterations trees
- setNumTasks(value)[source]
- Parameters
numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.
- setNumThreads(value)[source]
- Parameters
numThreads – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- setObjective(value)[source]
- Parameters
objective – The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- setParallelism(value)[source]
- Parameters
parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel
- setParams(alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Set the (keyword-only) parameters.
- setRepartitionByGroupingColumn(value)[source]
- Parameters
repartitionByGroupingColumn – Repartition training data according to the grouping column; on by default.
- setSkipDrop(value)[source]
- Parameters
skipDrop – Probability of skipping the dropout procedure during a boosting iteration
- setTopK(value)[source]
- Parameters
topK – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.
- setTweedieVariancePower(value)[source]
- Parameters
tweedieVariancePower – Controls the variance of the Tweedie distribution; must be between 1 and 2
- setUniformDrop(value)[source]
- Parameters
uniformDrop – Set this to true to use uniform drop in dart mode
- setUseBarrierExecutionMode(value)[source]
- Parameters
useBarrierExecutionMode – Barrier execution mode, which uses a barrier stage; off by default.
- setUseSingleDatasetMode(value)[source]
- Parameters
useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.
- setValidationIndicatorCol(value)[source]
- Parameters
validationIndicatorCol – Indicates whether the row is for training or validation
- setVerbosity(value)[source]
- Parameters
verbosity – Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug
- setXgboostDartMode(value)[source]
- Parameters
xgboostDartMode – Set this to true to use xgboost dart mode
- skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
- slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
- timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
- topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.')
- tweedieVariancePower = Param(parent='undefined', name='tweedieVariancePower', doc='Controls the variance of the Tweedie distribution; must be between 1 and 2')
- uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
- useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode, which uses a barrier stage; off by default.')
- useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.')
- validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
- verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
- xgboostDartMode = Param(parent='undefined', name='xgboostDartMode', doc='Set this to true to use xgboost dart mode')
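To make the interplay of these parameters concrete, here is a minimal quantile-regression sketch with early stopping. It rests on assumptions: df is a hypothetical DataFrame with a vector features column, a numeric label column, and a boolean validation column marking held-out rows, and the setters follow the get/set pattern documented above.

    from synapse.ml.lightgbm import LightGBMRegressor

    regressor = (LightGBMRegressor()
                 .setObjective("quantile")
                 .setAlpha(0.3)          # quantile level; alpha matters only for huber/quantile
                 .setLearningRate(0.1)
                 .setNumIterations(200)
                 .setValidationIndicatorCol("validation")  # True rows are used for validation
                 .setEarlyStoppingRound(20))               # stop after 20 rounds without improvement

    model = regressor.fit(df)

Early stopping is evaluated on the rows flagged by validationIndicatorCol, using the metric implied by the objective unless metric is set explicitly.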
synapse.ml.lightgbm.mixin module
- class synapse.ml.lightgbm.mixin.LightGBMModelMixin[source]
Bases:
object
- getBoosterBestIteration()[source]
Get the best iteration from the booster.
- Returns
The best iteration, if early stopping was triggered.
- getBoosterNumFeatures()[source]
Get the number of features from the booster.
- Returns
The number of features.
- getBoosterNumTotalIterations()[source]
Get the total number of iterations trained.
- Returns
The total number of iterations trained.
- getBoosterNumTotalModel()[source]
Get the total number of models trained.
- Returns
The total number of models.
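For example, a model produced by any of the estimators above shares this mixin, so the trained booster can be inspected directly; model below is a placeholder for such a trained model:

    # Inspect the underlying booster of a trained SynapseML LightGBM model.
    print("best iteration:", model.getBoosterBestIteration())  # meaningful only if early stopping triggered
    print("num features:", model.getBoosterNumFeatures())
    print("total iterations:", model.getBoosterNumTotalIterations())
    print("total models:", model.getBoosterNumTotalModel())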
Module contents
SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.
SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.
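As a practical note, the package is typically attached to a session through Spark's package resolution. The Maven coordinate and version below are illustrative assumptions; substitute the SynapseML release that matches your Spark and Scala versions:

    from pyspark.sql import SparkSession

    # Illustrative coordinate for a Spark 3.x / Scala 2.12 environment.
    spark = (SparkSession.builder
             .appName("synapseml-lightgbm")
             .config("spark.jars.packages",
                     "com.microsoft.azure:synapseml_2.12:0.9.5")
             .getOrCreate())

    from synapse.ml.lightgbm import LightGBMClassifier  # now resolvable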