synapse.ml.lightgbm package
Submodules
synapse.ml.lightgbm.LightGBMClassificationModel module
- class synapse.ml.lightgbm.LightGBMClassificationModel.LightGBMClassificationModel(java_obj=None, actualNumClasses=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', startIteration=0, thresholds=None)[source]
Bases: synapse.ml.lightgbm.mixin.LightGBMModelMixin, synapse.ml.lightgbm._LightGBMClassificationModel._LightGBMClassificationModel
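As a usage illustration (a minimal sketch, not part of the generated reference): a fitted LightGBMClassificationModel is a standard Spark ML transformer, and the mixin also exposes native-model helpers such as saveNativeModel and loadNativeModelFromFile. The DataFrame test_df and the model path below are assumptions.

    # Minimal scoring sketch; `test_df` (with a vector column "features")
    # and the model file path are assumptions for illustration.
    from synapse.ml.lightgbm import LightGBMClassificationModel

    model = LightGBMClassificationModel.loadNativeModelFromFile("/tmp/lgbm_model.txt")
    scored = model.transform(test_df)  # adds rawPrediction, probability, prediction
    scored.select("prediction", "probability").show(5)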
synapse.ml.lightgbm.LightGBMClassifier module
- class synapse.ml.lightgbm.LightGBMClassifier.LightGBMClassifier(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator
- Parameters
baggingFraction (float) – Bagging fraction
baggingFreq (int) – Bagging frequency
baggingSeed (int) – Bagging seed
binSampleCount (int) – Number of samples considered at computing histogram bins
boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence
boostingType (object) – Boosting type; the default, gbdt, is the traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column
categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column
chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
defaultListenPort (int) – The default listen port on executors, used for testing
driverListenPort (int) – The listen port on a driver. Default value is 0 (random)
dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout
earlyStoppingRound (int) – Early stopping round
featureFraction (float) – Feature fraction
featuresCol (object) – features column name
featuresShapCol (object) – Output SHAP vector column name after prediction containing the feature contribution values
fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
improvementTolerance (float) – Tolerance to consider improvement in metric
initScoreCol (object) – The name of the initial score column, used for continued training
isProvideTrainingMetric (bool) – Whether to output the metric result over the training dataset.
isUnbalance (bool) – Set to true if training data is unbalanced in a binary classification scenario
labelCol (object) – label column name
lambdaL1 (float) – L1 regularization
lambdaL2 (float) – L2 regularization
leafPredictionCol (object) – Predicted leaf indices' column name
learningRate (float) – Learning rate or shrinkage rate
matrixType (object) – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. The default value is auto, which samples the first ten rows to determine the type.
maxBin (int) – Max bin
maxBinByFeature (list) – Max number of bins for each feature
maxDeltaStep (float) – Used to limit the max output of tree leaves
maxDepth (int) – Max depth
maxDrop (int) – Max number of dropped trees during one boosting iteration
metric (object) – Metrics to be evaluated on the evaluation data. Options are:
empty string or not specified – the metric corresponding to the specified objective is used (possible only for pre-defined objective functions; otherwise no evaluation metric is added)
None (the string, not a None value) – no metric is registered; aliases: na, null, custom
l1 – absolute loss; aliases: mean_absolute_error, mae, regression_l1
l2 – square loss; aliases: mean_squared_error, mse, regression_l2, regression
rmse – root square loss; aliases: root_mean_squared_error, l2_root
quantile – Quantile regression
mape – MAPE loss; aliases: mean_absolute_percentage_error
huber – Huber loss
fair – Fair loss
poisson – negative log-likelihood for Poisson regression
gamma – negative log-likelihood for Gamma regression
gamma_deviance – residual deviance for Gamma regression
tweedie – negative log-likelihood for Tweedie regression
ndcg – NDCG; aliases: lambdarank
map – MAP; aliases: mean_average_precision
auc – AUC
binary_logloss – log loss; aliases: binary
binary_error – for one sample: 0 for correct classification, 1 for incorrect classification
multi_logloss – log loss for multi-class classification; aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr
multi_error – error rate for multi-class classification
cross_entropy – cross-entropy (with optional linear weights); aliases: xentropy
cross_entropy_lambda – intensity-weighted cross-entropy; aliases: xentlambda
kullback_leibler – Kullback-Leibler divergence; aliases: kldiv
minDataInLeaf (int) – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
minGainToSplit (float) – The minimal gain to perform split
minSumHessianInLeaf (float) – Minimal sum of the Hessian in one leaf
modelString (object) – LightGBM model to retrain
negBaggingFraction (float) – Negative Bagging fraction
numBatches (int) – If greater than 0, splits data into separate batches during training
numIterations (int) – Number of iterations; LightGBM constructs num_class * num_iterations trees
numLeaves (int) – Number of leaves
numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
objective (object) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
parallelism (object) – Tree learner parallelism, can be set to data_parallel or voting_parallel
posBaggingFraction (float) – Positive Bagging fraction
predictionCol (object) – prediction column name
probabilityCol (object) – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
rawPredictionCol (object) – raw prediction (a.k.a. confidence) column name
repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.
skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration
slotNames (list) – List of slot names in the features column
thresholds (list) – Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
timeout (float) – Timeout in seconds
topK (int) – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down training. It should be greater than 0
uniformDrop (bool) – Set this to true to use uniform drop in dart mode
useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.
useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running Spark in local mode.
validationIndicatorCol (object) – Indicates whether the row is for training or validation
verbosity (int) – Verbosity, where < 0 is Fatal, 0 is Error, 1 is Info, and > 1 is Debug
weightCol (object) – The name of the weight column
xgboostDartMode (bool) – Set this to true to use xgboost dart mode
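For orientation, a minimal training sketch using the setters documented below; the DataFrame train_df and its column names are assumptions, not part of this reference:

    # Hedged example: fit a binary classifier on a Spark DataFrame that has
    # a vector column "features" and a numeric column "label".
    from synapse.ml.lightgbm import LightGBMClassifier

    classifier = (LightGBMClassifier()
                  .setObjective("binary")
                  .setFeaturesCol("features")
                  .setLabelCol("label")
                  .setNumIterations(100)
                  .setLearningRate(0.1))
    model = classifier.fit(train_df)  # returns a LightGBMClassificationModel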
- baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
- baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
- baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
- binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
- boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
- boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
- categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
- categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
- chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.')
- defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
- driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
- dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
- earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
- featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
- featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
- fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
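The fobj contract above (accept preds and train_data, return (grad, hess)) can be illustrated with a sketch. The binary-logloss gradient below and the train_data.get_label() accessor follow LightGBM's usual custom-objective convention; they are assumptions for illustration, not part of this reference.

    # Hypothetical custom objective matching the documented
    # (preds, train_data) -> (grad, hess) contract, shown for binary log loss.
    import numpy as np

    def binary_logloss_obj(preds, train_data):
        labels = train_data.get_label()        # assumed LightGBM Dataset accessor
        probs = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw scores
        grad = probs - labels                  # first derivative of the loss
        hess = probs * (1.0 - probs)           # second derivative of the loss
        return grad, hess

    # classifier.setFobj(binary_logloss_obj)   # wired in via the setter below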
- getBinSampleCount()[source]
- Returns
Number of samples considered at computing histogram bins
- Return type
binSampleCount
- getBoostFromAverage()[source]
- Returns
Adjusts initial score to the mean of labels for faster convergence
- Return type
boostFromAverage
- getBoostingType()[source]
- Returns
Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- Return type
boostingType
- getCategoricalSlotIndexes()[source]
- Returns
List of categorical column indexes, the slot index in the features column
- Return type
categoricalSlotIndexes
- getCategoricalSlotNames()[source]
- Returns
List of categorical column slot names, the slot name in the features column
- Return type
categoricalSlotNames
- getChunkSize()[source]
- Returns
Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- Return type
chunkSize
- getDefaultListenPort()[source]
- Returns
The default listen port on executors, used for testing
- Return type
defaultListenPort
- getDriverListenPort()[source]
- Returns
The listen port on a driver. Default value is 0 (random)
- Return type
driverListenPort
- getDropRate()[source]
- Returns
Dropout rate: a fraction of previous trees to drop during the dropout
- Return type
dropRate
- getFeaturesShapCol()[source]
- Returns
Output SHAP vector column name after prediction containing the feature contribution values
- Return type
featuresShapCol
- getFobj()[source]
- Returns
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- Return type
fobj
- getImprovementTolerance()[source]
- Returns
Tolerance to consider improvement in metric
- Return type
improvementTolerance
- getInitScoreCol()[source]
- Returns
The name of the initial score column, used for continued training
- Return type
initScoreCol
- getIsProvideTrainingMetric()[source]
- Returns
Whether to output the metric result over the training dataset.
- Return type
isProvideTrainingMetric
- getIsUnbalance()[source]
- Returns
Set to true if training data is unbalanced in binary classification scenario
- Return type
isUnbalance
- getLeafPredictionCol()[source]
- Returns
Predicted leaf indices' column name
- Return type
leafPredictionCol
- getMatrixType()[source]
- Returns
Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- Return type
matrixType
- getMaxBinByFeature()[source]
- Returns
Max number of bins for each feature
- Return type
maxBinByFeature
- getMaxDeltaStep()[source]
- Returns
Used to limit the max output of tree leaves
- Return type
maxDeltaStep
- getMaxDrop()[source]
- Returns
Max number of dropped trees during one boosting iteration
- Return type
maxDrop
- getMetric()[source]
- Returns
Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- Return type
metric
- getMinDataInLeaf()[source]
- Returns
Minimal number of data in one leaf. Can be used to deal with over-fitting.
- Return type
minDataInLeaf
- getMinSumHessianInLeaf()[source]
- Returns
Minimal sum hessian in one leaf
- Return type
minSumHessianInLeaf
- getNumBatches()[source]
- Returns
If greater than 0, splits data into separate batches during training
- Return type
numBatches
- getNumIterations()[source]
- Returns
Number of iterations, LightGBM constructs num_class * num_iterations trees
- Return type
numIterations
- getNumTasks()[source]
- Returns
Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- Return type
numTasks
- getNumThreads()[source]
- Returns
Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- Return type
numThreads
- getObjective()[source]
- Returns
The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- Return type
objective
- getParallelism()[source]
- Returns
Tree learner parallelism, can be set to data_parallel or voting_parallel
- Return type
parallelism
- getProbabilityCol()[source]
- Returns
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- Return type
probabilityCol
- getRawPredictionCol()[source]
- Returns
raw prediction (a.k.a. confidence) column name
- Return type
rawPredictionCol
- getRepartitionByGroupingColumn()[source]
- Returns
Repartition training data according to grouping column, on by default.
- Return type
repartitionByGroupingColumn
- getSkipDrop()[source]
- Returns
Probability of skipping the dropout procedure during a boosting iteration
- Return type
skipDrop
- getThresholds()[source]
- Returns
Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
- Return type
thresholds
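To make the p/t rule above concrete, a small worked example (all numbers invented):

    # With thresholds t = [1.0, 0.5, 0.5] and probabilities p = [0.5, 0.3, 0.2],
    # the predicted class maximizes p/t, so class 1 wins despite a lower
    # raw probability than class 0.
    probs = [0.5, 0.3, 0.2]
    thresholds = [1.0, 0.5, 0.5]
    ratios = [p / t for p, t in zip(probs, thresholds)]  # [0.5, 0.6, 0.4]
    prediction = ratios.index(max(ratios))               # 1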
- getTopK()[source]
- Returns
The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0
- Return type
topK
- getUniformDrop()[source]
- Returns
Set this to true to use uniform drop in dart mode
- Return type
uniformDrop
- getUseBarrierExecutionMode()[source]
- Returns
Barrier execution mode which uses a barrier stage, off by default.
- Return type
useBarrierExecutionMode
- getUseSingleDatasetMode()[source]
- Returns
Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running spark in local mode.
- Return type
useSingleDatasetMode
- getValidationIndicatorCol()[source]
- Returns
Indicates whether the row is for training or validation
- Return type
validationIndicatorCol
- getVerbosity()[source]
- Returns
Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
- Return type
verbosity
- getXgboostDartMode()[source]
- Returns
Set this to true to use xgboost dart mode
- Return type
xgboostDartMode
- improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
- initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
- isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether to output the metric result over the training dataset.')
- isUnbalance = Param(parent='undefined', name='isUnbalance', doc='Set to true if training data is unbalanced in binary classification scenario')
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
- lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
- lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
- leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices' column name")
- learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
- matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
- maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
- maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
- maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
- maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
- metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
- minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
- minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
- minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
- modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
- negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
- numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
- numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
- numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
- numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
- numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.')
- objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
- parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
- posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
- predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
- probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities')
- rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')
- repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
- setBinSampleCount(value)[source]
- Parameters
binSampleCount – Number of samples considered at computing histogram bins
- setBoostFromAverage(value)[source]
- Parameters
boostFromAverage – Adjusts initial score to the mean of labels for faster convergence
- setBoostingType(value)[source]
- Parameters
boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- setCategoricalSlotIndexes(value)[source]
- Parameters
categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column
- setCategoricalSlotNames(value)[source]
- Parameters
categoricalSlotNames – List of categorical column slot names, the slot name in the features column
- setChunkSize(value)[source]
- Parameters
chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- setDefaultListenPort(value)[source]
- Parameters
defaultListenPort – The default listen port on executors, used for testing
- setDriverListenPort(value)[source]
- Parameters
driverListenPort – The listen port on a driver. Default value is 0 (random)
- setDropRate(value)[source]
- Parameters
dropRate – Dropout rate: a fraction of previous trees to drop during the dropout
- setFeaturesShapCol(value)[source]
- Parameters
featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values
- setFobj(value)[source]
- Parameters
fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- setImprovementTolerance(value)[source]
- Parameters
improvementTolerance – Tolerance to consider improvement in metric
- setInitScoreCol(value)[source]
- Parameters
initScoreCol – The name of the initial score column, used for continued training
- setIsProvideTrainingMetric(value)[source]
- Parameters
isProvideTrainingMetric – Whether to output the metric result over the training dataset.
- setIsUnbalance(value)[source]
- Parameters
isUnbalance – Set to true if training data is unbalanced in binary classification scenario
- setLeafPredictionCol(value)[source]
- Parameters
leafPredictionCol – Predicted leaf indices' column name
- setMatrixType(value)[source]
- Parameters
matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- setMaxDeltaStep(value)[source]
- Parameters
maxDeltaStep – Used to limit the max output of tree leaves
- setMaxDrop(value)[source]
- Parameters
maxDrop – Max number of dropped trees during one boosting iteration
- setMetric(value)[source]
- Parameters
metric – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- setMinDataInLeaf(value)[source]
- Parameters
minDataInLeaf – Minimal number of data in one leaf. Can be used to deal with over-fitting.
- setMinSumHessianInLeaf(value)[source]
- Parameters
minSumHessianInLeaf – Minimal sum hessian in one leaf
- setNumBatches(value)[source]
- Parameters
numBatches – If greater than 0, splits data into separate batches during training
- setNumIterations(value)[source]
- Parameters
numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
- setNumTasks(value)[source]
- Parameters
numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- setNumThreads(value)[source]
- Parameters
numThreads – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- setObjective(value)[source]
- Parameters
objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- setParallelism(value)[source]
- Parameters
parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel
- setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Set the (keyword-only) parameters
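Since setParams is keyword-only, re-configuring an existing estimator looks like this (the parameter values are illustrative, not recommendations):

    # Hedged sketch: tune a few of the keyword-only parameters in one call
    # on the `classifier` instance from the earlier example.
    classifier.setParams(numIterations=200, learningRate=0.05, earlyStoppingRound=10)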
- setProbabilityCol(value)[source]
- Parameters
probabilityCol – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- setRawPredictionCol(value)[source]
- Parameters
rawPredictionCol – raw prediction (a.k.a. confidence) column name
- setRepartitionByGroupingColumn(value)[source]
- Parameters
repartitionByGroupingColumn – Repartition training data according to grouping column, on by default.
- setSkipDrop(value)[source]
- Parameters
skipDrop – Probability of skipping the dropout procedure during a boosting iteration
- setThresholds(value)[source]
- Parameters
thresholds – Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
- setTopK(value)[source]
- Parameters
topK – The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0
- setUniformDrop(value)[source]
- Parameters
uniformDrop – Set this to true to use uniform drop in dart mode
- setUseBarrierExecutionMode(value)[source]
- Parameters
useBarrierExecutionMode – Barrier execution mode which uses a barrier stage, off by default.
- setUseSingleDatasetMode(value)[source]
- Parameters
useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running spark in local mode.
- setValidationIndicatorCol(value)[source]
- Parameters
validationIndicatorCol – Indicates whether the row is for training or validation
- setVerbosity(value)[source]
- Parameters
verbosity – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
- setXgboostDartMode(value)[source]
- Parameters
xgboostDartMode – Set this to true to use xgboost dart mode
- skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
- slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
- thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold")
- timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
- topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0')
- uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
- useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
- useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running spark in local mode.')
- validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
- verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
- xgboostDartMode = Param(parent='undefined', name='xgboostDartMode', doc='Set this to true to use xgboost dart mode')
synapse.ml.lightgbm.LightGBMRanker module
- class synapse.ml.lightgbm.LightGBMRanker.LightGBMRanker(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxPosition=20, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator
- Parameters
baggingFraction (float) – Bagging fraction
baggingFreq (int) – Bagging frequency
baggingSeed (int) – Bagging seed
binSampleCount (int) – Number of samples considered at computing histogram bins
boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence
boostingType (object) – Boosting type; the default, gbdt, is the traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column
categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column
chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
defaultListenPort (int) – The default listen port on executors, used for testing
driverListenPort (int) – The listen port on a driver. Default value is 0 (random)
dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout
earlyStoppingRound (int) – Early stopping round
evalAt (list) – NDCG and MAP evaluation positions, separated by commas
featureFraction (float) – Feature fraction
featuresCol (object) – features column name
featuresShapCol (object) – Output SHAP vector column name after prediction containing the feature contribution values
fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
groupCol (object) – The name of the group column
improvementTolerance (float) – Tolerance to consider improvement in metric
initScoreCol (object) – The name of the initial score column, used for continued training
isProvideTrainingMetric (bool) – Whether to output the metric result over the training dataset.
labelCol (object) – label column name
labelGain (list) – graded relevance for each label in NDCG
lambdaL1 (float) – L1 regularization
lambdaL2 (float) – L2 regularization
leafPredictionCol (object) – Predicted leaf indices' column name
learningRate (float) – Learning rate or shrinkage rate
matrixType (object) – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. The default value is auto, which samples the first ten rows to determine the type.
maxBin (int) – Max bin
maxBinByFeature (list) – Max number of bins for each feature
maxDeltaStep (float) – Used to limit the max output of tree leaves
maxDepth (int) – Max depth
maxDrop (int) – Max number of dropped trees during one boosting iteration
maxPosition (int) – optimizes NDCG at this position
metric (object) – Metrics to be evaluated on the evaluation data. Options are:
empty string or not specified – the metric corresponding to the specified objective is used (possible only for pre-defined objective functions; otherwise no evaluation metric is added)
None (the string, not a None value) – no metric is registered; aliases: na, null, custom
l1 – absolute loss; aliases: mean_absolute_error, mae, regression_l1
l2 – square loss; aliases: mean_squared_error, mse, regression_l2, regression
rmse – root square loss; aliases: root_mean_squared_error, l2_root
quantile – Quantile regression
mape – MAPE loss; aliases: mean_absolute_percentage_error
huber – Huber loss
fair – Fair loss
poisson – negative log-likelihood for Poisson regression
gamma – negative log-likelihood for Gamma regression
gamma_deviance – residual deviance for Gamma regression
tweedie – negative log-likelihood for Tweedie regression
ndcg – NDCG; aliases: lambdarank
map – MAP; aliases: mean_average_precision
auc – AUC
binary_logloss – log loss; aliases: binary
binary_error – for one sample: 0 for correct classification, 1 for incorrect classification
multi_logloss – log loss for multi-class classification; aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr
multi_error – error rate for multi-class classification
cross_entropy – cross-entropy (with optional linear weights); aliases: xentropy
cross_entropy_lambda – intensity-weighted cross-entropy; aliases: xentlambda
kullback_leibler – Kullback-Leibler divergence; aliases: kldiv
minDataInLeaf (int) – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
minGainToSplit (float) – The minimal gain to perform split
minSumHessianInLeaf (float) – Minimal sum of the Hessian in one leaf
modelString (object) – LightGBM model to retrain
negBaggingFraction (float) – Negative Bagging fraction
numBatches (int) – If greater than 0, splits data into separate batches during training
numIterations (int) – Number of iterations; LightGBM constructs num_class * num_iterations trees
numLeaves (int) – Number of leaves
numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
objective (object) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
parallelism (object) – Tree learner parallelism, can be set to data_parallel or voting_parallel
posBaggingFraction (float) – Positive Bagging fraction
predictionCol (object) – prediction column name
repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.
skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration
slotNames (list) – List of slot names in the features column
timeout (float) – Timeout in seconds
topK (int) – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down training. It should be greater than 0
uniformDrop (bool) – Set this to true to use uniform drop in dart mode
useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.
useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running Spark in local mode.
validationIndicatorCol (object) – Indicates whether the row is for training or validation
verbosity (int) – Verbosity, where < 0 is Fatal, 0 is Error, 1 is Info, and > 1 is Debug
weightCol (object) – The name of the weight column
xgboostDartMode (bool) – Set this to true to use xgboost dart mode
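A minimal ranking sketch using the parameters above; train_df, its query_id grouping column, and the evalAt cutoffs are assumptions for illustration:

    # Hedged example: LambdaRank training where rows sharing "query_id"
    # form one ranking group and "label" holds graded relevance.
    from synapse.ml.lightgbm import LightGBMRanker

    ranker = (LightGBMRanker()
              .setObjective("lambdarank")
              .setGroupCol("query_id")
              .setLabelCol("label")
              .setFeaturesCol("features")
              .setEvalAt([1, 3, 5]))   # NDCG/MAP evaluation positions
    model = ranker.fit(train_df)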
- baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
- baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
- baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
- binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
- boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
- boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
- categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
- categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
- chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.')
- defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
- driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
- dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
- earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
- evalAt = Param(parent='undefined', name='evalAt', doc='NDCG and MAP evaluation positions, separated by comma')
- featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
- featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
- fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
- getBinSampleCount()[source]
- Returns
Number of samples considered at computing histogram bins
- Return type
binSampleCount
- getBoostFromAverage()[source]
- Returns
Adjusts initial score to the mean of labels for faster convergence
- Return type
boostFromAverage
- getBoostingType()[source]
- Returns
Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- Return type
boostingType
- getCategoricalSlotIndexes()[source]
- Returns
List of categorical column indexes, the slot index in the features column
- Return type
categoricalSlotIndexes
- getCategoricalSlotNames()[source]
- Returns
List of categorical column slot names, the slot name in the features column
- Return type
categoricalSlotNames
- getChunkSize()[source]
- Returns
Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- Return type
chunkSize
- getDefaultListenPort()[source]
- Returns
The default listen port on executors, used for testing
- Return type
defaultListenPort
- getDriverListenPort()[source]
- Returns
The listen port on a driver. Default value is 0 (random)
- Return type
driverListenPort
- getDropRate()[source]
- Returns
Dropout rate: a fraction of previous trees to drop during the dropout
- Return type
dropRate
- getEvalAt()[source]
- Returns
NDCG and MAP evaluation positions, separated by commas
- Return type
evalAt
- getFeaturesShapCol()[source]
- Returns
Output SHAP vector column name after prediction containing the feature contribution values
- Return type
featuresShapCol
- getFobj()[source]
- Returns
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- Return type
fobj
- getImprovementTolerance()[source]
- Returns
Tolerance to consider improvement in metric
- Return type
improvementTolerance
- getInitScoreCol()[source]
- Returns
The name of the initial score column, used for continued training
- Return type
initScoreCol
- getIsProvideTrainingMetric()[source]
- Returns
Whether to output the metric result over the training dataset.
- Return type
isProvideTrainingMetric
- getLeafPredictionCol()[source]
- Returns
Predicted leaf indices' column name
- Return type
leafPredictionCol
- getMatrixType()[source]
- Returns
Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- Return type
matrixType
- getMaxBinByFeature()[source]
- Returns
Max number of bins for each feature
- Return type
maxBinByFeature
- getMaxDeltaStep()[source]
- Returns
Used to limit the max output of tree leaves
- Return type
maxDeltaStep
- getMaxDrop()[source]
- Returns
Max number of dropped trees during one boosting iteration
- Return type
maxDrop
- getMetric()[source]
- Returns
Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- Return type
metric
- getMinDataInLeaf()[source]
- Returns
Minimal number of data in one leaf. Can be used to deal with over-fitting.
- Return type
minDataInLeaf
- getMinSumHessianInLeaf()[source]
- Returns
Minimal sum hessian in one leaf
- Return type
minSumHessianInLeaf
- getNumBatches()[source]
- Returns
If greater than 0, splits data into separate batches during training
- Return type
numBatches
- getNumIterations()[source]
- Returns
Number of iterations, LightGBM constructs num_class * num_iterations trees
- Return type
numIterations
- getNumTasks()[source]
- Returns
Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- Return type
numTasks
- getNumThreads()[source]
- Returns
Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- Return type
numThreads
- getObjective()[source]
- Returns
The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- Return type
objective
- getParallelism()[source]
- Returns
Tree learner parallelism, can be set to data_parallel or voting_parallel
- Return type
parallelism
- getRepartitionByGroupingColumn()[source]
- Returns
Repartition training data according to grouping column, on by default.
- Return type
repartitionByGroupingColumn
- getSkipDrop()[source]
- Returns
Probability of skipping the dropout procedure during a boosting iteration
- Return type
skipDrop
- getTopK()[source]
- Returns
The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0
- Return type
topK
- getUniformDrop()[source]
- Returns
Set this to true to use uniform drop in dart mode
- Return type
uniformDrop
- getUseBarrierExecutionMode()[source]
- Returns
Barrier execution mode which uses a barrier stage, off by default.
- Return type
useBarrierExecutionMode
- getUseSingleDatasetMode()[source]
- Returns
Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note this is disabled when running spark in local mode.
- Return type
useSingleDatasetMode
- getValidationIndicatorCol()[source]
- Returns
Indicates whether the row is for training or validation
- Return type
validationIndicatorCol
- getVerbosity()[source]
- Returns
Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
- Return type
verbosity
- getXgboostDartMode()[source]
- Returns
Set this to true to use xgboost dart mode
- Return type
xgboostDartMode
- groupCol = Param(parent='undefined', name='groupCol', doc='The name of the group column')
- improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
- initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
- isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether to output the metric result over the training dataset.')
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
- labelGain = Param(parent='undefined', name='labelGain', doc='graded relevance for each label in NDCG')
- lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
- lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
- leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices' column name")
- learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
- matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.')
- maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
- maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
- maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
- maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
- maxPosition = Param(parent='undefined', name='maxPosition', doc='optimized NDCG at this position')
- metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.')
- minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data points in one leaf. Can be used to deal with over-fitting.')
- minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
- minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum of the Hessian in one leaf')
- modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
- negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
- numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
- numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations; LightGBM constructs num_class * num_iterations trees')
- numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
- numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.')
- numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.')
- objective = Param(parent='undefined', name='objective', doc='The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.')
- parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
- posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
- predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
- repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to the grouping column; on by default.')
- setBinSampleCount(value)[source]
- Parameters
binSampleCount – Number of samples considered at computing histogram bins
- setBoostFromAverage(value)[source]
- Parameters
boostFromAverage – Adjusts initial score to the mean of labels for faster convergence
- setBoostingType(value)[source]
- Parameters
boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- setCategoricalSlotIndexes(value)[source]
- Parameters
categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column
- setCategoricalSlotNames(value)[source]
- Parameters
categoricalSlotNames – List of categorical column slot names, the slot name in the features column
- setChunkSize(value)[source]
- Parameters
chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- setDefaultListenPort(value)[source]
- Parameters
defaultListenPort – The default listen port on executors, used for testing
- setDriverListenPort(value)[source]
- Parameters
driverListenPort – The listen port on a driver. Default value is 0 (random)
- setDropRate(value)[source]
- Parameters
dropRate – Dropout rate: a fraction of previous trees to drop during the dropout
- setFeaturesShapCol(value)[source]
- Parameters
featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values
- setFobj(value)[source]
- Parameters
fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- setImprovementTolerance(value)[source]
- Parameters
improvementTolerance – Tolerance to consider improvement in metric
- setInitScoreCol(value)[source]
- Parameters
initScoreCol – The name of the initial score column, used for continued training
- setIsProvideTrainingMetric(value)[source]
- Parameters
isProvideTrainingMetric – Whether to output metric results over the training dataset.
- setLeafPredictionCol(value)[source]
- Parameters
leafPredictionCol – Predicted leaf indices' column name
- setMatrixType(value)[source]
- Parameters
matrixType – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.
- setMaxDeltaStep(value)[source]
- Parameters
maxDeltaStep – Used to limit the max output of tree leaves
- setMaxDrop(value)[source]
- Parameters
maxDrop – Max number of dropped trees during one boosting iteration
- setMetric(value)[source]
- Parameters
metric – Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- setMinDataInLeaf(value)[source]
- Parameters
minDataInLeaf – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
- setMinSumHessianInLeaf(value)[source]
- Parameters
minSumHessianInLeaf – Minimal sum of the Hessian in one leaf
- setNumBatches(value)[source]
- Parameters
numBatches – If greater than 0, splits data into separate batches during training
- setNumIterations(value)[source]
- Parameters
numIterations – Number of iterations; LightGBM constructs num_class * num_iterations trees
- setNumTasks(value)[source]
- Parameters
numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.
- setNumThreads(value)[source]
- Parameters
numThreads – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- setObjective(value)[source]
- Parameters
objective – The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- setParallelism(value)[source]
- Parameters
parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel
- setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxPosition=20, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Set the (keyword-only) parameters.
- setRepartitionByGroupingColumn(value)[source]
- Parameters
repartitionByGroupingColumn – Repartition training data according to the grouping column; on by default.
- setSkipDrop(value)[source]
- Parameters
skipDrop – Probability of skipping the dropout procedure during a boosting iteration
- setTopK(value)[source]
- Parameters
topK – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.
- setUniformDrop(value)[source]
- Parameters
uniformDrop – Set this to true to use uniform drop in dart mode
- setUseBarrierExecutionMode(value)[source]
- Parameters
useBarrierExecutionMode – Barrier execution mode, which uses a barrier stage; off by default.
- setUseSingleDatasetMode(value)[source]
- Parameters
useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.
- setValidationIndicatorCol(value)[source]
- Parameters
validationIndicatorCol – Indicates whether the row is for training or validation
- setVerbosity(value)[source]
- Parameters
verbosity – Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug
- setXgboostDartMode(value)[source]
- Parameters
xgboostDartMode – Set this to true to use xgboost dart mode
- skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
- slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
- timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
- topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.')
- uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
- useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode, which uses a barrier stage; off by default.')
- useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.')
- validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
- verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
- xgboostDartMode = Param(parent='undefined', name='xgboostDartMode', doc='Set this to true to use xgboost dart mode')
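For orientation, the sketch below trains and applies a ranker end to end. It is a minimal example under stated assumptions: train_df and test_df are hypothetical DataFrames whose features column is a Spark ML vector, label holds integer relevance grades, and query identifies the group each row belongs to; the setters follow the uniform get/set pattern documented above.

    from synapse.ml.lightgbm import LightGBMRanker

    # Hypothetical input: 'features' (vector), 'label' (relevance), 'query' (group id).
    ranker = (LightGBMRanker()
              .setLabelCol("label")
              .setFeaturesCol("features")
              .setGroupCol("query")
              .setObjective("lambdarank")
              .setEvalAt([1, 3, 5])       # NDCG evaluation positions
              .setNumIterations(100)
              .setNumLeaves(31))

    ranker_model = ranker.fit(train_df)       # distributed training
    scored = ranker_model.transform(test_df)  # adds a 'prediction' column

Because repartitionByGroupingColumn is on by default, rows that share a query value are kept together during distributed training.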
synapse.ml.lightgbm.LightGBMRankerModel module
- class synapse.ml.lightgbm.LightGBMRankerModel.LightGBMRankerModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictionCol='prediction', startIteration=0)[source]
Bases:
synapse.ml.lightgbm.mixin.LightGBMModelMixin
,synapse.ml.lightgbm._LightGBMRankerModel._LightGBMRankerModel
synapse.ml.lightgbm.LightGBMRegressionModel module
- class synapse.ml.lightgbm.LightGBMRegressionModel.LightGBMRegressionModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictionCol='prediction', startIteration=0)[source]
Bases:
synapse.ml.lightgbm.mixin.LightGBMModelMixin
,synapse.ml.lightgbm._LightGBMRegressionModel._LightGBMRegressionModel
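A minimal save/load sketch for the model classes above, assuming the saveNativeModel and loadNativeModelFromFile helpers that SynapseML's LightGBM models expose (they are not listed in this excerpt); the path and DataFrames are placeholders:

    from synapse.ml.lightgbm import LightGBMRegressionModel

    # 'model' is assumed to be an already-trained LightGBMRegressionModel.
    model.saveNativeModel("/tmp/lgbm_regressor")  # hypothetical output path

    reloaded = LightGBMRegressionModel.loadNativeModelFromFile("/tmp/lgbm_regressor")
    predictions = reloaded.transform(test_df)     # test_df is a placeholder DataFrame

The saved representation is a plain LightGBM model, so a model exported this way can also be read by non-Spark LightGBM tooling.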
synapse.ml.lightgbm.LightGBMRegressor module
- class synapse.ml.lightgbm.LightGBMRegressor.LightGBMRegressor(java_obj=None, alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Bases:
synapse.ml.core.schema.Utils.ComplexParamsMixin
,pyspark.ml.util.JavaMLReadable
,pyspark.ml.util.JavaMLWritable
,pyspark.ml.wrapper.JavaEstimator
- Parameters
alpha (float) – parameter for Huber loss and Quantile regression
baggingFraction (float) – Bagging fraction
baggingFreq (int) – Bagging frequency
baggingSeed (int) – Bagging seed
binSampleCount (int) – Number of samples considered at computing histogram bins
boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence
boostingType (object) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column
categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column
chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
defaultListenPort (int) – The default listen port on executors, used for testing
driverListenPort (int) – The listen port on a driver. Default value is 0 (random)
dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout
earlyStoppingRound (int) – Early stopping round
featureFraction (float) – Feature fraction
featuresCol (object) – features column name
featuresShapCol (object) – Output SHAP vector column name after prediction containing the feature contribution values
fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
improvementTolerance (float) – Tolerance to consider improvement in metric
initScoreCol (object) – The name of the initial score column, used for continued training
isProvideTrainingMetric (bool) – Whether to output metric results over the training dataset.
labelCol (object) – label column name
lambdaL1 (float) – L1 regularization
lambdaL2 (float) – L2 regularization
leafPredictionCol (object) – Predicted leaf indices' column name
learningRate (float) – Learning rate or shrinkage rate
matrixType (object) – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.
maxBin (int) – Max bin
maxBinByFeature (list) – Max number of bins for each feature
maxDeltaStep (float) – Used to limit the max output of tree leaves
maxDepth (int) – Max depth
maxDrop (int) – Max number of dropped trees during one boosting iteration
metric (object) – Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
minDataInLeaf (int) – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
minGainToSplit (float) – The minimal gain to perform split
minSumHessianInLeaf (float) – Minimal sum of the Hessian in one leaf
modelString (object) – LightGBM model to retrain
negBaggingFraction (float) – Negative Bagging fraction
numBatches (int) – If greater than 0, splits data into separate batches during training
numIterations (int) – Number of iterations; LightGBM constructs num_class * num_iterations trees
numLeaves (int) – Number of leaves
numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.
numThreads (int) – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
objective (object) – The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
parallelism (object) – Tree learner parallelism, can be set to data_parallel or voting_parallel
posBaggingFraction (float) – Positive Bagging fraction
predictionCol (object) – prediction column name
repartitionByGroupingColumn (bool) – Repartition training data according to the grouping column; on by default.
skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration
slotNames (list) – List of slot names in the features column
timeout (float) – Timeout in seconds
topK (int) – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.
tweedieVariancePower (float) – Controls the variance of the Tweedie distribution; must be between 1 and 2
uniformDrop (bool) – Set this to true to use uniform drop in dart mode
useBarrierExecutionMode (bool) – Barrier execution mode, which uses a barrier stage; off by default.
useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.
validationIndicatorCol (object) – Indicates whether the row is for training or validation
verbosity (int) – Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug
weightCol (object) – The name of the weight column
xgboostDartMode (bool) – Set this to true to use xgboost dart mode
- alpha = Param(parent='undefined', name='alpha', doc='parameter for Huber loss and Quantile regression')
- baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
- baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
- baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
- binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
- boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
- boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
- categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
- categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
- chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.')
- defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
- driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
- dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
- earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
- featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
- featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
- fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
- getBinSampleCount()[source]
- Returns
Number of samples considered at computing histogram bins
- Return type
binSampleCount
- getBoostFromAverage()[source]
- Returns
Adjusts initial score to the mean of labels for faster convergence
- Return type
boostFromAverage
- getBoostingType()[source]
- Returns
Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- Return type
boostingType
- getCategoricalSlotIndexes()[source]
- Returns
List of categorical column indexes, the slot index in the features column
- Return type
categoricalSlotIndexes
- getCategoricalSlotNames()[source]
- Returns
List of categorical column slot names, the slot name in the features column
- Return type
categoricalSlotNames
- getChunkSize()[source]
- Returns
Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- Return type
chunkSize
- getDefaultListenPort()[source]
- Returns
The default listen port on executors, used for testing
- Return type
defaultListenPort
- getDriverListenPort()[source]
- Returns
The listen port on a driver. Default value is 0 (random)
- Return type
driverListenPort
- getDropRate()[source]
- Returns
Dropout rate: a fraction of previous trees to drop during the dropout
- Return type
dropRate
- getFeaturesShapCol()[source]
- Returns
Output SHAP vector column name after prediction containing the feature contribution values
- Return type
featuresShapCol
- getFobj()[source]
- Returns
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- Return type
fobj
- getImprovementTolerance()[source]
- Returns
Tolerance to consider improvement in metric
- Return type
improvementTolerance
- getInitScoreCol()[source]
- Returns
The name of the initial score column, used for continued training
- Return type
initScoreCol
- getIsProvideTrainingMetric()[source]
- Returns
Whether to output metric results over the training dataset.
- Return type
isProvideTrainingMetric
- getLeafPredictionCol()[source]
- Returns
Predicted leaf indices' column name
- Return type
leafPredictionCol
- getMatrixType()[source]
- Returns
Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.
- Return type
matrixType
- getMaxBinByFeature()[source]
- Returns
Max number of bins for each feature
- Return type
maxBinByFeature
- getMaxDeltaStep()[source]
- Returns
Used to limit the max output of tree leaves
- Return type
maxDeltaStep
- getMaxDrop()[source]
- Returns
Max number of dropped trees during one boosting iteration
- Return type
maxDrop
- getMetric()[source]
- Returns
Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- Return type
metric
- getMinDataInLeaf()[source]
- Returns
Minimal number of data points in one leaf. Can be used to deal with over-fitting.
- Return type
minDataInLeaf
- getMinSumHessianInLeaf()[source]
- Returns
Minimal sum of the Hessian in one leaf
- Return type
minSumHessianInLeaf
- getNumBatches()[source]
- Returns
If greater than 0, splits data into separate batches during training
- Return type
numBatches
- getNumIterations()[source]
- Returns
Number of iterations; LightGBM constructs num_class * num_iterations trees
- Return type
numIterations
- getNumTasks()[source]
- Returns
Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.
- Return type
numTasks
- getNumThreads()[source]
- Returns
Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- Return type
numThreads
- getObjective()[source]
- Returns
The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- Return type
objective
- getParallelism()[source]
- Returns
Tree learner parallelism, can be set to data_parallel or voting_parallel
- Return type
parallelism
- getRepartitionByGroupingColumn()[source]
- Returns
Repartition training data according to the grouping column; on by default.
- Return type
repartitionByGroupingColumn
- getSkipDrop()[source]
- Returns
Probability of skipping the dropout procedure during a boosting iteration
- Return type
skipDrop
- getTopK()[source]
- Returns
The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.
- Return type
topK
- getTweedieVariancePower()[source]
- Returns
Controls the variance of the Tweedie distribution; must be between 1 and 2
- Return type
tweedieVariancePower
- getUniformDrop()[source]
- Returns
Set this to true to use uniform drop in dart mode
- Return type
uniformDrop
- getUseBarrierExecutionMode()[source]
- Returns
Barrier execution mode, which uses a barrier stage; off by default.
- Return type
useBarrierExecutionMode
- getUseSingleDatasetMode()[source]
- Returns
Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.
- Return type
useSingleDatasetMode
- getValidationIndicatorCol()[source]
- Returns
Indicates whether the row is for training or validation
- Return type
validationIndicatorCol
- getVerbosity()[source]
- Returns
Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug
- Return type
verbosity
- getXgboostDartMode()[source]
- Returns
Set this to true to use xgboost dart mode
- Return type
xgboostDartMode
- improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
- initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
- isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether to output metric results over the training dataset.')
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
- lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
- lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
- leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices' column name")
- learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
- matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.')
- maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
- maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
- maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
- maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
- metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.')
- minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data points in one leaf. Can be used to deal with over-fitting.')
- minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
- minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum of the Hessian in one leaf')
- modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
- negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
- numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
- numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations; LightGBM constructs num_class * num_iterations trees')
- numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
- numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.')
- numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.')
- objective = Param(parent='undefined', name='objective', doc='The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.')
- parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
- posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
- predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
- repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to the grouping column; on by default.')
- setBinSampleCount(value)[source]
- Parameters
binSampleCount – Number of samples considered at computing histogram bins
- setBoostFromAverage(value)[source]
- Parameters
boostFromAverage – Adjusts initial score to the mean of labels for faster convergence
- setBoostingType(value)[source]
- Parameters
boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- setCategoricalSlotIndexes(value)[source]
- Parameters
categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column
- setCategoricalSlotNames(value)[source]
- Parameters
categoricalSlotNames – List of categorical column slot names, the slot name in the features column
- setChunkSize(value)[source]
- Parameters
chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during the data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- setDefaultListenPort(value)[source]
- Parameters
defaultListenPort – The default listen port on executors, used for testing
- setDriverListenPort(value)[source]
- Parameters
driverListenPort – The listen port on a driver. Default value is 0 (random)
- setDropRate(value)[source]
- Parameters
dropRate – Dropout rate: a fraction of previous trees to drop during the dropout
- setFeaturesShapCol(value)[source]
- Parameters
featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values
- setFobj(value)[source]
- Parameters
fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- setImprovementTolerance(value)[source]
- Parameters
improvementTolerance – Tolerance to consider improvement in metric
- setInitScoreCol(value)[source]
- Parameters
initScoreCol – The name of the initial score column, used for continued training
- setIsProvideTrainingMetric(value)[source]
- Parameters
isProvideTrainingMetric – Whether to output metric results over the training dataset.
- setLeafPredictionCol(value)[source]
- Parameters
leafPredictionCol – Predicted leaf indices' column name
- setMatrixType(value)[source]
- Parameters
matrixType – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.
- setMaxDeltaStep(value)[source]
- Parameters
maxDeltaStep – Used to limit the max output of tree leaves
- setMaxDrop(value)[source]
- Parameters
maxDrop – Max number of dropped trees during one boosting iteration
- setMetric(value)[source]
- Parameters
metric – Metrics to be evaluated on the evaluation data. Options are: an empty string (or not specifying the parameter) means that the metric corresponding to the specified objective will be used (this is possible only for pre-defined objective functions; otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for incorrect classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- setMinDataInLeaf(value)[source]
- Parameters
minDataInLeaf – Minimal number of data points in one leaf. Can be used to deal with over-fitting.
- setMinSumHessianInLeaf(value)[source]
- Parameters
minSumHessianInLeaf – Minimal sum of the Hessian in one leaf
- setNumBatches(value)[source]
- Parameters
numBatches – If greater than 0, splits data into separate batches during training
- setNumIterations(value)[source]
- Parameters
numIterations – Number of iterations; LightGBM constructs num_class * num_iterations trees
- setNumTasks(value)[source]
- Parameters
numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override it.
- setNumThreads(value)[source]
- Parameters
numThreads – Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores.
- setObjective(value)[source]
- Parameters
objective – The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- setParallelism(value)[source]
- Parameters
parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel
- setParams(alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', categoricalSlotIndexes=[], categoricalSlotNames=[], chunkSize=10000, defaultListenPort=12400, driverListenPort=0, dropRate=0.1, earlyStoppingRound=0, featureFraction=1.0, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, metric='', minDataInLeaf=20, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', parallelism='data_parallel', posBaggingFraction=1.0, predictionCol='prediction', repartitionByGroupingColumn=True, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useSingleDatasetMode=False, validationIndicatorCol=None, verbosity=-1, weightCol=None, xgboostDartMode=False)[source]
Set the (keyword-only) parameters.
- setRepartitionByGroupingColumn(value)[source]
- Parameters
repartitionByGroupingColumn – Repartition training data according to the grouping column; on by default.
- setSkipDrop(value)[source]
- Parameters
skipDrop – Probability of skipping the dropout procedure during a boosting iteration
- setTopK(value)[source]
- Parameters
topK – The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.
- setTweedieVariancePower(value)[source]
- Parameters
tweedieVariancePower – Controls the variance of the Tweedie distribution; must be between 1 and 2
- setUniformDrop(value)[source]
- Parameters
uniformDrop – Set this to true to use uniform drop in dart mode
- setUseBarrierExecutionMode(value)[source]
- Parameters
useBarrierExecutionMode – Barrier execution mode, which uses a barrier stage; off by default.
- setUseSingleDatasetMode(value)[source]
- Parameters
useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.
- setValidationIndicatorCol(value)[source]
- Parameters
validationIndicatorCol – Indicates whether the row is for training or validation
- setVerbosity(value)[source]
- Parameters
verbosity – Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug
- setXgboostDartMode(value)[source]
- Parameters
xgboostDartMode – Set this to true to use xgboost dart mode
- skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
- slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
- timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
- topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel; set this to a larger value for a more accurate result, but it will slow down the training speed. It should be greater than 0.')
- tweedieVariancePower = Param(parent='undefined', name='tweedieVariancePower', doc='Controls the variance of the Tweedie distribution; must be between 1 and 2')
- uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
- useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode, which uses a barrier stage; off by default.')
- useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead. Note that this is disabled when running Spark in local mode.')
- validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
- verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity, where < 0 is Fatal, = 0 is Error, = 1 is Info, and > 1 is Debug')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
- xgboostDartMode = Param(parent='undefined', name='xgboostDartMode', doc='Set this to true to use xgboost dart mode')
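To make the interplay of these parameters concrete, here is a minimal quantile-regression sketch with early stopping. It rests on assumptions: df is a hypothetical DataFrame with a vector features column, a numeric label column, and a boolean validation column marking held-out rows, and the setters follow the get/set pattern documented above.

    from synapse.ml.lightgbm import LightGBMRegressor

    regressor = (LightGBMRegressor()
                 .setObjective("quantile")
                 .setAlpha(0.3)          # quantile level; alpha matters only for huber/quantile
                 .setLearningRate(0.1)
                 .setNumIterations(200)
                 .setValidationIndicatorCol("validation")  # True rows are used for validation
                 .setEarlyStoppingRound(20))               # stop after 20 rounds without improvement

    model = regressor.fit(df)

Early stopping is evaluated on the rows flagged by validationIndicatorCol, using the metric implied by the objective unless metric is set explicitly.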
synapse.ml.lightgbm.mixin module
- class synapse.ml.lightgbm.mixin.LightGBMModelMixin[source]
Bases:
object
- getBoosterBestIteration()[source]
Get the best iteration from the booster.
- Returns
The best iteration, if early stopping was triggered.
- getBoosterNumFeatures()[source]
Get the number of features from the booster.
- Returns
The number of features.
- getBoosterNumTotalIterations()[source]
Get the total number of iterations trained.
- Returns
The total number of iterations trained.
- getBoosterNumTotalModel()[source]
Get the total number of models trained.
- Returns
The total number of models.
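For example, a model produced by any of the estimators above shares this mixin, so the trained booster can be inspected directly; model below is a placeholder for such a trained model:

    # Inspect the underlying booster of a trained SynapseML LightGBM model.
    print("best iteration:", model.getBoosterBestIteration())  # meaningful only if early stopping triggered
    print("num features:", model.getBoosterNumFeatures())
    print("total iterations:", model.getBoosterNumTotalIterations())
    print("total models:", model.getBoosterNumTotalModel())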
Module contents
SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.
SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.
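As a practical note, the package is typically attached to a session through Spark's package resolution. The Maven coordinate and version below are illustrative assumptions; substitute the SynapseML release that matches your Spark and Scala versions:

    from pyspark.sql import SparkSession

    # Illustrative coordinate for a Spark 3.x / Scala 2.12 environment.
    spark = (SparkSession.builder
             .appName("synapseml-lightgbm")
             .config("spark.jars.packages",
                     "com.microsoft.azure:synapseml_2.12:0.9.5")
             .getOrCreate())

    from synapse.ml.lightgbm import LightGBMClassifier  # now resolvable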