synapse.ml.lightgbm package
Submodules
synapse.ml.lightgbm.LightGBMClassificationModel module
- class synapse.ml.lightgbm.LightGBMClassificationModel.LightGBMClassificationModel(java_obj=None, actualNumClasses=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictDisableShapeCheck=False, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', startIteration=0, thresholds=None)[source]
Bases: LightGBMModelMixin, _LightGBMClassificationModel
synapse.ml.lightgbm.LightGBMClassifier module
- class synapse.ml.lightgbm.LightGBMClassifier.LightGBMClassifier(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=-1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator
- Parameters:
binSampleCount (int) – Number of samples considered when computing histogram bins
boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence
boostingType (str) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
catSmooth (float) – Reduces the effect of noise in categorical features, especially for categories with few data points
categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column
categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column
chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set to the number of rows in the dataset.
dataRandomSeed (int) – Random seed for sampling data to construct histogram bins.
dataTransferMode (str) – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. bulk is the legacy mode; the constructor signature above defaults to streaming.
defaultListenPort (int) – The default listen port on executors, used for testing
deterministic (bool) – Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true
driverListenPort (int) – The listen port on a driver. Default value is 0 (random)
dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout
dropSeed (int) – Random seed to choose dropping models. Only used in dart.
executionMode (str) – Deprecated. Please use dataTransferMode.
extraSeed (int) – Random seed for selecting threshold when extra_trees is true
featuresShapCol (str) – Output SHAP vector column name after prediction containing the feature contribution values
fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
improvementTolerance (float) – Tolerance to consider improvement in metric
initScoreCol (str) – The name of the initial score column, used for continued training
isEnableSparse (bool) – Used to enable/disable sparse optimization
isProvideTrainingMetric (bool) – Whether to output metric results over the training dataset.
isUnbalance (bool) – Set to true if training data is unbalanced in a binary classification scenario
leafPredictionCol (str) – Column name for predicted leaf indices
matrixType (str) – Advanced parameter to specify whether the native LightGBM matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples the first ten rows to determine the type.
maxBinByFeature (list) – Max number of bins for each feature
maxCatThreshold (int) – Limits the number of split points considered for categorical features
maxCatToOnehot (int) – When the number of categories of one feature is smaller than or equal to this, the one-vs-other split algorithm is used
maxDeltaStep (float) – Used to limit the max output of tree leaves
maxDrop (int) – Max number of dropped trees during one boosting iteration
maxNumClasses (int) – Maximum number of classes used to infer numClass in multi-class classification.
maxStreamingOMPThreads (int) – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.
metric (str) – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
microBatchSize (int) – Specify how many elements are sent in a streaming micro-batch.
minDataInLeaf (int) – Minimum number of data points in one leaf. Can be used to deal with over-fitting.
minDataPerBin (int) – Minimum number of data points inside one bin
minDataPerGroup (int) – Minimum number of data points per categorical group
minSumHessianInLeaf (float) – Minimum sum of Hessian values in one leaf
monotoneConstraints (list) – Used to constrain monotonic features. 1 means increasing, -1 means decreasing, 0 means no constraint. Specify all features in order.
monotoneConstraintsMethod (str) – Monotone constraints method: basic, intermediate, or advanced.
monotonePenalty (float) – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
numBatches (int) – If greater than 0, splits data into separate batches during training
numIterations (int) – Number of iterations. LightGBM constructs num_class * num_iterations trees
numTasks (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
numThreads (int) – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.
objective (str) – The objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
objectiveSeed (int) – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.
otherRate (float) – The retain ratio of small gradient data. Only used in goss.
parallelism (str) – Tree learner parallelism, can be set to data_parallel or voting_parallel
passThroughArgs (str) – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. For example: force_row_wise=true
predictDisableShapeCheck (bool) – Controls whether LightGBM raises an error when you try to predict on data with a different number of features than the training data
probabilityCol (str) – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.
rawPredictionCol (str) – Raw prediction (a.k.a. confidence) column name
referenceDataset (list) – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().
repartitionByGroupingColumn (bool) – Repartition training data according to grouping column, on by default.
samplingMode (str) – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.
samplingSubsetSize (int) – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
skipDrop (float) – Probability of skipping the dropout procedure during a boosting iteration
slotNames (list) – List of slot names in the features column
thresholds (list) – Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold
topK (int) – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0
topRate (float) – The retain ratio of large gradient data. Only used in goss.
uniformDrop (bool) – Set this to true to use uniform drop in dart mode
useBarrierExecutionMode (bool) – Barrier execution mode which uses a barrier stage, off by default.
useMissing (bool) – Set this to false to disable the special handling of missing values
useSingleDatasetMode (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.
validationIndicatorCol (str) – Indicates whether the row is for training or validation
verbosity (int) – Verbosity: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug
xGBoostDartMode (bool) – Set this to true to use xgboost dart mode
zeroAsMissing (bool) – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
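Several of the string parameters above accept only a fixed set of values. The following is a minimal sketch of a pre-flight check built from those documented sets. The helper `check_choices` and the `DOCUMENTED_CHOICES` table are hypothetical illustrations and not part of SynapseML; the library may accept further aliases that are not listed here.

```python
# Allowed values copied from the parameter descriptions in this reference.
# This table is an illustration only; SynapseML does not expose it.
DOCUMENTED_CHOICES = {
    "boostingType": {"gbdt", "gbrt", "rf", "random_forest", "dart", "goss"},
    "dataTransferMode": {"streaming", "bulk"},
    "matrixType": {"auto", "sparse", "dense"},
    "monotoneConstraintsMethod": {"basic", "intermediate", "advanced"},
    "parallelism": {"data_parallel", "voting_parallel"},
}


def check_choices(params):
    """Return error messages for values outside the documented sets.

    Hypothetical helper: parameters without a documented value set
    (for example numLeaves) are passed through unchecked.
    """
    errors = []
    for name, value in params.items():
        allowed = DOCUMENTED_CHOICES.get(name)
        if allowed is not None and value not in allowed:
            errors.append(f"{name}={value!r} not in {sorted(allowed)}")
    return errors


params = {"boostingType": "goss", "parallelism": "voting_parallel", "numLeaves": 31}
print(check_choices(params))                  # no problems found
print(check_choices({"matrixType": "csr"}))   # one message
```

Once the check passes, the same dictionary could be forwarded as keyword arguments to the constructor shown in the signature, for example `LightGBMClassifier(**params)`.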
- baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
- baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
- baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
- binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
- boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
- boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
- catSmooth = Param(parent='undefined', name='catSmooth', doc='this can reduce the effect of noises in categorical features, especially for categories with few data')
- categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
- categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
- catl2 = Param(parent='undefined', name='catl2', doc='L2 regularization in categorical split')
- chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy.If dataset size is known beforehand, set to the number of rows in the dataset.')
- dataRandomSeed = Param(parent='undefined', name='dataRandomSeed', doc='Random seed for sampling data to construct histogram bins.')
- dataTransferMode = Param(parent='undefined', name='dataTransferMode', doc='Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming, bulk. Default is bulk, which is the legacy mode.')
- defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
- deterministic = Param(parent='undefined', name='deterministic', doc='Used only with cpu devide type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true')
- driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
- dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
- dropSeed = Param(parent='undefined', name='dropSeed', doc='Random seed to choose dropping models. Only used in dart.')
- earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
- executionMode = Param(parent='undefined', name='executionMode', doc='Deprecated. Please use dataTransferMode.')
- extraSeed = Param(parent='undefined', name='extraSeed', doc='Random seed for selecting threshold when extra_trees is true')
- featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
- featureFractionByNode = Param(parent='undefined', name='featureFractionByNode', doc='Feature fraction by node')
- featureFractionSeed = Param(parent='undefined', name='featureFractionSeed', doc='Feature fraction seed')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
- featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
- fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
- getBinSampleCount()[source]
- Returns:
Number of samples considered at computing histogram bins
- Return type:
binSampleCount
- getBoostFromAverage()[source]
- Returns:
Adjusts initial score to the mean of labels for faster convergence
- Return type:
boostFromAverage
- getBoostingType()[source]
- Returns:
Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- Return type:
boostingType
- getCatSmooth()[source]
- Returns:
this can reduce the effect of noises in categorical features, especially for categories with few data
- Return type:
catSmooth
- getCategoricalSlotIndexes()[source]
- Returns:
List of categorical column indexes, the slot index in the features column
- Return type:
categoricalSlotIndexes
- getCategoricalSlotNames()[source]
- Returns:
List of categorical column slot names, the slot name in the features column
- Return type:
categoricalSlotNames
- getChunkSize()[source]
- Returns:
Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set to the number of rows in the dataset.
- Return type:
chunkSize
- getDataRandomSeed()[source]
- Returns:
Random seed for sampling data to construct histogram bins.
- Return type:
dataRandomSeed
- getDataTransferMode()[source]
- Returns:
Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. bulk is the legacy mode; the constructor signature above defaults to streaming.
- Return type:
dataTransferMode
- getDefaultListenPort()[source]
- Returns:
The default listen port on executors, used for testing
- Return type:
defaultListenPort
- getDeterministic()[source]
- Returns:
Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true
- Return type:
deterministic
- getDriverListenPort()[source]
- Returns:
The listen port on a driver. Default value is 0 (random)
- Return type:
driverListenPort
- getDropRate()[source]
- Returns:
Dropout rate: a fraction of previous trees to drop during the dropout
- Return type:
dropRate
- getDropSeed()[source]
- Returns:
Random seed to choose dropping models. Only used in dart.
- Return type:
dropSeed
- getExecutionMode()[source]
- Returns:
Deprecated. Please use dataTransferMode.
- Return type:
executionMode
- getExtraSeed()[source]
- Returns:
Random seed for selecting threshold when extra_trees is true
- Return type:
extraSeed
- getFeatureFractionByNode()[source]
- Returns:
Feature fraction by node
- Return type:
featureFractionByNode
- getFeaturesShapCol()[source]
- Returns:
Output SHAP vector column name after prediction containing the feature contribution values
- Return type:
featuresShapCol
- getFobj()[source]
- Returns:
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- Return type:
fobj
- getImprovementTolerance()[source]
- Returns:
Tolerance to consider improvement in metric
- Return type:
improvementTolerance
- getInitScoreCol()[source]
- Returns:
The name of the initial score column, used for continued training
- Return type:
initScoreCol
- getIsEnableSparse()[source]
- Returns:
Used to enable/disable sparse optimization
- Return type:
isEnableSparse
- getIsProvideTrainingMetric()[source]
- Returns:
Whether to output metric results over the training dataset.
- Return type:
isProvideTrainingMetric
- getIsUnbalance()[source]
- Returns:
Set to true if training data is unbalanced in binary classification scenario
- Return type:
isUnbalance
- getLeafPredictionCol()[source]
- Returns:
Column name for predicted leaf indices
- Return type:
leafPredictionCol
- getMatrixType()[source]
- Returns:
Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- Return type:
matrixType
- getMaxBinByFeature()[source]
- Returns:
Max number of bins for each feature
- Return type:
maxBinByFeature
- getMaxCatThreshold()[source]
- Returns:
limit number of split points considered for categorical features
- Return type:
maxCatThreshold
- getMaxCatToOnehot()[source]
- Returns:
when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used
- Return type:
maxCatToOnehot
- getMaxDeltaStep()[source]
- Returns:
Used to limit the max output of tree leaves
- Return type:
maxDeltaStep
- getMaxDrop()[source]
- Returns:
Max number of dropped trees during one boosting iteration
- Return type:
maxDrop
- getMaxNumClasses()[source]
- Returns:
Number of max classes to infer numClass in multi-class classification.
- Return type:
maxNumClasses
- getMaxStreamingOMPThreads()[source]
- Returns:
Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.
- Return type:
maxStreamingOMPThreads
- getMetric()[source]
- Returns:
Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- Return type:
metric
- getMicroBatchSize()[source]
- Returns:
Specify how many elements are sent in a streaming micro-batch.
- Return type:
microBatchSize
- getMinDataInLeaf()[source]
- Returns:
Minimal number of data in one leaf. Can be used to deal with over-fitting.
- Return type:
minDataInLeaf
- getMinDataPerBin()[source]
- Returns:
Minimal number of data inside one bin
- Return type:
minDataPerBin
- getMinDataPerGroup()[source]
- Returns:
minimal number of data per categorical group
- Return type:
minDataPerGroup
- getMinSumHessianInLeaf()[source]
- Returns:
Minimal sum hessian in one leaf
- Return type:
minSumHessianInLeaf
- getMonotoneConstraints()[source]
- Returns:
used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
- Return type:
monotoneConstraints
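Because monotoneConstraints must list one value per feature, in feature order, it can be convenient to build the list from feature names. The sketch below assumes a hypothetical helper `build_monotone_constraints` and made-up feature names; neither comes from SynapseML.

```python
def build_monotone_constraints(feature_names, directions):
    """Return one constraint per feature, in feature order.

    directions maps a feature name to 1 (increasing) or -1 (decreasing).
    Features that are not listed get 0 (no constraint).
    Hypothetical helper, not part of SynapseML.
    """
    for name, direction in directions.items():
        if name not in feature_names:
            raise KeyError(f"unknown feature: {name}")
        if direction not in (-1, 0, 1):
            raise ValueError(f"constraint for {name} must be -1, 0 or 1")
    return [directions.get(name, 0) for name in feature_names]


# Illustrative feature names; the order must match the slots of the features column.
features = ["age", "income", "tenure"]
constraints = build_monotone_constraints(features, {"income": 1, "tenure": -1})
print(constraints)  # [0, 1, -1]
```

The resulting list is the shape of value that the monotoneConstraints parameter describes.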
- getMonotoneConstraintsMethod()[source]
- Returns:
Monotone constraints method. basic, intermediate, or advanced.
- Return type:
monotoneConstraintsMethod
- getMonotonePenalty()[source]
- Returns:
A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
- Return type:
monotonePenalty
- getNumBatches()[source]
- Returns:
If greater than 0, splits data into separate batches during training
- Return type:
numBatches
- getNumIterations()[source]
- Returns:
Number of iterations, LightGBM constructs num_class * num_iterations trees
- Return type:
numIterations
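The tree count stated above is simple arithmetic. As a small sketch, assuming that a binary objective counts as a single class for this purpose and that training is not cut short (for example by early stopping), the upper bound on the number of trees can be computed as follows. The function name `total_trees` is hypothetical.

```python
def total_trees(num_iterations, num_class=1):
    """Upper bound on trees built: num_class * num_iterations.

    Assumption of this sketch: num_class is 1 for a binary objective and
    the number of classes for a multiclass objective.
    """
    return num_class * num_iterations


print(total_trees(100))      # binary objective with the default numIterations: 100
print(total_trees(100, 3))   # three-class problem: 300
```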
- getNumTasks()[source]
- Returns:
Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- Return type:
numTasks
- getNumThreads()[source]
- Returns:
Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.
- Return type:
numThreads
- getObjective()[source]
- Returns:
The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- Return type:
objective
- getObjectiveSeed()[source]
- Returns:
Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.
- Return type:
objectiveSeed
- getOtherRate()[source]
- Returns:
The retain ratio of small gradient data. Only used in goss.
- Return type:
otherRate
- getParallelism()[source]
- Returns:
Tree learner parallelism, can be set to data_parallel or voting_parallel
- Return type:
parallelism
- getPassThroughArgs()[source]
- Returns:
Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
- Return type:
passThroughArgs
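When several native options are needed, the pass-through string can be assembled from a dictionary. This sketch assumes that multiple parameters are written as space-separated key=value pairs, following the force_row_wise=true example above; the separator is an assumption, and the helper `to_pass_through_args` is hypothetical.

```python
def to_pass_through_args(options):
    """Join options into a single 'key=value key=value' string.

    Assumption of this sketch: parameters are separated by spaces and
    booleans are written in lowercase, as in force_row_wise=true.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)


args = to_pass_through_args({"force_row_wise": True, "extra_trees": True})
print(args)  # force_row_wise=true extra_trees=true
```

Since pass-through values override explicit setters, it is worth keeping this string short and limited to options that have no dedicated parameter.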
- getPredictDisableShapeCheck()[source]
- Returns:
control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data
- Return type:
predictDisableShapeCheck
- getProbabilityCol()[source]
- Returns:
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- Return type:
probabilityCol
- getRawPredictionCol()[source]
- Returns:
raw prediction (a.k.a. confidence) column name
- Return type:
rawPredictionCol
- getReferenceDataset()[source]
- Returns:
The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().
- Return type:
referenceDataset
- getRepartitionByGroupingColumn()[source]
- Returns:
Repartition training data according to grouping column, on by default.
- Return type:
repartitionByGroupingColumn
- getSamplingMode()[source]
- Returns:
Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.
- Return type:
samplingMode
- getSamplingSubsetSize()[source]
- Returns:
Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
- Return type:
samplingSubsetSize
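The three sampling modes differ only in which rows are eligible for the bin-definition sample. The following pure-Python sketch illustrates one reading of the descriptions above; it is not the SynapseML implementation. In particular, treating ‘fixed’ as "take the first binSampleCount rows" is an assumption, and `pick_sample_indices` is a hypothetical name.

```python
import random


def pick_sample_indices(num_rows, mode, bin_sample_count, subset_size, seed=1):
    """Return sorted row indices that would be sampled for building bins.

    Illustrative only:
      global: random sample over all rows
      subset: random sample drawn from the first subset_size rows
      fixed:  the first rows, with no randomness (assumed interpretation)
    """
    rng = random.Random(seed)
    if mode == "global":
        k = min(bin_sample_count, num_rows)
        return sorted(rng.sample(range(num_rows), k))
    if mode == "subset":
        n = min(subset_size, num_rows)
        return sorted(rng.sample(range(n), min(bin_sample_count, n)))
    if mode == "fixed":
        return list(range(min(bin_sample_count, num_rows)))
    raise ValueError(f"unknown sampling mode: {mode}")


print(pick_sample_indices(100, "fixed", 5, 20))   # [0, 1, 2, 3, 4]
print(pick_sample_indices(100, "subset", 5, 20))  # five indices, all below 20
```

The sketch makes the trade-off visible: ‘subset’ and ‘fixed’ never look past the head of the dataset, which is only representative when rows arrive in random order.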
- getSkipDrop()[source]
- Returns:
Probability of skipping the dropout procedure during a boosting iteration
- Return type:
skipDrop
- getThresholds()[source]
- Returns:
Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold
- Return type:
thresholds
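The p/t rule described above can be illustrated in a few lines of plain Python. This is a sketch of the stated rule, not the library's code; the function name `predict_with_thresholds` is hypothetical, and the handling of a zero threshold (the class wins whenever its probability is positive) is an assumption of this sketch.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p / t."""
    if len(probabilities) != len(thresholds):
        raise ValueError("thresholds must have one entry per class")
    if any(t < 0 for t in thresholds):
        raise ValueError("thresholds must be non-negative")
    if sum(1 for t in thresholds if t == 0) > 1:
        raise ValueError("at most one threshold may be 0")

    def scaled(p, t):
        if t == 0:
            # Assumed behavior for the single zero threshold.
            return float("inf") if p > 0 else float("-inf")
        return p / t

    scores = [scaled(p, t) for p, t in zip(probabilities, thresholds)]
    return max(range(len(scores)), key=scores.__getitem__)


# Class 0 has the higher probability, but the thresholds favor class 1.
print(predict_with_thresholds([0.6, 0.4], [0.8, 0.2]))  # 1
print(predict_with_thresholds([0.6, 0.4], [0.5, 0.5]))  # 0
```

Lowering a class's threshold makes that class easier to predict, which is one way to trade precision for recall on a minority class.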
- getTopK()[source]
- Returns:
The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0
- Return type:
topK
- getTopRate()[source]
- Returns:
The retain ratio of large gradient data. Only used in goss.
- Return type:
topRate
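To show how topRate and otherRate interact in goss, the following sketch follows the published Gradient-based One-Side Sampling idea: keep the topRate fraction of rows with the largest absolute gradients, randomly keep an otherRate fraction of all rows from the remainder, and up-weight those by (1 - topRate) / otherRate. It is a simplified illustration rather than LightGBM's implementation, and `goss_sample` is a hypothetical name.

```python
import random


def goss_sample(gradients, top_rate, other_rate, seed=0):
    """Return (row_index, weight) pairs selected by a GOSS-style rule.

    Simplified sketch: other_rate must be greater than 0.
    """
    n = len(gradients)
    # Rows ordered by decreasing absolute gradient.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(top_rate * n)
    other_n = int(other_rate * n)
    top, rest = order[:top_n], order[top_n:]

    rng = random.Random(seed)
    sampled = rng.sample(rest, min(other_n, len(rest)))

    # Small-gradient rows are amplified to compensate for the undersampling.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sorted(sampled)]


grads = [0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.4, -0.6, 0.7, 0.01]
for index, weight in goss_sample(grads, top_rate=0.2, other_rate=0.1):
    print(index, weight)
```

With the defaults from the signature (topRate=0.2, otherRate=0.1), roughly 30% of the rows are used per iteration in this sketch.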
- getUniformDrop()[source]
- Returns:
Set this to true to use uniform drop in dart mode
- Return type:
uniformDrop
- getUseBarrierExecutionMode()[source]
- Returns:
Barrier execution mode which uses a barrier stage, off by default.
- Return type:
useBarrierExecutionMode
- getUseMissing()[source]
- Returns:
Set this to false to disable the special handling of missing values
- Return type:
useMissing
- getUseSingleDatasetMode()[source]
- Returns:
Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.
- Return type:
useSingleDatasetMode
- getValidationIndicatorCol()[source]
- Returns:
Indicates whether the row is for training or validation
- Return type:
validationIndicatorCol
- getVerbosity()[source]
- Returns:
Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
- Return type:
verbosity
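The verbosity ranges described above map to log levels as in this small sketch. The function name `verbosity_level` is hypothetical; only the mapping itself comes from the parameter description.

```python
def verbosity_level(verbosity):
    """Map a verbosity integer to the log level named in the description."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


for value in (-1, 0, 1, 2):
    print(value, verbosity_level(value))
```

The default in the constructor signature is -1, which corresponds to Fatal in this mapping.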
- getXGBoostDartMode()[source]
- Returns:
Set this to true to use xgboost dart mode
- Return type:
xGBoostDartMode
- getZeroAsMissing()[source]
- Returns:
Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
- Return type:
zeroAsMissing
- improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
- initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
- isEnableSparse = Param(parent='undefined', name='isEnableSparse', doc='Used to enable/disable sparse optimization')
- isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether output metric result over training dataset.')
- isUnbalance = Param(parent='undefined', name='isUnbalance', doc='Set to true if training data is unbalanced in binary classification scenario')
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
- lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
- lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
- leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices's column name")
- learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
- matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
- maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
- maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
- maxCatThreshold = Param(parent='undefined', name='maxCatThreshold', doc='limit number of split points considered for categorical features')
- maxCatToOnehot = Param(parent='undefined', name='maxCatToOnehot', doc='when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used')
- maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
- maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
- maxNumClasses = Param(parent='undefined', name='maxNumClasses', doc='Number of max classes to infer numClass in multi-class classification.')
- maxStreamingOMPThreads = Param(parent='undefined', name='maxStreamingOMPThreads', doc="Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it's best to set a fixed value.")
- metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
- microBatchSize = Param(parent='undefined', name='microBatchSize', doc='Specify how many elements are sent in a streaming micro-batch.')
- minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
- minDataPerBin = Param(parent='undefined', name='minDataPerBin', doc='Minimal number of data inside one bin')
- minDataPerGroup = Param(parent='undefined', name='minDataPerGroup', doc='minimal number of data per categorical group')
- minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
- minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
- modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
- monotoneConstraints = Param(parent='undefined', name='monotoneConstraints', doc='used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.')
- monotoneConstraintsMethod = Param(parent='undefined', name='monotoneConstraintsMethod', doc='Monotone constraints method. basic, intermediate, or advanced.')
- monotonePenalty = Param(parent='undefined', name='monotonePenalty', doc='A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.')
- negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
- numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
- numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
- numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
- numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
- numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.')
- objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
- objectiveSeed = Param(parent='undefined', name='objectiveSeed', doc='Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.')
- otherRate = Param(parent='undefined', name='otherRate', doc='The retain ratio of small gradient data. Only used in goss.')
- parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
- passThroughArgs = Param(parent='undefined', name='passThroughArgs', doc='Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true')
- posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
- predictDisableShapeCheck = Param(parent='undefined', name='predictDisableShapeCheck', doc='control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data')
- predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
- probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities')
- rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name')
- referenceDataset = Param(parent='undefined', name='referenceDataset', doc='The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().')
- repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
- samplingMode = Param(parent='undefined', name='samplingMode', doc="Data sampling for streaming mode. Sampled data is used to define bins. 'global': sample from all data, 'subset': sample from first N rows, or 'fixed': Take first N rows as sample.Values can be global, subset, or fixed. Default is subset.")
- samplingSubsetSize = Param(parent='undefined', name='samplingSubsetSize', doc="Specify subset size N for the sampling mode 'subset'. 'binSampleCount' rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.")
- seed = Param(parent='undefined', name='seed', doc='Main seed, used to generate other seeds')
- setBinSampleCount(value)[source]
- Parameters:
binSampleCount – Number of samples considered when computing histogram bins
- setBoostFromAverage(value)[source]
- Parameters:
boostFromAverage – Adjusts initial score to the mean of labels for faster convergence
- setBoostingType(value)[source]
- Parameters:
boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- setCatSmooth(value)[source]
- Parameters:
catSmooth – Reduces the effect of noise in categorical features, especially for categories with few data points
- setCategoricalSlotIndexes(value)[source]
- Parameters:
categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column
- setCategoricalSlotNames(value)[source]
- Parameters:
categoricalSlotNames – List of categorical column slot names, the slot name in the features column
- setChunkSize(value)[source]
- Parameters:
chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set to the number of rows in the dataset.
- setDataRandomSeed(value)[source]
- Parameters:
dataRandomSeed – Random seed for sampling data to construct histogram bins.
- setDataTransferMode(value)[source]
- Parameters:
dataTransferMode – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The signature default is streaming; bulk is the legacy mode.
- setDefaultListenPort(value)[source]
- Parameters:
defaultListenPort – The default listen port on executors, used for testing
- setDeterministic(value)[source]
- Parameters:
deterministic – Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true
- setDriverListenPort(value)[source]
- Parameters:
driverListenPort – The listen port on a driver. Default value is 0 (random)
- setDropRate(value)[source]
- Parameters:
dropRate – Dropout rate: a fraction of previous trees to drop during the dropout
- setDropSeed(value)[source]
- Parameters:
dropSeed – Random seed to choose dropping models. Only used in dart.
- setExecutionMode(value)[source]
- Parameters:
executionMode – Deprecated. Please use dataTransferMode.
- setExtraSeed(value)[source]
- Parameters:
extraSeed – Random seed for selecting threshold when extra_trees is true
- setFeatureFractionByNode(value)[source]
- Parameters:
featureFractionByNode – Feature fraction by node
- setFeaturesShapCol(value)[source]
- Parameters:
featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values
- setFobj(value)[source]
- Parameters:
fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- setImprovementTolerance(value)[source]
- Parameters:
improvementTolerance – Tolerance to consider improvement in metric
- setInitScoreCol(value)[source]
- Parameters:
initScoreCol – The name of the initial score column, used for continued training
- setIsEnableSparse(value)[source]
- Parameters:
isEnableSparse – Used to enable/disable sparse optimization
- setIsProvideTrainingMetric(value)[source]
- Parameters:
isProvideTrainingMetric – Whether to output metric results over the training dataset.
- setIsUnbalance(value)[source]
- Parameters:
isUnbalance – Set to true if training data is unbalanced in binary classification scenario
- setLeafPredictionCol(value)[source]
- Parameters:
leafPredictionCol – Column name for predicted leaf indices
- setMatrixType(value)[source]
- Parameters:
matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- setMaxBinByFeature(value)[source]
- Parameters:
maxBinByFeature – Max number of bins for each feature
- setMaxCatThreshold(value)[source]
- Parameters:
maxCatThreshold – Limits the number of split points considered for categorical features
- setMaxCatToOnehot(value)[source]
- Parameters:
maxCatToOnehot – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used
- setMaxDeltaStep(value)[source]
- Parameters:
maxDeltaStep – Used to limit the max output of tree leaves
- setMaxDrop(value)[source]
- Parameters:
maxDrop – Max number of dropped trees during one boosting iteration
- setMaxNumClasses(value)[source]
- Parameters:
maxNumClasses – Number of max classes to infer numClass in multi-class classification.
- setMaxStreamingOMPThreads(value)[source]
- Parameters:
maxStreamingOMPThreads – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.
- setMetric(value)[source]
- Parameters:
metric – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- setMicroBatchSize(value)[source]
- Parameters:
microBatchSize – Specify how many elements are sent in a streaming micro-batch.
- setMinDataInLeaf(value)[source]
- Parameters:
minDataInLeaf – Minimal number of data in one leaf. Can be used to deal with over-fitting.
- setMinDataPerGroup(value)[source]
- Parameters:
minDataPerGroup – Minimal number of data per categorical group
- setMinSumHessianInLeaf(value)[source]
- Parameters:
minSumHessianInLeaf – Minimal sum hessian in one leaf
- setMonotoneConstraints(value)[source]
- Parameters:
monotoneConstraints – Monotonicity constraints for features. 1 means increasing, -1 means decreasing, 0 means no constraint. Specify all features in order.
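A minimal sketch of assembling such a list in feature order. The helper `build_monotone_constraints` and the feature names are hypothetical and not part of SynapseML; the sketch only encodes the rule stated above (one entry per feature, each being 1, -1, or 0).

```python
def build_monotone_constraints(slot_names, directions):
    """Return one constraint per feature, in feature order.

    slot_names: feature names in the order they appear in the features column.
    directions: dict mapping a feature name to 1 (increasing),
                -1 (decreasing) or 0 (no constraint).
    Features missing from `directions` default to 0.
    """
    unknown = set(directions) - set(slot_names)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    constraints = [directions.get(name, 0) for name in slot_names]
    invalid = [c for c in constraints if c not in (-1, 0, 1)]
    if invalid:
        raise ValueError(f"constraints must be -1, 0 or 1, got {invalid}")
    return constraints


# Hypothetical feature names, for illustration only.
constraints = build_monotone_constraints(
    ["age", "income", "tenure"], {"age": 1, "tenure": -1}
)
print(constraints)
```

The resulting list has the shape that setMonotoneConstraints describes: every feature is covered, in order.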
- setMonotoneConstraintsMethod(value)[source]
- Parameters:
monotoneConstraintsMethod – Monotone constraints method. basic, intermediate, or advanced.
- setMonotonePenalty(value)[source]
- Parameters:
monotonePenalty – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
- setNumBatches(value)[source]
- Parameters:
numBatches – If greater than 0, splits data into separate batches during training
- setNumIterations(value)[source]
- Parameters:
numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
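As a small worked example of the tree count stated above, the hypothetical helper `total_trees` below multiplies the two quantities. It assumes LightGBM's convention that num_class is 1 for non-multiclass objectives such as binary.

```python
def total_trees(num_iterations, num_class=1):
    """Number of trees built: num_class * num_iterations."""
    if num_iterations < 0 or num_class < 1:
        raise ValueError("num_iterations must be >= 0 and num_class >= 1")
    return num_class * num_iterations


binary_trees = total_trees(100)        # binary objective: one tree per iteration
multiclass_trees = total_trees(100, 3) # three classes: three trees per iteration
print(binary_trees, multiclass_trees)
```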
- setNumTasks(value)[source]
- Parameters:
numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- setNumThreads(value)[source]
- Parameters:
numThreads – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.
- setObjective(value)[source]
- Parameters:
objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- setObjectiveSeed(value)[source]
- Parameters:
objectiveSeed – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.
- setOtherRate(value)[source]
- Parameters:
otherRate – The retain ratio of small gradient data. Only used in goss.
- setParallelism(value)[source]
- Parameters:
parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel
- setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, isUnbalance=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='binary', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], thresholds=None, timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]
Set the (keyword only) parameters
- setPassThroughArgs(value)[source]
- Parameters:
passThroughArgs – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
- setPredictDisableShapeCheck(value)[source]
- Parameters:
predictDisableShapeCheck – Controls whether LightGBM raises an error when you try to predict on data with a different number of features than the training data
- setProbabilityCol(value)[source]
- Parameters:
probabilityCol – Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities
- setRawPredictionCol(value)[source]
- Parameters:
rawPredictionCol – raw prediction (a.k.a. confidence) column name
- setReferenceDataset(value)[source]
- Parameters:
referenceDataset – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().
- setRepartitionByGroupingColumn(value)[source]
- Parameters:
repartitionByGroupingColumn – Repartition training data according to grouping column, on by default.
- setSamplingMode(value)[source]
- Parameters:
samplingMode – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.
- setSamplingSubsetSize(value)[source]
- Parameters:
samplingSubsetSize – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
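The difference between the ‘global’ and ‘subset’ modes can be sketched as follows. This is an illustration only, not the SynapseML implementation: the helper `choose_bin_sample` is hypothetical, and uniform sampling without replacement is an assumption. The ‘fixed’ mode is left out because the text above does not say which N it uses.

```python
import random


def choose_bin_sample(num_rows, bin_sample_count, mode="subset",
                      subset_size=1_000_000, seed=1):
    """Pick the row indices whose values would be used to define bins."""
    rng = random.Random(seed)
    if mode == "global":
        pool = range(num_rows)                    # sample from all rows
    elif mode == "subset":
        pool = range(min(subset_size, num_rows))  # sample from the first N rows
    else:
        raise ValueError(f"mode not covered by this sketch: {mode}")
    count = min(bin_sample_count, len(pool))
    return sorted(rng.sample(pool, count))        # assumed: uniform, no replacement


subset_rows = choose_bin_sample(10_000, 50, mode="subset", subset_size=200)
global_rows = choose_bin_sample(10_000, 50, mode="global")
print(max(subset_rows), max(global_rows))
```

In ‘subset’ mode every chosen index falls inside the first `subset_size` rows, which is why the description recommends it only when row order is already random.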
- setSkipDrop(value)[source]
- Parameters:
skipDrop – Probability of skipping the dropout procedure during a boosting iteration
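How dropRate, skipDrop, and maxDrop interact in dart can be sketched roughly as below. The helper `select_trees_to_drop` is hypothetical, and the logic is a simplification of the parameter descriptions rather than LightGBM's actual code. In particular, truncating to the first `max_drop` candidates and treating a non-positive `max_drop` as "no limit" are assumptions of this sketch.

```python
import random


def select_trees_to_drop(num_trees, drop_rate=0.1, skip_drop=0.5,
                         max_drop=50, seed=4):
    """Return indices of previous trees dropped in one boosting iteration."""
    rng = random.Random(seed)
    # With probability skip_drop the dropout procedure is skipped entirely.
    if rng.random() < skip_drop:
        return []
    # Otherwise each previous tree is dropped with probability drop_rate.
    dropped = [i for i in range(num_trees) if rng.random() < drop_rate]
    # Cap the number of dropped trees (simplified: keep the first max_drop).
    if max_drop > 0:
        dropped = dropped[:max_drop]
    return dropped


always_skip = select_trees_to_drop(200, skip_drop=1.0)
capped = select_trees_to_drop(200, drop_rate=1.0, skip_drop=0.0, max_drop=50)
print(len(always_skip), len(capped))
```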
- setThresholds(value)[source]
- Parameters:
thresholds – Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class’s threshold
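The p/t rule can be illustrated with a short sketch. The helper `predict_with_thresholds` is hypothetical; it requires strictly positive thresholds and leaves out the special case where one threshold is 0.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p / t."""
    if len(probabilities) != len(thresholds):
        raise ValueError("need exactly one threshold per class")
    if any(t <= 0 for t in thresholds):
        raise ValueError("this sketch only handles thresholds > 0")
    scaled = [p / t for p, t in zip(probabilities, thresholds)]
    return max(range(len(scaled)), key=scaled.__getitem__)


probs = [0.5, 0.3, 0.2]
plain = predict_with_thresholds(probs, [1.0, 1.0, 1.0])     # ratios 0.5, 0.3, 0.2
adjusted = predict_with_thresholds(probs, [1.0, 0.4, 0.5])  # ratios 0.5, 0.75, 0.4
print(plain, adjusted)
```

Lowering a class's threshold makes that class easier to predict: here class 1 wins after adjustment even though class 0 has the highest raw probability.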
- setTopK(value)[source]
- Parameters:
topK – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0
- setTopRate(value)[source]
- Parameters:
topRate – The retain ratio of large gradient data. Only used in goss.
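A rough sketch of how topRate and otherRate act together in goss (Gradient-based One-Side Sampling): the rows with the largest absolute gradients are always kept, a random share of the remaining rows is retained, and the retained small-gradient rows are up-weighted by (1 - topRate) / otherRate. The helper `goss_sample` is hypothetical and simplified; it is not LightGBM's implementation.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by one GOSS pass."""
    if other_rate <= 0:
        raise ValueError("other_rate must be > 0 in this sketch")
    n = len(gradients)
    top_count = int(n * top_rate)
    other_count = int(n * other_rate)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:top_count], order[top_count:]
    # Randomly retain a share of the small-gradient rows.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(other_count, len(rest)))
    # Up-weight the sampled rows to compensate for the discarded ones.
    small_weight = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: small_weight for i in sampled})
    return weights


gradients = [i / 100 for i in range(100)]
kept = goss_sample(gradients, top_rate=0.2, other_rate=0.1)
print(len(kept))
```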
- setUniformDrop(value)[source]
- Parameters:
uniformDrop – Set this to true to use uniform drop in dart mode
- setUseBarrierExecutionMode(value)[source]
- Parameters:
useBarrierExecutionMode – Barrier execution mode which uses a barrier stage, off by default.
- setUseMissing(value)[source]
- Parameters:
useMissing – Set this to false to disable the special handling of missing values
- setUseSingleDatasetMode(value)[source]
- Parameters:
useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.
- setValidationIndicatorCol(value)[source]
- Parameters:
validationIndicatorCol – Indicates whether the row is for training or validation
- setVerbosity(value)[source]
- Parameters:
verbosity – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
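The mapping from the integer value to a log level reads more easily as code. The helper `verbosity_label` is hypothetical and only restates the rule above.

```python
def verbosity_label(verbosity):
    """Map a verbosity value to the log level named in the description."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


labels = [verbosity_label(v) for v in (-1, 0, 1, 2)]
print(labels)
```

With the signature default of -1, only fatal messages would be logged.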
- setXGBoostDartMode(value)[source]
- Parameters:
xGBoostDartMode – Set this to true to use xgboost dart mode
- setZeroAsMissing(value)[source]
- Parameters:
zeroAsMissing – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
- skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
- slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
- thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold")
- timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
- topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0')
- topRate = Param(parent='undefined', name='topRate', doc='The retain ratio of large gradient data. Only used in goss.')
- uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
- useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
- useMissing = Param(parent='undefined', name='useMissing', doc='Set this to false to disable the special handle of missing value')
- useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.')
- validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
- verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
- xGBoostDartMode = Param(parent='undefined', name='xGBoostDartMode', doc='Set this to true to use xgboost dart mode')
- zeroAsMissing = Param(parent='undefined', name='zeroAsMissing', doc='Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values')
synapse.ml.lightgbm.LightGBMRanker module
- class synapse.ml.lightgbm.LightGBMRanker.LightGBMRanker(java_obj=None, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxPosition=20, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]
Bases:
ComplexParamsMixin
,JavaMLReadable
,JavaMLWritable
,JavaEstimator
- Parameters:
binSampleCount (int) – Number of samples considered when computing histogram bins
boostFromAverage (bool) – Adjusts initial score to the mean of labels for faster convergence
boostingType (str) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
catSmooth (float) – Reduces the effect of noise in categorical features, especially for categories with few data points
categoricalSlotIndexes (list) – List of categorical column indexes, the slot index in the features column
categoricalSlotNames (list) – List of categorical column slot names, the slot name in the features column
chunkSize (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set to the number of rows in the dataset.
dataRandomSeed (int) – Random seed for sampling data to construct histogram bins.
dataTransferMode (str) – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The signature default is streaming; bulk is the legacy mode.
defaultListenPort (int) – The default listen port on executors, used for testing
deterministic (bool) – Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true
driverListenPort (int) – The listen port on a driver. Default value is 0 (random)
dropRate (float) – Dropout rate: a fraction of previous trees to drop during the dropout
dropSeed (int) – Random seed to choose dropping models. Only used in dart.
evalAt (list) – NDCG and MAP evaluation positions, separated by comma
executionMode (str) – Deprecated. Please use dataTransferMode.
extraSeed (int) – Random seed for selecting threshold when extra_trees is true
featuresShapCol (str) – Output SHAP vector column name after prediction containing the feature contribution values
fobj (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
improvementTolerance (float) – Tolerance to consider improvement in metric
initScoreCol (str) – The name of the initial score column, used for continued training
isEnableSparse (bool) – Used to enable/disable sparse optimization
isProvideTrainingMetric (bool) – Whether to output metric results over the training dataset.
leafPredictionCol (str) – Column name for predicted leaf indices
matrixType (str) – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
maxBinByFeature (list) – Max number of bins for each feature
maxCatThreshold (int) – Limits the number of split points considered for categorical features
maxCatToOnehot (int) – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used
maxDeltaStep (float) – Used to limit the max output of tree leaves
maxDrop (int) – Max number of dropped trees during one boosting iteration
maxNumClasses (int) – Number of max classes to infer numClass in multi-class classification.
maxStreamingOMPThreads (int) – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.
metric (str) – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
microBatchSize¶ (int) – Specify how many elements are sent in a streaming micro-batch.
minDataInLeaf¶ (int) – Minimum number of data points in one leaf. Can be used to reduce overfitting.
minDataPerBin¶ (int) – Minimal number of data inside one bin
minDataPerGroup¶ (int) – minimal number of data per categorical group
minSumHessianInLeaf¶ (float) – Minimal sum hessian in one leaf
monotoneConstraints¶ (list) – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
monotoneConstraintsMethod¶ (str) – Monotone constraints method. basic, intermediate, or advanced.
monotonePenalty¶ (float) – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
numBatches¶ (int) – If greater than 0, splits data into separate batches during training
numIterations¶ (int) – Number of iterations. LightGBM constructs num_class * num_iterations trees.
numTasks¶ (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
numThreads¶ (int) – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.
objective¶ (str) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
objectiveSeed¶ (int) – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.
otherRate¶ (float) – The retain ratio of small gradient data. Only used in goss.
parallelism¶ (str) – Tree learner parallelism, can be set to data_parallel or voting_parallel
passThroughArgs¶ (str) – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
predictDisableShapeCheck¶ (bool) – Controls whether LightGBM raises an error when you predict on data with a different number of features than the training data
referenceDataset¶ (list) – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().
repartitionByGroupingColumn¶ (bool) – Repartition training data according to grouping column, on by default.
samplingMode¶ (str) – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.
samplingSubsetSize¶ (int) – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
skipDrop¶ (float) – Probability of skipping the dropout procedure during a boosting iteration
slotNames¶ (list) – List of slot names in the features column
topK¶ (int) – The top_k value used in voting parallel. A larger value gives more accurate results but slows down training. Must be greater than 0.
topRate¶ (float) – The retain ratio of large gradient data. Only used in goss.
uniformDrop¶ (bool) – Set this to true to use uniform drop in dart mode
useBarrierExecutionMode¶ (bool) – Barrier execution mode which uses a barrier stage, off by default.
useMissing¶ (bool) – Set this to false to disable the special handling of missing values
useSingleDatasetMode¶ (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.
validationIndicatorCol¶ (str) – Indicates whether the row is for training or validation
verbosity¶ (int) – Verbosity level: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug
xGBoostDartMode¶ (bool) – Set this to true to use xgboost dart mode
zeroAsMissing¶ (bool) – Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
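As a minimal sketch of how the parameters above fit together, the keyword arguments can be collected in a dictionary and passed to the constructor. The parameter names come from the constructor signature; the values are illustrative assumptions, and `build_classifier` is a hypothetical helper that is only useful where the `synapseml` package and an active Spark session are available.

```python
# Keyword arguments named as in the documented constructor signature.
# The values are illustrative assumptions, not tuning recommendations.
classifier_params = {
    "objective": "binary",
    "numIterations": 200,
    "learningRate": 0.05,
    "numLeaves": 31,
    "minDataInLeaf": 20,
    "parallelism": "data_parallel",
    "verbosity": 1,
}


def build_classifier(params):
    """Hypothetical helper: construct the estimator from a dict of params.

    The import is done lazily so that this module can be loaded on a
    machine without Spark; calling the function needs synapseml installed
    and a running Spark session.
    """
    from synapse.ml.lightgbm import LightGBMClassifier

    return LightGBMClassifier(**params)
```

On a cluster, `build_classifier(classifier_params).fit(train_df)` would then train a model, where `train_df` is an assumed Spark DataFrame with `features` and `label` columns.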
- baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
- baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
- baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
- binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
- boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
- boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
- catSmooth = Param(parent='undefined', name='catSmooth', doc='this can reduce the effect of noises in categorical features, especially for categories with few data')
- categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
- categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
- catl2 = Param(parent='undefined', name='catl2', doc='L2 regularization in categorical split')
- chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.')
- dataRandomSeed = Param(parent='undefined', name='dataRandomSeed', doc='Random seed for sampling data to construct histogram bins.')
- dataTransferMode = Param(parent='undefined', name='dataTransferMode', doc='Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The constructor default is streaming; bulk is the legacy mode.')
- defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
- deterministic = Param(parent='undefined', name='deterministic', doc='Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true')
- driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
- dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
- dropSeed = Param(parent='undefined', name='dropSeed', doc='Random seed to choose dropping models. Only used in dart.')
- earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
- evalAt = Param(parent='undefined', name='evalAt', doc='NDCG and MAP evaluation positions, separated by comma')
- executionMode = Param(parent='undefined', name='executionMode', doc='Deprecated. Please use dataTransferMode.')
- extraSeed = Param(parent='undefined', name='extraSeed', doc='Random seed for selecting threshold when extra_trees is true')
- featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
- featureFractionByNode = Param(parent='undefined', name='featureFractionByNode', doc='Feature fraction by node')
- featureFractionSeed = Param(parent='undefined', name='featureFractionSeed', doc='Feature fraction seed')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
- featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
- fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
- getBinSampleCount()[source]
- Returns:
Number of samples considered at computing histogram bins
- Return type:
binSampleCount
- getBoostFromAverage()[source]
- Returns:
Adjusts initial score to the mean of labels for faster convergence
- Return type:
boostFromAverage
- getBoostingType()[source]
- Returns:
Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- Return type:
boostingType
- getCatSmooth()[source]
- Returns:
Reduces the effect of noise in categorical features, especially for categories with few data points
- Return type:
catSmooth
- getCategoricalSlotIndexes()[source]
- Returns:
List of categorical column indexes, the slot index in the features column
- Return type:
categoricalSlotIndexes
- getCategoricalSlotNames()[source]
- Returns:
List of categorical column slot names, the slot name in the features column
- Return type:
categoricalSlotNames
- getChunkSize()[source]
- Returns:
Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.
- Return type:
chunkSize
- getDataRandomSeed()[source]
- Returns:
Random seed for sampling data to construct histogram bins.
- Return type:
dataRandomSeed
- getDataTransferMode()[source]
- Returns:
Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The constructor default is streaming; bulk is the legacy mode.
- Return type:
dataTransferMode
- getDefaultListenPort()[source]
- Returns:
The default listen port on executors, used for testing
- Return type:
defaultListenPort
- getDeterministic()[source]
- Returns:
Used only with cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true
- Return type:
deterministic
- getDriverListenPort()[source]
- Returns:
The listen port on a driver. Default value is 0 (random)
- Return type:
driverListenPort
- getDropRate()[source]
- Returns:
Dropout rate: a fraction of previous trees to drop during the dropout
- Return type:
dropRate
- getDropSeed()[source]
- Returns:
Random seed to choose dropping models. Only used in dart.
- Return type:
dropSeed
- getEvalAt()[source]
- Returns:
NDCG and MAP evaluation positions, separated by comma
- Return type:
evalAt
- getExecutionMode()[source]
- Returns:
Deprecated. Please use dataTransferMode.
- Return type:
executionMode
- getExtraSeed()[source]
- Returns:
Random seed for selecting threshold when extra_trees is true
- Return type:
extraSeed
- getFeatureFractionByNode()[source]
- Returns:
Feature fraction by node
- Return type:
featureFractionByNode
- getFeaturesShapCol()[source]
- Returns:
Output SHAP vector column name after prediction containing the feature contribution values
- Return type:
featuresShapCol
- getFobj()[source]
- Returns:
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- Return type:
fobj
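The `fobj` contract described above (two arguments in, a `(grad, hess)` pair out) can be illustrated with a binary logistic loss. This is a sketch under stated assumptions: `FakeTrainData` and its `get_label()` method are hypothetical stand-ins modeled on LightGBM's Python `Dataset`, and the source does not say what object SynapseML actually passes as `train_data`.

```python
import math


class FakeTrainData:
    """Hypothetical stand-in for the train_data argument (an assumption)."""

    def __init__(self, labels):
        self._labels = labels

    def get_label(self):
        return self._labels


def logistic_objective(preds, train_data):
    """Return per-row gradient and hessian of the binary log loss.

    preds are raw scores (before the sigmoid); labels are 0.0 or 1.0.
    """
    labels = train_data.get_label()
    grad, hess = [], []
    for raw_score, label in zip(preds, labels):
        prob = 1.0 / (1.0 + math.exp(-raw_score))
        grad.append(prob - label)          # d(loss)/d(raw_score)
        hess.append(prob * (1.0 - prob))   # d2(loss)/d(raw_score)^2
    return grad, hess


grad, hess = logistic_objective([0.0, 2.0], FakeTrainData([1.0, 0.0]))
```

For a raw score of 0.0 and a label of 1.0, the predicted probability is 0.5, so the gradient is -0.5 and the hessian is 0.25.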
- getImprovementTolerance()[source]
- Returns:
Tolerance to consider improvement in metric
- Return type:
improvementTolerance
- getInitScoreCol()[source]
- Returns:
The name of the initial score column, used for continued training
- Return type:
initScoreCol
- getIsEnableSparse()[source]
- Returns:
Used to enable/disable sparse optimization
- Return type:
isEnableSparse
- getIsProvideTrainingMetric()[source]
- Returns:
Whether to output metric results over the training dataset.
- Return type:
isProvideTrainingMetric
- getLeafPredictionCol()[source]
- Returns:
Column name for the predicted leaf indices
- Return type:
leafPredictionCol
- getMatrixType()[source]
- Returns:
Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- Return type:
matrixType
- getMaxBinByFeature()[source]
- Returns:
Max number of bins for each feature
- Return type:
maxBinByFeature
- getMaxCatThreshold()[source]
- Returns:
limit number of split points considered for categorical features
- Return type:
maxCatThreshold
- getMaxCatToOnehot()[source]
- Returns:
When the number of categories of one feature is smaller than or equal to this, the one-vs-other split algorithm will be used
- Return type:
maxCatToOnehot
- getMaxDeltaStep()[source]
- Returns:
Used to limit the max output of tree leaves
- Return type:
maxDeltaStep
- getMaxDrop()[source]
- Returns:
Max number of dropped trees during one boosting iteration
- Return type:
maxDrop
- getMaxNumClasses()[source]
- Returns:
Number of max classes to infer numClass in multi-class classification.
- Return type:
maxNumClasses
- getMaxStreamingOMPThreads()[source]
- Returns:
Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.
- Return type:
maxStreamingOMPThreads
- getMetric()[source]
- Returns:
Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- Return type:
metric
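Because the metric description lists many aliases, a small lookup table can make them easier to work with. The sketch below encodes only the aliases listed in the description above; `canonical_metric` is a hypothetical helper, not part of SynapseML or LightGBM.

```python
# Canonical metric name -> aliases, copied from the metric description.
METRIC_ALIASES = {
    "None": ["na", "null", "custom"],
    "l1": ["mean_absolute_error", "mae", "regression_l1"],
    "l2": ["mean_squared_error", "mse", "regression_l2", "regression"],
    "rmse": ["root_mean_squared_error", "l2_root"],
    "mape": ["mean_absolute_percentage_error"],
    "ndcg": ["lambdarank"],
    "map": ["mean_average_precision"],
    "binary_logloss": ["binary"],
    "multi_logloss": ["multiclass", "softmax", "multiclassova",
                      "multiclass_ova", "ova", "ovr"],
    "cross_entropy": ["xentropy"],
    "cross_entropy_lambda": ["xentlambda"],
    "kullback_leibler": ["kldiv"],
}

# Invert the table so each alias points at its canonical name.
ALIAS_TO_METRIC = {
    alias: name
    for name, aliases in METRIC_ALIASES.items()
    for alias in aliases
}


def canonical_metric(name):
    """Map an alias to its canonical metric name; pass other names through."""
    return ALIAS_TO_METRIC.get(name, name)
```

For example, `canonical_metric("mae")` gives `"l1"`, while a name with no alias entry such as `"auc"` is returned unchanged.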
- getMicroBatchSize()[source]
- Returns:
Specify how many elements are sent in a streaming micro-batch.
- Return type:
microBatchSize
- getMinDataInLeaf()[source]
- Returns:
Minimum number of data points in one leaf. Can be used to reduce overfitting.
- Return type:
minDataInLeaf
- getMinDataPerBin()[source]
- Returns:
Minimal number of data inside one bin
- Return type:
minDataPerBin
- getMinDataPerGroup()[source]
- Returns:
minimal number of data per categorical group
- Return type:
minDataPerGroup
- getMinSumHessianInLeaf()[source]
- Returns:
Minimal sum hessian in one leaf
- Return type:
minSumHessianInLeaf
- getMonotoneConstraints()[source]
- Returns:
used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
- Return type:
monotoneConstraints
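Since the constraint list must cover all features in order, it can be convenient to build it from a name-to-direction mapping. A minimal sketch follows; `monotone_constraints` and the feature names are hypothetical, and unlisted features are assumed to be unconstrained (0).

```python
def monotone_constraints(feature_names, directions):
    """Build an ordered constraint list: 1 increasing, -1 decreasing, 0 none."""
    unknown = set(directions) - set(feature_names)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    constraints = [directions.get(name, 0) for name in feature_names]
    if any(value not in (-1, 0, 1) for value in constraints):
        raise ValueError("each constraint must be -1, 0, or 1")
    return constraints


constraints = monotone_constraints(
    ["age", "income", "zip_code"],
    {"age": 1, "income": -1},
)
```

Here `constraints` is `[1, -1, 0]`, which could then be passed as the `monotoneConstraints` argument.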
- getMonotoneConstraintsMethod()[source]
- Returns:
Monotone constraints method. basic, intermediate, or advanced.
- Return type:
monotoneConstraintsMethod
- getMonotonePenalty()[source]
- Returns:
A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
- Return type:
monotonePenalty
- getNumBatches()[source]
- Returns:
If greater than 0, splits data into separate batches during training
- Return type:
numBatches
- getNumIterations()[source]
- Returns:
Number of iterations. LightGBM constructs num_class * num_iterations trees.
- Return type:
numIterations
- getNumTasks()[source]
- Returns:
Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- Return type:
numTasks
- getNumThreads()[source]
- Returns:
Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.
- Return type:
numThreads
- getObjective()[source]
- Returns:
The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- Return type:
objective
- getObjectiveSeed()[source]
- Returns:
Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.
- Return type:
objectiveSeed
- getOtherRate()[source]
- Returns:
The retain ratio of small gradient data. Only used in goss.
- Return type:
otherRate
- getParallelism()[source]
- Returns:
Tree learner parallelism, can be set to data_parallel or voting_parallel
- Return type:
parallelism
- getPassThroughArgs()[source]
- Returns:
Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
- Return type:
passThroughArgs
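A pass-through string with several parameters can be assembled from a dictionary. This sketch assumes that multiple `key=value` pairs are separated by single spaces, which the description above does not spell out; `build_pass_through_args` is a hypothetical helper.

```python
def build_pass_through_args(params):
    """Join native LightGBM parameters into one key=value string.

    Assumption: pairs are space-separated. Booleans are written in
    lowercase, matching the force_row_wise=true example in the docs.
    """
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return " ".join(parts)


args = build_pass_through_args({"force_row_wise": True, "extra_trees": True})
```

The resulting string, `force_row_wise=true extra_trees=true`, could be supplied as `passThroughArgs`. Keep in mind that, per the description, values given this way override those set through explicit setters.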
- getPredictDisableShapeCheck()[source]
- Returns:
Controls whether LightGBM raises an error when you predict on data with a different number of features than the training data
- Return type:
predictDisableShapeCheck
- getReferenceDataset()[source]
- Returns:
The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().
- Return type:
referenceDataset
- getRepartitionByGroupingColumn()[source]
- Returns:
Repartition training data according to grouping column, on by default.
- Return type:
repartitionByGroupingColumn
- getSamplingMode()[source]
- Returns:
Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: take first N rows as sample. Values can be global, subset, or fixed. Default is subset.
- Return type:
samplingMode
- getSamplingSubsetSize()[source]
- Returns:
Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
- Return type:
samplingSubsetSize
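The relationship between `samplingMode`, `samplingSubsetSize`, and `binSampleCount` can be sketched in plain Python. This is an illustration of the descriptions above, not SynapseML's implementation; in particular, treating `fixed` as "the first `bin_sample_count` rows" is an assumption, since the text does not say which N it means.

```python
import random


def sample_for_bins(rows, mode, bin_sample_count, subset_size, seed=1):
    """Pick the rows used to define histogram bins (illustrative only)."""
    rng = random.Random(seed)
    if mode == "fixed":
        # Assumption: no randomness, just the leading rows.
        return rows[:bin_sample_count]
    if mode == "subset":
        pool = rows[:subset_size]   # sample only from the first N rows
    elif mode == "global":
        pool = rows                 # sample from all rows
    else:
        raise ValueError(f"unknown sampling mode: {mode}")
    return rng.sample(pool, min(bin_sample_count, len(pool)))


rows = list(range(100))
subset_sample = sample_for_bins(rows, "subset", bin_sample_count=5, subset_size=20)
fixed_sample = sample_for_bins(rows, "fixed", bin_sample_count=5, subset_size=20)
```

With `subset`, all five sampled rows come from the first 20, which is only representative when row order is effectively random, as the description notes.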
- getSkipDrop()[source]
- Returns:
Probability of skipping the dropout procedure during a boosting iteration
- Return type:
skipDrop
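The dart-related parameters (`dropRate`, `maxDrop`, `skipDrop`) interact roughly as sketched below. This is a simplified illustration with uniform dropping, written from the parameter descriptions; `select_dropped_trees` is a hypothetical function, and LightGBM's actual procedure may differ in detail.

```python
import random


def select_dropped_trees(num_trees, drop_rate, max_drop, skip_drop, rng):
    """Choose which existing trees to drop in one boosting iteration."""
    # With probability skip_drop, skip the dropout procedure entirely.
    if rng.random() < skip_drop:
        return []
    # Otherwise each previous tree is dropped with probability drop_rate.
    dropped = [i for i in range(num_trees) if rng.random() < drop_rate]
    # Cap the number of dropped trees (assumption: a value <= 0 means no cap).
    if max_drop > 0:
        dropped = dropped[:max_drop]
    return dropped


rng = random.Random(4)
always_skip = select_dropped_trees(10, 0.1, 50, skip_drop=1.0, rng=rng)
capped = select_dropped_trees(10, 1.0, 3, skip_drop=0.0, rng=rng)
```

With `skip_drop=1.0` no tree is ever dropped, and with `drop_rate=1.0` the cap alone decides how many trees are dropped.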
- getTopK()[source]
- Returns:
The top_k value used in voting parallel. A larger value gives more accurate results but slows down training. Must be greater than 0.
- Return type:
topK
- getTopRate()[source]
- Returns:
The retain ratio of large gradient data. Only used in goss.
- Return type:
topRate
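To show how `topRate` and `otherRate` work together in goss, here is a small sketch of Gradient-based One-Side Sampling: keep the rows with the largest absolute gradients, randomly sample from the remainder, and up-weight the sampled rows by `(1 - topRate) / otherRate`. The function name and the sample gradients are hypothetical, and this is not LightGBM's own code.

```python
import random


def goss_sample(gradients, top_rate, other_rate, seed=0):
    """Return {row_index: weight} for the rows kept by GOSS."""
    if other_rate <= 0:
        raise ValueError("other_rate must be positive")
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(n * top_rate)
    other_n = int(n * other_rate)
    top_rows, rest = order[:top_n], order[top_n:]
    # Randomly keep a share of the small-gradient rows.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(other_n, len(rest)))
    # Up-weight sampled rows to compensate for the ones left out.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top_rows}
    weights.update({i: amplify for i in sampled})
    return weights


gradients = [0.9, -0.1, 0.05, -0.8, 0.2, 0.0, 0.3, -0.02, 0.01, 0.4]
weights = goss_sample(gradients, top_rate=0.2, other_rate=0.1)
```

With the default rates of 0.2 and 0.1, two of the ten rows are kept for their large gradients (rows 0 and 3) and one more is sampled with a weight of about 8.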
- getUniformDrop()[source]
- Returns:
Set this to true to use uniform drop in dart mode
- Return type:
uniformDrop
- getUseBarrierExecutionMode()[source]
- Returns:
Barrier execution mode which uses a barrier stage, off by default.
- Return type:
useBarrierExecutionMode
- getUseMissing()[source]
- Returns:
Set this to false to disable the special handling of missing values
- Return type:
useMissing
- getUseSingleDatasetMode()[source]
- Returns:
Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.
- Return type:
useSingleDatasetMode
- getValidationIndicatorCol()[source]
- Returns:
Indicates whether the row is for training or validation
- Return type:
validationIndicatorCol
- getVerbosity()[source]
- Returns:
Verbosity level: less than 0 is Fatal, 0 is Error, 1 is Info, greater than 1 is Debug
- Return type:
verbosity
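The mapping from the integer `verbosity` value to a log level, as described above, can be written out explicitly. `verbosity_level` is a hypothetical helper used only for illustration.

```python
def verbosity_level(verbosity):
    """Translate the verbosity integer into the level named in the docs."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


default_level = verbosity_level(-1)  # -1 is the constructor default
```

The constructor default of -1 therefore corresponds to the Fatal level, so only fatal messages are reported unless `verbosity` is raised.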
- getXGBoostDartMode()[source]
- Returns:
Set this to true to use xgboost dart mode
- Return type:
xGBoostDartMode
- getZeroAsMissing()[source]
- Returns:
Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
- Return type:
zeroAsMissing
- groupCol = Param(parent='undefined', name='groupCol', doc='The name of the group column')
- improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
- initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
- isEnableSparse = Param(parent='undefined', name='isEnableSparse', doc='Used to enable/disable sparse optimization')
- isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether output metric result over training dataset.')
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
- labelGain = Param(parent='undefined', name='labelGain', doc='graded relevance for each label in NDCG')
- lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
- lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
- leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices's column name")
- learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
- matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
- maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
- maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
- maxCatThreshold = Param(parent='undefined', name='maxCatThreshold', doc='limit number of split points considered for categorical features')
- maxCatToOnehot = Param(parent='undefined', name='maxCatToOnehot', doc='when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used')
- maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
- maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
- maxNumClasses = Param(parent='undefined', name='maxNumClasses', doc='Number of max classes to infer numClass in multi-class classification.')
- maxPosition = Param(parent='undefined', name='maxPosition', doc='optimized NDCG at this position')
- maxStreamingOMPThreads = Param(parent='undefined', name='maxStreamingOMPThreads', doc="Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it's best to set a fixed value.")
- metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
- microBatchSize = Param(parent='undefined', name='microBatchSize', doc='Specify how many elements are sent in a streaming micro-batch.')
- minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
- minDataPerBin = Param(parent='undefined', name='minDataPerBin', doc='Minimal number of data inside one bin')
- minDataPerGroup = Param(parent='undefined', name='minDataPerGroup', doc='minimal number of data per categorical group')
- minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
- minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
- modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
- monotoneConstraints = Param(parent='undefined', name='monotoneConstraints', doc='used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.')
- monotoneConstraintsMethod = Param(parent='undefined', name='monotoneConstraintsMethod', doc='Monotone constraints method. basic, intermediate, or advanced.')
- monotonePenalty = Param(parent='undefined', name='monotonePenalty', doc='A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.')
- negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
- numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
- numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
- numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
- numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
- numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.')
- objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
- objectiveSeed = Param(parent='undefined', name='objectiveSeed', doc='Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.')
- otherRate = Param(parent='undefined', name='otherRate', doc='The retain ratio of small gradient data. Only used in goss.')
- parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
- passThroughArgs = Param(parent='undefined', name='passThroughArgs', doc='Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true')
- posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
- predictDisableShapeCheck = Param(parent='undefined', name='predictDisableShapeCheck', doc='control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data')
- predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
- referenceDataset = Param(parent='undefined', name='referenceDataset', doc='The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().')
- repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
- samplingMode = Param(parent='undefined', name='samplingMode', doc="Data sampling for streaming mode. Sampled data is used to define bins. 'global': sample from all data, 'subset': sample from first N rows, or 'fixed': take first N rows as sample. Values can be global, subset, or fixed. Default is subset.")
- samplingSubsetSize = Param(parent='undefined', name='samplingSubsetSize', doc="Specify subset size N for the sampling mode 'subset'. 'binSampleCount' rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.")
- seed = Param(parent='undefined', name='seed', doc='Main seed, used to generate other seeds')
- setBinSampleCount(value)[source]
- Parameters:
binSampleCount¶ – Number of samples considered at computing histogram bins
- setBoostFromAverage(value)[source]
- Parameters:
boostFromAverage¶ – Adjusts initial score to the mean of labels for faster convergence
- setBoostingType(value)[source]
- Parameters:
boostingType¶ – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- setCatSmooth(value)[source]
- Parameters:
catSmooth¶ – Reduces the effect of noise in categorical features, especially for categories with few data points
- setCategoricalSlotIndexes(value)[source]
- Parameters:
categoricalSlotIndexes¶ – List of categorical column indexes, the slot index in the features column
- setCategoricalSlotNames(value)[source]
- Parameters:
categoricalSlotNames¶ – List of categorical column slot names, the slot name in the features column
- setChunkSize(value)[source]
- Parameters:
chunkSize¶ – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset.
- setDataRandomSeed(value)[source]
- Parameters:
dataRandomSeed¶ – Random seed for sampling data to construct histogram bins.
- setDataTransferMode(value)[source]
- Parameters:
dataTransferMode¶ – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The constructor default is streaming; bulk is the legacy mode.
- setDefaultListenPort(value)[source]
- Parameters:
defaultListenPort¶ – The default listen port on executors, used for testing
- setDeterministic(value)[source]
- Parameters:
deterministic¶ – Used only with the CPU device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true
- setDriverListenPort(value)[source]
- Parameters:
driverListenPort¶ – The listen port on a driver. Default value is 0 (random)
- setDropRate(value)[source]
- Parameters:
dropRate¶ – Dropout rate: a fraction of previous trees to drop during the dropout
- setDropSeed(value)[source]
- Parameters:
dropSeed¶ – Random seed to choose dropping models. Only used in dart.
- setEvalAt(value)[source]
- Parameters:
evalAt¶ – NDCG and MAP evaluation positions, separated by comma
- setExecutionMode(value)[source]
- Parameters:
executionMode¶ – Deprecated. Please use dataTransferMode.
- setExtraSeed(value)[source]
- Parameters:
extraSeed¶ – Random seed for selecting threshold when extra_trees is true
- setFeatureFractionByNode(value)[source]
- Parameters:
featureFractionByNode¶ – Feature fraction by node
- setFeaturesShapCol(value)[source]
- Parameters:
featuresShapCol¶ – Output SHAP vector column name after prediction containing the feature contribution values
- setFobj(value)[source]
- Parameters:
fobj¶ – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
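The fobj contract described above (two arguments in, a (grad, hess) pair out) can be sketched in plain Python. This is only an illustration of the shape of such a function for a squared-error loss: FakeTrainData and its labels attribute are hypothetical stand-ins, since the actual type SynapseML passes as train_data is not stated here.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FakeTrainData:
    """Hypothetical stand-in for the train_data argument; only labels are modeled."""
    labels: Sequence[float]


def squared_error_objective(
    preds: Sequence[float], train_data: FakeTrainData
) -> Tuple[List[float], List[float]]:
    """Return (grad, hess) of 0.5 * (pred - label) ** 2 for each row."""
    # First derivative with respect to the prediction.
    grad = [p - y for p, y in zip(preds, train_data.labels)]
    # Second derivative is constant for squared error.
    hess = [1.0 for _ in preds]
    return grad, hess


grad, hess = squared_error_objective([0.5, 2.0], FakeTrainData(labels=[1.0, 1.5]))
```

Both returned lists have one entry per row, in the same order as preds.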
- setImprovementTolerance(value)[source]
- Parameters:
improvementTolerance¶ – Tolerance to consider improvement in metric
- setInitScoreCol(value)[source]
- Parameters:
initScoreCol¶ – The name of the initial score column, used for continued training
- setIsEnableSparse(value)[source]
- Parameters:
isEnableSparse¶ – Used to enable/disable sparse optimization
- setIsProvideTrainingMetric(value)[source]
- Parameters:
isProvideTrainingMetric¶ – Whether to output metric results over the training dataset.
- setLeafPredictionCol(value)[source]
- Parameters:
leafPredictionCol¶ – Column name for the predicted leaf indices
- setMatrixType(value)[source]
- Parameters:
matrixType¶ – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- setMaxBinByFeature(value)[source]
- Parameters:
maxBinByFeature¶ – Max number of bins for each feature
- setMaxCatThreshold(value)[source]
- Parameters:
maxCatThreshold¶ – limit number of split points considered for categorical features
- setMaxCatToOnehot(value)[source]
- Parameters:
maxCatToOnehot¶ – when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used
- setMaxDeltaStep(value)[source]
- Parameters:
maxDeltaStep¶ – Used to limit the max output of tree leaves
- setMaxDrop(value)[source]
- Parameters:
maxDrop¶ – Max number of dropped trees during one boosting iteration
- setMaxNumClasses(value)[source]
- Parameters:
maxNumClasses¶ – Number of max classes to infer numClass in multi-class classification.
- setMaxStreamingOMPThreads(value)[source]
- Parameters:
maxStreamingOMPThreads¶ – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.
- setMetric(value)[source]
- Parameters:
metric¶ – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- setMicroBatchSize(value)[source]
- Parameters:
microBatchSize¶ – Specify how many elements are sent in a streaming micro-batch.
- setMinDataInLeaf(value)[source]
- Parameters:
minDataInLeaf¶ – Minimal number of data in one leaf. Can be used to deal with over-fitting.
- setMinDataPerGroup(value)[source]
- Parameters:
minDataPerGroup¶ – minimal number of data per categorical group
- setMinSumHessianInLeaf(value)[source]
- Parameters:
minSumHessianInLeaf¶ – Minimal sum hessian in one leaf
- setMonotoneConstraints(value)[source]
- Parameters:
monotoneConstraints¶ – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
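Because the constraint list must cover all features in order, it can help to build it from a name-to-direction mapping. The helper below is a minimal sketch; build_monotone_constraints and the feature names are hypothetical and not part of SynapseML.

```python
from typing import Dict, List


def build_monotone_constraints(
    feature_order: List[str], directions: Dict[str, int]
) -> List[int]:
    """Return one constraint per feature, in feature order; unlisted features get 0."""
    for name, value in directions.items():
        if name not in feature_order:
            raise KeyError(f"unknown feature: {name}")
        if value not in (-1, 0, 1):
            raise ValueError(f"constraint for {name} must be -1, 0, or 1")
    return [directions.get(name, 0) for name in feature_order]


# income should push predictions up, tenure down, age is unconstrained.
constraints = build_monotone_constraints(
    ["age", "income", "tenure"], {"income": 1, "tenure": -1}
)
```

The resulting list is the kind of value setMonotoneConstraints(value) expects: one entry per slot of the features column.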
- setMonotoneConstraintsMethod(value)[source]
- Parameters:
monotoneConstraintsMethod¶ – Monotone constraints method. basic, intermediate, or advanced.
- setMonotonePenalty(value)[source]
- Parameters:
monotonePenalty¶ – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
- setNumBatches(value)[source]
- Parameters:
numBatches¶ – If greater than 0, splits data into separate batches during training
- setNumIterations(value)[source]
- Parameters:
numIterations¶ – Number of iterations, LightGBM constructs num_class * num_iterations trees
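The tree count stated above is simple arithmetic. The sketch below assumes num_class is 1 for single-output objectives such as ranking, regression, or binary classification; total_trees is a hypothetical helper name.

```python
def total_trees(num_iterations: int, num_class: int = 1) -> int:
    """Trees built overall: one tree per class per boosting iteration."""
    if num_iterations < 0 or num_class < 1:
        raise ValueError("num_iterations must be >= 0 and num_class >= 1")
    return num_class * num_iterations


single_output = total_trees(100)           # assumed num_class = 1
three_class = total_trees(100, num_class=3)
```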
- setNumTasks(value)[source]
- Parameters:
numTasks¶ – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- setNumThreads(value)[source]
- Parameters:
numThreads¶ – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.
- setObjective(value)[source]
- Parameters:
objective¶ – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- setObjectiveSeed(value)[source]
- Parameters:
objectiveSeed¶ – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.
- setOtherRate(value)[source]
- Parameters:
otherRate¶ – The retain ratio of small gradient data. Only used in goss.
- setParallelism(value)[source]
- Parameters:
parallelism¶ – Tree learner parallelism, can be set to data_parallel or voting_parallel
- setParams(baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, evalAt=[1, 2, 3, 4, 5], executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, groupCol=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', labelGain=[], lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxNumClasses=100, maxPosition=20, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='lambdarank', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=-1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]
Set the (keyword only) parameters
- setPassThroughArgs(value)[source]
- Parameters:
passThroughArgs¶ – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
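A pass-through string with several parameters can be assembled from a dictionary. The sketch below assumes that multiple key=value pairs are separated by single spaces and that booleans are written in lowercase, as in the force_row_wise=true example above; the separator is an assumption, and format_pass_through_args is a hypothetical helper.

```python
from typing import Dict, Union


def format_pass_through_args(params: Dict[str, Union[bool, int, float, str]]) -> str:
    """Join native LightGBM parameters into one 'key=value key=value' string."""
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Write booleans as lowercase true/false, matching the documented example.
            text = "true" if value else "false"
        else:
            text = str(value)
        parts.append(f"{key}={text}")
    # Assumption: pairs are space-separated.
    return " ".join(parts)


args = format_pass_through_args({"force_row_wise": True, "extra_trees": True})
```

Since pass-through values override explicit setters, it is safest to keep in this string only parameters that have no dedicated setter.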
- setPredictDisableShapeCheck(value)[source]
- Parameters:
predictDisableShapeCheck¶ – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data
- setReferenceDataset(value)[source]
- Parameters:
referenceDataset¶ – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().
- setRepartitionByGroupingColumn(value)[source]
- Parameters:
repartitionByGroupingColumn¶ – Repartition training data according to grouping column, on by default.
- setSamplingMode(value)[source]
- Parameters:
samplingMode¶ – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data; ‘subset’: sample from the first N rows; ‘fixed’: take the first N rows as the sample. Values can be global, subset, or fixed. Default is subset.
- setSamplingSubsetSize(value)[source]
- Parameters:
samplingSubsetSize¶ – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
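The relationship between samplingSubsetSize (N) and binSampleCount can be pictured with a small pure-Python sketch. It only mirrors the description above and is not the library's sampling code; sample_from_subset is a hypothetical name.

```python
import random
from typing import List, Sequence


def sample_from_subset(
    rows: Sequence, subset_size: int, bin_sample_count: int, seed: int = 1
) -> List:
    """Pick up to bin_sample_count rows at random from the first subset_size rows."""
    head = list(rows[:subset_size])
    # Never ask for more rows than the subset actually holds.
    k = min(bin_sample_count, len(head))
    return random.Random(seed).sample(head, k)


sample = sample_from_subset(range(1000), subset_size=100, bin_sample_count=10)
```

Rows beyond the first N never enter the sample, which is why the description recommends this mode only when row order is effectively random.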
- setSkipDrop(value)[source]
- Parameters:
skipDrop¶ – Probability of skipping the dropout procedure during a boosting iteration
- setTopK(value)[source]
- Parameters:
topK¶ – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. Must be greater than 0
- setTopRate(value)[source]
- Parameters:
topRate¶ – The retain ratio of large gradient data. Only used in goss.
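To see how topRate and otherRate interact, the sketch below computes how many rows GOSS would keep per iteration. It assumes the scheme as commonly described for LightGBM: both ratios are fractions of the total row count, and sampled small-gradient rows are up-weighted by (rows - top rows) / other rows. goss_counts is a hypothetical helper and does not call the library.

```python
from typing import Tuple


def goss_counts(
    num_rows: int, top_rate: float = 0.2, other_rate: float = 0.1
) -> Tuple[int, int, float]:
    """Return (large-gradient rows kept, small-gradient rows sampled, up-weight factor)."""
    if top_rate < 0 or other_rate < 0 or top_rate + other_rate > 1.0:
        raise ValueError("rates must be non-negative and sum to at most 1")
    top_k = int(num_rows * top_rate)
    other_k = int(num_rows * other_rate)
    # Assumed compensation factor for the sampled small-gradient rows.
    weight = (num_rows - top_k) / other_k if other_k > 0 else 0.0
    return top_k, other_k, weight


top_k, other_k, weight = goss_counts(1000)
```

With the defaults shown in the signature (topRate=0.2, otherRate=0.1), 30% of the rows are used per iteration under these assumptions.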
- setUniformDrop(value)[source]
- Parameters:
uniformDrop¶ – Set this to true to use uniform drop in dart mode
- setUseBarrierExecutionMode(value)[source]
- Parameters:
useBarrierExecutionMode¶ – Barrier execution mode which uses a barrier stage, off by default.
- setUseMissing(value)[source]
- Parameters:
useMissing¶ – Set this to false to disable the special handling of missing values
- setUseSingleDatasetMode(value)[source]
- Parameters:
useSingleDatasetMode¶ – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.
- setValidationIndicatorCol(value)[source]
- Parameters:
validationIndicatorCol¶ – Indicates whether the row is for training or validation
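A validation indicator column lets a single dataset carry both training and validation rows. The sketch below shows the idea on plain dictionaries, assuming a true value marks a validation row; the column name is_validation and the helper split_by_indicator are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_indicator(rows: List[Row], indicator_col: str) -> Tuple[List[Row], List[Row]]:
    """Split rows into (training, validation) using a boolean indicator column."""
    # Assumption: a truthy indicator marks a validation row.
    train = [r for r in rows if not r[indicator_col]]
    valid = [r for r in rows if r[indicator_col]]
    return train, valid


rows = [
    {"label": 1.0, "is_validation": False},
    {"label": 0.0, "is_validation": True},
    {"label": 2.0, "is_validation": False},
]
train_rows, valid_rows = split_by_indicator(rows, "is_validation")
```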
- setVerbosity(value)[source]
- Parameters:
verbosity¶ – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
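The verbosity thresholds above translate directly into a lookup. The function below is a small sketch that restates the documented mapping; verbosity_level is a hypothetical name.

```python
def verbosity_level(verbosity: int) -> str:
    """Map a verbosity value to the log level named in the parameter description."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


levels = [verbosity_level(v) for v in (-1, 0, 1, 2)]
```

The default of -1 in the signature therefore corresponds to Fatal-only logging.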
- setXGBoostDartMode(value)[source]
- Parameters:
xGBoostDartMode¶ – Set this to true to use xgboost dart mode
- setZeroAsMissing(value)[source]
- Parameters:
zeroAsMissing¶ – Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
- skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
- slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
- timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
- topK = Param(parent='undefined', name='topK', doc='The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. Must be greater than 0')
- topRate = Param(parent='undefined', name='topRate', doc='The retain ratio of large gradient data. Only used in goss.')
- uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
- useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
- useMissing = Param(parent='undefined', name='useMissing', doc='Set this to false to disable the special handling of missing values')
- useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.')
- validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
- verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
- xGBoostDartMode = Param(parent='undefined', name='xGBoostDartMode', doc='Set this to true to use xgboost dart mode')
- zeroAsMissing = Param(parent='undefined', name='zeroAsMissing', doc='Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values')
synapse.ml.lightgbm.LightGBMRankerModel module
- class synapse.ml.lightgbm.LightGBMRankerModel.LightGBMRankerModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictDisableShapeCheck=False, predictionCol='prediction', startIteration=0)[source]
Bases: LightGBMModelMixin, _LightGBMRankerModel
synapse.ml.lightgbm.LightGBMRegressionModel module
- class synapse.ml.lightgbm.LightGBMRegressionModel.LightGBMRegressionModel(java_obj=None, featuresCol='features', featuresShapCol='', labelCol='label', leafPredictionCol='', lightGBMBooster=None, numIterations=-1, predictDisableShapeCheck=False, predictionCol='prediction', startIteration=0)[source]
Bases: LightGBMModelMixin, _LightGBMRegressionModel
synapse.ml.lightgbm.LightGBMRegressor module
- class synapse.ml.lightgbm.LightGBMRegressor.LightGBMRegressor(java_obj=None, alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=-1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=-1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]
Bases: ComplexParamsMixin, JavaMLReadable, JavaMLWritable, JavaEstimator
- Parameters:
alpha¶ (float) – parameter for Huber loss and Quantile regression
binSampleCount¶ (int) – Number of samples considered at computing histogram bins
boostFromAverage¶ (bool) – Adjusts initial score to the mean of labels for faster convergence
boostingType¶ (str) – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
catSmooth¶ (float) – Reduces the effect of noise in categorical features, especially for categories with few data points
categoricalSlotIndexes¶ (list) – List of categorical column indexes, the slot index in the features column
categoricalSlotNames¶ (list) – List of categorical column slot names, the slot name in the features column
chunkSize¶ (int) – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
dataRandomSeed¶ (int) – Random seed for sampling data to construct histogram bins.
dataTransferMode¶ (str) – Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The default in the signature is streaming; bulk is the legacy mode.
defaultListenPort¶ (int) – The default listen port on executors, used for testing
deterministic¶ (bool) – Used only with the CPU device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true
driverListenPort¶ (int) – The listen port on a driver. Default value is 0 (random)
dropRate¶ (float) – Dropout rate: a fraction of previous trees to drop during the dropout
dropSeed¶ (int) – Random seed to choose dropping models. Only used in dart.
executionMode¶ (str) – Deprecated. Please use dataTransferMode.
extraSeed¶ (int) – Random seed for selecting threshold when extra_trees is true
featuresShapCol¶ (str) – Output SHAP vector column name after prediction containing the feature contribution values
fobj¶ (object) – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
improvementTolerance¶ (float) – Tolerance to consider improvement in metric
initScoreCol¶ (str) – The name of the initial score column, used for continued training
isEnableSparse¶ (bool) – Used to enable/disable sparse optimization
isProvideTrainingMetric¶ (bool) – Whether to output metric results over the training dataset.
leafPredictionCol¶ (str) – Column name for the predicted leaf indices
matrixType¶ (str) – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
maxBinByFeature¶ (list) – Max number of bins for each feature
maxCatThreshold¶ (int) – limit number of split points considered for categorical features
maxCatToOnehot¶ (int) – when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used
maxDeltaStep¶ (float) – Used to limit the max output of tree leaves
maxDrop¶ (int) – Max number of dropped trees during one boosting iteration
maxNumClasses¶ (int) – Number of max classes to infer numClass in multi-class classification.
maxStreamingOMPThreads¶ (int) – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.
metric¶ (str) – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
microBatchSize¶ (int) – Specify how many elements are sent in a streaming micro-batch.
minDataInLeaf¶ (int) – Minimal number of data in one leaf. Can be used to deal with over-fitting.
minDataPerBin¶ (int) – Minimal number of data inside one bin
minDataPerGroup¶ (int) – minimal number of data per categorical group
minSumHessianInLeaf¶ (float) – Minimal sum hessian in one leaf
monotoneConstraints¶ (list) – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
monotoneConstraintsMethod¶ (str) – Monotone constraints method. basic, intermediate, or advanced.
monotonePenalty¶ (float) – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
numBatches¶ (int) – If greater than 0, splits data into separate batches during training
numIterations¶ (int) – Number of iterations, LightGBM constructs num_class * num_iterations trees
numTasks¶ (int) – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
numThreads¶ (int) – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.
objective¶ (str) – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
objectiveSeed¶ (int) – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.
otherRate¶ (float) – The retain ratio of small gradient data. Only used in goss.
parallelism¶ (str) – Tree learner parallelism, can be set to data_parallel or voting_parallel
passThroughArgs¶ (str) – Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
predictDisableShapeCheck¶ (bool) – control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data
referenceDataset¶ (list) – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().
repartitionByGroupingColumn¶ (bool) – Repartition training data according to grouping column, on by default.
samplingMode¶ (str) – Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data; ‘subset’: sample from the first N rows; ‘fixed’: take the first N rows as the sample. Values can be global, subset, or fixed. Default is subset.
samplingSubsetSize¶ (int) – Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
skipDrop¶ (float) – Probability of skipping the dropout procedure during a boosting iteration
slotNames¶ (list) – List of slot names in the features column
topK¶ (int) – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. Must be greater than 0
topRate¶ (float) – The retain ratio of large gradient data. Only used in goss.
tweedieVariancePower¶ (float) – control the variance of tweedie distribution, must be between 1 and 2
uniformDrop¶ (bool) – Set this to true to use uniform drop in dart mode
useBarrierExecutionMode¶ (bool) – Barrier execution mode which uses a barrier stage, off by default.
useMissing¶ (bool) – Set this to false to disable the special handling of missing values
useSingleDatasetMode¶ (bool) – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.
validationIndicatorCol¶ (str) – Indicates whether the row is for training or validation
verbosity¶ (int) – Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
xGBoostDartMode¶ (bool) – Set this to true to use xgboost dart mode
zeroAsMissing¶ (bool) – Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
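Some of the regressor parameters above carry constraints that can be checked before an estimator is built. The sketch below is a hypothetical pre-flight check, not part of SynapseML: it accepts the regression objectives listed for objective plus the signature default 'regression', and it treats the tweedieVariancePower bounds as inclusive, which the description does not specify.

```python
from typing import Dict, List

# Objectives named in the objective description, plus the signature default.
REGRESSION_OBJECTIVES = {
    "regression", "regression_l2", "regression_l1", "huber", "fair",
    "poisson", "quantile", "mape", "gamma", "tweedie",
}


def check_regressor_params(params: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    objective = params.get("objective", "regression")
    if objective not in REGRESSION_OBJECTIVES:
        problems.append(f"objective {objective!r} is not a listed regression objective")
    power = params.get("tweedieVariancePower", 1.5)
    # Assumption: both endpoints are allowed.
    if not 1.0 <= float(power) <= 2.0:
        problems.append("tweedieVariancePower must be between 1 and 2")
    return problems


ok = check_regressor_params({"objective": "tweedie", "tweedieVariancePower": 1.2})
bad = check_regressor_params({"objective": "binary", "tweedieVariancePower": 3.0})
```

A dictionary that passes the check uses the same keyword names as the constructor signature, so it could then be supplied as keyword arguments.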
- alpha = Param(parent='undefined', name='alpha', doc='parameter for Huber loss and Quantile regression')
- baggingFraction = Param(parent='undefined', name='baggingFraction', doc='Bagging fraction')
- baggingFreq = Param(parent='undefined', name='baggingFreq', doc='Bagging frequency')
- baggingSeed = Param(parent='undefined', name='baggingSeed', doc='Bagging seed')
- binSampleCount = Param(parent='undefined', name='binSampleCount', doc='Number of samples considered at computing histogram bins')
- boostFromAverage = Param(parent='undefined', name='boostFromAverage', doc='Adjusts initial score to the mean of labels for faster convergence')
- boostingType = Param(parent='undefined', name='boostingType', doc='Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling). ')
- catSmooth = Param(parent='undefined', name='catSmooth', doc='Reduces the effect of noise in categorical features, especially for categories with few data points')
- categoricalSlotIndexes = Param(parent='undefined', name='categoricalSlotIndexes', doc='List of categorical column indexes, the slot index in the features column')
- categoricalSlotNames = Param(parent='undefined', name='categoricalSlotNames', doc='List of categorical column slot names, the slot name in the features column')
- catl2 = Param(parent='undefined', name='catl2', doc='L2 regularization in categorical split')
- chunkSize = Param(parent='undefined', name='chunkSize', doc='Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.')
- dataRandomSeed = Param(parent='undefined', name='dataRandomSeed', doc='Random seed for sampling data to construct histogram bins.')
- dataTransferMode = Param(parent='undefined', name='dataTransferMode', doc='Specify how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk. The default in the signature is streaming; bulk is the legacy mode.')
- defaultListenPort = Param(parent='undefined', name='defaultListenPort', doc='The default listen port on executors, used for testing')
- deterministic = Param(parent='undefined', name='deterministic', doc='Used only with the CPU device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, please set force_col_wise=true or force_row_wise=true when setting deterministic=true')
- driverListenPort = Param(parent='undefined', name='driverListenPort', doc='The listen port on a driver. Default value is 0 (random)')
- dropRate = Param(parent='undefined', name='dropRate', doc='Dropout rate: a fraction of previous trees to drop during the dropout')
- dropSeed = Param(parent='undefined', name='dropSeed', doc='Random seed to choose dropping models. Only used in dart.')
- earlyStoppingRound = Param(parent='undefined', name='earlyStoppingRound', doc='Early stopping round')
- executionMode = Param(parent='undefined', name='executionMode', doc='Deprecated. Please use dataTransferMode.')
- extraSeed = Param(parent='undefined', name='extraSeed', doc='Random seed for selecting threshold when extra_trees is true')
- featureFraction = Param(parent='undefined', name='featureFraction', doc='Feature fraction')
- featureFractionByNode = Param(parent='undefined', name='featureFractionByNode', doc='Feature fraction by node')
- featureFractionSeed = Param(parent='undefined', name='featureFractionSeed', doc='Feature fraction seed')
- featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name')
- featuresShapCol = Param(parent='undefined', name='featuresShapCol', doc='Output SHAP vector column name after prediction containing the feature contribution values')
- fobj = Param(parent='undefined', name='fobj', doc='Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).')
- getBinSampleCount()[source]
- Returns:
Number of samples considered at computing histogram bins
- Return type:
binSampleCount
- getBoostFromAverage()[source]
- Returns:
Adjusts initial score to the mean of labels for faster convergence
- Return type:
boostFromAverage
- getBoostingType()[source]
- Returns:
Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- Return type:
boostingType
- getCatSmooth()[source]
- Returns:
Reduces the effect of noise in categorical features, especially for categories with few data points
- Return type:
catSmooth
- getCategoricalSlotIndexes()[source]
- Returns:
List of categorical column indexes, the slot index in the features column
- Return type:
categoricalSlotIndexes
- getCategoricalSlotNames()[source]
- Returns:
List of categorical column slot names, the slot name in the features column
- Return type:
categoricalSlotNames
- getChunkSize()[source]
- Returns:
Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- Return type:
chunkSize
- getDataRandomSeed()[source]
- Returns:
Random seed for sampling data to construct histogram bins.
- Return type:
dataRandomSeed
- getDataTransferMode()[source]
- Returns:
Specifies how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk (the legacy mode). The signatures in this reference show dataTransferMode='streaming' as the default.
- Return type:
dataTransferMode
- getDefaultListenPort()[source]
- Returns:
The default listen port on executors, used for testing
- Return type:
defaultListenPort
- getDeterministic()[source]
- Returns:
Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true
- Return type:
deterministic
- getDriverListenPort()[source]
- Returns:
The listen port on a driver. Default value is 0 (random)
- Return type:
driverListenPort
- getDropRate()[source]
- Returns:
Dropout rate: a fraction of previous trees to drop during the dropout
- Return type:
dropRate
- getDropSeed()[source]
- Returns:
Random seed to choose dropping models. Only used in dart.
- Return type:
dropSeed
- getExecutionMode()[source]
- Returns:
Deprecated. Please use dataTransferMode.
- Return type:
executionMode
- getExtraSeed()[source]
- Returns:
Random seed for selecting threshold when extra_trees is true
- Return type:
extraSeed
- getFeatureFractionByNode()[source]
- Returns:
Feature fraction by node
- Return type:
featureFractionByNode
- getFeaturesShapCol()[source]
- Returns:
Output SHAP vector column name after prediction containing the feature contribution values
- Return type:
featuresShapCol
- getFobj()[source]
- Returns:
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- Return type:
fobj
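The description above only fixes the shape of a custom objective: it takes `preds` and `train_data` and returns `(grad, hess)`. As a minimal sketch of that shape, here is a squared-error objective in plain Python. `FakeDataset` and its `get_label()` method are hypothetical stand-ins for whatever object is passed as `train_data`; this reference does not say how SynapseML invokes the callable, so treat the example as an illustration of the contract only.

```python
from typing import List, Sequence, Tuple


class FakeDataset:
    """Hypothetical stand-in for the train_data argument (an assumption)."""

    def __init__(self, labels: Sequence[float]) -> None:
        self._labels = list(labels)

    def get_label(self) -> List[float]:
        return self._labels


def squared_error_objective(
    preds: Sequence[float], train_data: FakeDataset
) -> Tuple[List[float], List[float]]:
    """Return per-row gradient and Hessian of 0.5 * (pred - label) ** 2."""
    labels = train_data.get_label()
    grad = [p - y for p, y in zip(preds, labels)]  # first derivative w.r.t. pred
    hess = [1.0 for _ in preds]                    # second derivative is constant
    return grad, hess


grad, hess = squared_error_objective([0.5, 2.0], FakeDataset([1.0, 1.0]))
```

Both returned lists have one entry per prediction, which is what a gradient-boosting step needs to fit the next tree.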
- getImprovementTolerance()[source]
- Returns:
Tolerance to consider improvement in metric
- Return type:
improvementTolerance
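One plausible reading of `improvementTolerance` together with `earlyStoppingRound` is sketched below: a validation score counts as an improvement only if it beats the best score so far by more than the tolerance, and training stops once that has not happened for the given number of rounds. The exact rule the library applies is not spelled out in this reference, so the function name, the lower-is-better convention, and treating `0` rounds as "disabled" are all assumptions.

```python
from typing import Sequence


def early_stopping_iteration(
    scores: Sequence[float],
    early_stopping_round: int,
    improvement_tolerance: float = 0.0,
) -> int:
    """Return the iteration index at which training would stop.

    Assumes a lower-is-better metric. If stopping never triggers,
    the index of the last iteration is returned.
    """
    best = float("inf")
    best_iter = 0
    for i, score in enumerate(scores):
        if best - score > improvement_tolerance:
            # Improvement is large enough to reset the patience counter.
            best, best_iter = score, i
        elif early_stopping_round > 0 and i - best_iter >= early_stopping_round:
            return i
    return len(scores) - 1


validation_scores = [1.0, 0.8, 0.79, 0.785, 0.784, 0.7]
stop_strict = early_stopping_iteration(validation_scores, 2, improvement_tolerance=0.05)
stop_loose = early_stopping_iteration(validation_scores, 2, improvement_tolerance=0.0)
```

With a tolerance of 0.05 the small gains after the second iteration do not count, so the sketch stops early; with a tolerance of 0 every gain counts and it runs to the end.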
- getInitScoreCol()[source]
- Returns:
The name of the initial score column, used for continued training
- Return type:
initScoreCol
- getIsEnableSparse()[source]
- Returns:
Used to enable/disable sparse optimization
- Return type:
isEnableSparse
- getIsProvideTrainingMetric()[source]
- Returns:
Whether to output the metric result over the training dataset.
- Return type:
isProvideTrainingMetric
- getLeafPredictionCol()[source]
- Returns:
Name of the column for predicted leaf indices
- Return type:
leafPredictionCol
- getMatrixType()[source]
- Returns:
Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- Return type:
matrixType
- getMaxBinByFeature()[source]
- Returns:
Max number of bins for each feature
- Return type:
maxBinByFeature
- getMaxCatThreshold()[source]
- Returns:
limit number of split points considered for categorical features
- Return type:
maxCatThreshold
- getMaxCatToOnehot()[source]
- Returns:
When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used
- Return type:
maxCatToOnehot
- getMaxDeltaStep()[source]
- Returns:
Used to limit the max output of tree leaves
- Return type:
maxDeltaStep
- getMaxDrop()[source]
- Returns:
Max number of dropped trees during one boosting iteration
- Return type:
maxDrop
- getMaxNumClasses()[source]
- Returns:
Number of max classes to infer numClass in multi-class classification.
- Return type:
maxNumClasses
- getMaxStreamingOMPThreads()[source]
- Returns:
Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.
- Return type:
maxStreamingOMPThreads
- getMetric()[source]
- Returns:
Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- Return type:
metric
- getMicroBatchSize()[source]
- Returns:
Specify how many elements are sent in a streaming micro-batch.
- Return type:
microBatchSize
- getMinDataInLeaf()[source]
- Returns:
Minimal number of data in one leaf. Can be used to deal with over-fitting.
- Return type:
minDataInLeaf
- getMinDataPerBin()[source]
- Returns:
Minimal number of data inside one bin
- Return type:
minDataPerBin
- getMinDataPerGroup()[source]
- Returns:
minimal number of data per categorical group
- Return type:
minDataPerGroup
- getMinSumHessianInLeaf()[source]
- Returns:
Minimal sum hessian in one leaf
- Return type:
minSumHessianInLeaf
- getMonotoneConstraints()[source]
- Returns:
used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
- Return type:
monotoneConstraints
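Because the constraint list must cover all features in order, it can be easier to derive it from the feature names than to write it by hand. The helper below is a small sketch of that idea; `build_monotone_constraints` and the feature names are hypothetical and not part of the library.

```python
from typing import Dict, List


def build_monotone_constraints(
    feature_names: List[str], directions: Dict[str, int]
) -> List[int]:
    """Return one constraint per feature, in feature order.

    1 means increasing, -1 means decreasing, 0 means unconstrained.
    Features missing from `directions` default to 0.
    """
    for name, direction in directions.items():
        if name not in feature_names:
            raise ValueError(f"unknown feature: {name}")
        if direction not in (-1, 0, 1):
            raise ValueError(f"invalid direction for {name}: {direction}")
    return [directions.get(name, 0) for name in feature_names]


constraints = build_monotone_constraints(
    ["age", "income", "zip_code"], {"age": 1, "income": -1}
)
```

The resulting list is the kind of value that would then be passed to `setMonotoneConstraints`.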
- getMonotoneConstraintsMethod()[source]
- Returns:
Monotone constraints method. basic, intermediate, or advanced.
- Return type:
monotoneConstraintsMethod
- getMonotonePenalty()[source]
- Returns:
A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
- Return type:
monotonePenalty
- getNumBatches()[source]
- Returns:
If greater than 0, splits data into separate batches during training
- Return type:
numBatches
- getNumIterations()[source]
- Returns:
Number of iterations, LightGBM constructs num_class * num_iterations trees
- Return type:
numIterations
- getNumTasks()[source]
- Returns:
Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- Return type:
numTasks
- getNumThreads()[source]
- Returns:
Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.
- Return type:
numThreads
- getObjective()[source]
- Returns:
The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- Return type:
objective
- getObjectiveSeed()[source]
- Returns:
Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.
- Return type:
objectiveSeed
- getOtherRate()[source]
- Returns:
The retain ratio of small gradient data. Only used in goss.
- Return type:
otherRate
- getParallelism()[source]
- Returns:
Tree learner parallelism, can be set to data_parallel or voting_parallel
- Return type:
parallelism
- getPassThroughArgs()[source]
- Returns:
Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true
- Return type:
passThroughArgs
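Since `passThroughArgs` is a single string that can carry several native parameters, building it from a mapping avoids formatting slips. The sketch below assumes the pairs are separated by spaces and that booleans are written as `true`/`false`, matching the `force_row_wise=true` example above; the separator is an assumption, and `format_pass_through_args` is a hypothetical helper.

```python
from typing import Mapping, Union

ParamValue = Union[bool, int, float, str]


def format_pass_through_args(params: Mapping[str, ParamValue]) -> str:
    """Join key=value pairs into one string (space separator is assumed)."""
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            text = "true" if value else "false"  # lowercase booleans
        else:
            text = str(value)
        parts.append(f"{key}={text}")
    return " ".join(parts)


args = format_pass_through_args({"force_row_wise": True, "deterministic": True})
```

Keep in mind that, per the description, values passed this way override parameters given through explicit setters.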
- getPredictDisableShapeCheck()[source]
- Returns:
control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data
- Return type:
predictDisableShapeCheck
- getReferenceDataset()[source]
- Returns:
The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().
- Return type:
referenceDataset
- getRepartitionByGroupingColumn()[source]
- Returns:
Repartition training data according to grouping column, on by default.
- Return type:
repartitionByGroupingColumn
- getSamplingMode()[source]
- Returns:
Data sampling for streaming mode. Sampled data is used to define bins. ‘global’: sample from all data, ‘subset’: sample from first N rows, or ‘fixed’: Take first N rows as sample. Values can be global, subset, or fixed. Default is subset.
- Return type:
samplingMode
- getSamplingSubsetSize()[source]
- Returns:
Specify subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.
- Return type:
samplingSubsetSize
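To make the relationship between `binSampleCount` and `samplingSubsetSize` concrete, the sketch below draws the bin-definition sample for the `subset` and `global` modes as they are described above. It is an illustration in plain Python, not the library's implementation; `sample_for_bins` is a hypothetical name, and other modes are deliberately left out.

```python
import random
from typing import List, Sequence


def sample_for_bins(
    rows: Sequence[int],
    mode: str,
    bin_sample_count: int,
    subset_size: int,
    seed: int = 1,
) -> List[int]:
    """Pick the rows used to define histogram bins."""
    if mode == "subset":
        pool = list(rows[:subset_size])  # only the first N rows are candidates
    elif mode == "global":
        pool = list(rows)                # every row is a candidate
    else:
        raise ValueError(f"mode not covered by this sketch: {mode}")
    rng = random.Random(seed)
    return rng.sample(pool, min(bin_sample_count, len(pool)))


rows = list(range(100))
subset_sample = sample_for_bins(rows, "subset", bin_sample_count=10, subset_size=20)
global_sample = sample_for_bins(rows, "global", bin_sample_count=10, subset_size=20)
```

In `subset` mode the sample can only be representative if the first N rows are themselves a random slice of the data, which is why the description recommends it when rows are expected to be random.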
- getSkipDrop()[source]
- Returns:
Probability of skipping the dropout procedure during a boosting iteration
- Return type:
skipDrop
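The dart-related parameters (`dropRate`, `skipDrop`, `maxDrop`) interact during each boosting iteration. The sketch below shows one simplified way they could combine: skip dropout entirely with probability `skipDrop`, otherwise drop each previous tree with probability `dropRate`, and cap the number of dropped trees at `maxDrop`. The native algorithm may weight or cap trees differently, so this is an assumption-laden illustration, and `choose_dropped_trees` is a hypothetical name.

```python
import random
from typing import List


def choose_dropped_trees(
    num_trees: int,
    drop_rate: float,
    skip_drop: float,
    max_drop: int,
    seed: int = 4,
) -> List[int]:
    """Return indices of previous trees to drop in one iteration."""
    rng = random.Random(seed)
    if rng.random() < skip_drop:
        return []  # dropout skipped for this iteration
    dropped = [i for i in range(num_trees) if rng.random() < drop_rate]
    if max_drop > 0 and len(dropped) > max_drop:
        # Cap the count by sampling uniformly among the candidates.
        dropped = sorted(rng.sample(dropped, max_drop))
    return dropped


always_skip = choose_dropped_trees(10, drop_rate=0.5, skip_drop=1.0, max_drop=50)
capped = choose_dropped_trees(10, drop_rate=1.0, skip_drop=0.0, max_drop=3)
```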
- getTopK()[source]
- Returns:
The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It should be greater than 0
- Return type:
topK
- getTopRate()[source]
- Returns:
The retain ratio of large gradient data. Only used in goss.
- Return type:
topRate
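`topRate` and `otherRate` are the two retain ratios of Gradient-based One-Side Sampling. The sketch below follows the commonly published description of the technique: keep the `topRate` fraction of rows with the largest absolute gradients, randomly keep an `otherRate` fraction of all rows from the remainder, and up-weight those sampled rows by `(1 - topRate) / otherRate`. It is a plain-Python illustration rather than the native implementation, and `goss_sample` is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple


def goss_sample(
    gradients: Sequence[float],
    top_rate: float,
    other_rate: float,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Return (row index, weight) pairs kept for the next tree."""
    if not 0.0 < other_rate <= 1.0 or not 0.0 <= top_rate < 1.0:
        raise ValueError("rates must satisfy 0 <= top_rate < 1 and 0 < other_rate <= 1")
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(top_rate * n)
    other_n = int(other_rate * n)
    top, rest = order[:top_n], order[top_n:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(other_n, len(rest)))
    amplify = (1.0 - top_rate) / other_rate  # compensates for under-sampling
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sampled]


gradients = [0.1, -5.0, 0.3, 4.0, 0.2, -0.4, 0.05, 0.6, -0.7, 0.8]
kept = goss_sample(gradients, top_rate=0.2, other_rate=0.1)
```

Rows with large gradients are always kept at weight 1, while the few small-gradient rows that survive stand in for the ones that were discarded.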
- getTweedieVariancePower()[source]
- Returns:
control the variance of tweedie distribution, must be between 1 and 2
- Return type:
tweedieVariancePower
- getUniformDrop()[source]
- Returns:
Set this to true to use uniform drop in dart mode
- Return type:
uniformDrop
- getUseBarrierExecutionMode()[source]
- Returns:
Barrier execution mode which uses a barrier stage, off by default.
- Return type:
useBarrierExecutionMode
- getUseMissing()[source]
- Returns:
Set this to false to disable the special handling of missing values
- Return type:
useMissing
- getUseSingleDatasetMode()[source]
- Returns:
Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.
- Return type:
useSingleDatasetMode
- getValidationIndicatorCol()[source]
- Returns:
Indicates whether the row is for training or validation
- Return type:
validationIndicatorCol
- getVerbosity()[source]
- Returns:
Verbosity level: < 0 is Fatal, 0 is Error, 1 is Info, > 1 is Debug
- Return type:
verbosity
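The verbosity thresholds above can be read as a simple mapping from an integer to a log level. The helper below is a small sketch of that mapping; `verbosity_level` is a hypothetical name, not a library function.

```python
def verbosity_level(verbosity: int) -> str:
    """Map a verbosity value to the log level named in the description."""
    if verbosity < 0:
        return "Fatal"
    if verbosity == 0:
        return "Error"
    if verbosity == 1:
        return "Info"
    return "Debug"


default_level = verbosity_level(-1)  # -1 is the default shown in the signatures
```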
- getXGBoostDartMode()[source]
- Returns:
Set this to true to use xgboost dart mode
- Return type:
xGBoostDartMode
- getZeroAsMissing()[source]
- Returns:
Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values
- Return type:
zeroAsMissing
- improvementTolerance = Param(parent='undefined', name='improvementTolerance', doc='Tolerance to consider improvement in metric')
- initScoreCol = Param(parent='undefined', name='initScoreCol', doc='The name of the initial score column, used for continued training')
- isEnableSparse = Param(parent='undefined', name='isEnableSparse', doc='Used to enable/disable sparse optimization')
- isProvideTrainingMetric = Param(parent='undefined', name='isProvideTrainingMetric', doc='Whether output metric result over training dataset.')
- labelCol = Param(parent='undefined', name='labelCol', doc='label column name')
- lambdaL1 = Param(parent='undefined', name='lambdaL1', doc='L1 regularization')
- lambdaL2 = Param(parent='undefined', name='lambdaL2', doc='L2 regularization')
- leafPredictionCol = Param(parent='undefined', name='leafPredictionCol', doc="Predicted leaf indices's column name")
- learningRate = Param(parent='undefined', name='learningRate', doc='Learning rate or shrinkage rate')
- matrixType = Param(parent='undefined', name='matrixType', doc='Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.')
- maxBin = Param(parent='undefined', name='maxBin', doc='Max bin')
- maxBinByFeature = Param(parent='undefined', name='maxBinByFeature', doc='Max number of bins for each feature')
- maxCatThreshold = Param(parent='undefined', name='maxCatThreshold', doc='limit number of split points considered for categorical features')
- maxCatToOnehot = Param(parent='undefined', name='maxCatToOnehot', doc='when number of categories of one feature smaller than or equal to this, one-vs-other split algorithm will be used')
- maxDeltaStep = Param(parent='undefined', name='maxDeltaStep', doc='Used to limit the max output of tree leaves')
- maxDepth = Param(parent='undefined', name='maxDepth', doc='Max depth')
- maxDrop = Param(parent='undefined', name='maxDrop', doc='Max number of dropped trees during one boosting iteration')
- maxNumClasses = Param(parent='undefined', name='maxNumClasses', doc='Number of max classes to infer numClass in multi-class classification.')
- maxStreamingOMPThreads = Param(parent='undefined', name='maxStreamingOMPThreads', doc="Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it's best to set a fixed value.")
- metric = Param(parent='undefined', name='metric', doc='Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv. ')
- microBatchSize = Param(parent='undefined', name='microBatchSize', doc='Specify how many elements are sent in a streaming micro-batch.')
- minDataInLeaf = Param(parent='undefined', name='minDataInLeaf', doc='Minimal number of data in one leaf. Can be used to deal with over-fitting.')
- minDataPerBin = Param(parent='undefined', name='minDataPerBin', doc='Minimal number of data inside one bin')
- minDataPerGroup = Param(parent='undefined', name='minDataPerGroup', doc='minimal number of data per categorical group')
- minGainToSplit = Param(parent='undefined', name='minGainToSplit', doc='The minimal gain to perform split')
- minSumHessianInLeaf = Param(parent='undefined', name='minSumHessianInLeaf', doc='Minimal sum hessian in one leaf')
- modelString = Param(parent='undefined', name='modelString', doc='LightGBM model to retrain')
- monotoneConstraints = Param(parent='undefined', name='monotoneConstraints', doc='used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.')
- monotoneConstraintsMethod = Param(parent='undefined', name='monotoneConstraintsMethod', doc='Monotone constraints method. basic, intermediate, or advanced.')
- monotonePenalty = Param(parent='undefined', name='monotonePenalty', doc='A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.')
- negBaggingFraction = Param(parent='undefined', name='negBaggingFraction', doc='Negative Bagging fraction')
- numBatches = Param(parent='undefined', name='numBatches', doc='If greater than 0, splits data into separate batches during training')
- numIterations = Param(parent='undefined', name='numIterations', doc='Number of iterations, LightGBM constructs num_class * num_iterations trees')
- numLeaves = Param(parent='undefined', name='numLeaves', doc='Number of leaves')
- numTasks = Param(parent='undefined', name='numTasks', doc='Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.')
- numThreads = Param(parent='undefined', name='numThreads', doc='Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.')
- objective = Param(parent='undefined', name='objective', doc='The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova. ')
- objectiveSeed = Param(parent='undefined', name='objectiveSeed', doc='Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.')
- otherRate = Param(parent='undefined', name='otherRate', doc='The retain ratio of small gradient data. Only used in goss.')
- parallelism = Param(parent='undefined', name='parallelism', doc='Tree learner parallelism, can be set to data_parallel or voting_parallel')
- passThroughArgs = Param(parent='undefined', name='passThroughArgs', doc='Direct string to pass through to LightGBM library (appended with other explicitly set params). Will override any parameters given with explicit setters. Can include multiple parameters in one string. e.g., force_row_wise=true')
- posBaggingFraction = Param(parent='undefined', name='posBaggingFraction', doc='Positive Bagging fraction')
- predictDisableShapeCheck = Param(parent='undefined', name='predictDisableShapeCheck', doc='control whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data')
- predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name')
- referenceDataset = Param(parent='undefined', name='referenceDataset', doc='The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().')
- repartitionByGroupingColumn = Param(parent='undefined', name='repartitionByGroupingColumn', doc='Repartition training data according to grouping column, on by default.')
- samplingMode = Param(parent='undefined', name='samplingMode', doc="Data sampling for streaming mode. Sampled data is used to define bins. 'global': sample from all data, 'subset': sample from first N rows, or 'fixed': Take first N rows as sample. Values can be global, subset, or fixed. Default is subset.")
- samplingSubsetSize = Param(parent='undefined', name='samplingSubsetSize', doc="Specify subset size N for the sampling mode 'subset'. 'binSampleCount' rows will be chosen from the first N values of the dataset. Subset can be used when rows are expected to be random and data is huge.")
- seed = Param(parent='undefined', name='seed', doc='Main seed, used to generate other seeds')
- setBinSampleCount(value)[source]
- Parameters:
binSampleCount – Number of samples considered when computing histogram bins
- setBoostFromAverage(value)[source]
- Parameters:
boostFromAverage – Adjusts initial score to the mean of labels for faster convergence
- setBoostingType(value)[source]
- Parameters:
boostingType – Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).
- setCatSmooth(value)[source]
- Parameters:
catSmooth – Reduces the effect of noise in categorical features, especially for categories with few data points
- setCategoricalSlotIndexes(value)[source]
- Parameters:
categoricalSlotIndexes – List of categorical column indexes, the slot index in the features column
- setCategoricalSlotNames(value)[source]
- Parameters:
categoricalSlotNames – List of categorical column slot names, the slot name in the features column
- setChunkSize(value)[source]
- Parameters:
chunkSize – Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If the dataset size is known beforehand, set this to the number of rows in the dataset.
- setDataRandomSeed(value)[source]
- Parameters:
dataRandomSeed – Random seed for sampling data to construct histogram bins.
- setDataTransferMode(value)[source]
- Parameters:
dataTransferMode – Specifies how SynapseML transfers data from Spark to LightGBM. Values can be streaming or bulk (the legacy mode). The signatures in this reference show dataTransferMode='streaming' as the default.
- setDefaultListenPort(value)[source]
- Parameters:
defaultListenPort – The default listen port on executors, used for testing
- setDeterministic(value)[source]
- Parameters:
deterministic – Used only with the cpu device type. Setting this to true should ensure stable results when using the same data and the same parameters. Note: setting this to true may slow down training. To avoid potential instability due to numerical issues, set force_col_wise=true or force_row_wise=true when setting deterministic=true
- setDriverListenPort(value)[source]
- Parameters:
driverListenPort – The listen port on a driver. Default value is 0 (random)
- setDropRate(value)[source]
- Parameters:
dropRate – Dropout rate: a fraction of previous trees to drop during the dropout
- setDropSeed(value)[source]
- Parameters:
dropSeed – Random seed to choose dropping models. Only used in dart.
- setExecutionMode(value)[source]
- Parameters:
executionMode – Deprecated. Please use dataTransferMode.
- setExtraSeed(value)[source]
- Parameters:
extraSeed – Random seed for selecting threshold when extra_trees is true
- setFeatureFractionByNode(value)[source]
- Parameters:
featureFractionByNode – Feature fraction by node
- setFeaturesShapCol(value)[source]
- Parameters:
featuresShapCol – Output SHAP vector column name after prediction containing the feature contribution values
- setFobj(value)[source]
- Parameters:
fobj – Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).
- setImprovementTolerance(value)[source]
- Parameters:
improvementTolerance – Tolerance to consider improvement in metric
- setInitScoreCol(value)[source]
- Parameters:
initScoreCol – The name of the initial score column, used for continued training
- setIsEnableSparse(value)[source]
- Parameters:
isEnableSparse – Used to enable/disable sparse optimization
- setIsProvideTrainingMetric(value)[source]
- Parameters:
isProvideTrainingMetric – Whether to output the metric result over the training dataset.
- setLeafPredictionCol(value)[source]
- Parameters:
leafPredictionCol – Name of the column for predicted leaf indices
- setMatrixType(value)[source]
- Parameters:
matrixType – Advanced parameter to specify whether the native lightgbm matrix constructed should be sparse or dense. Values can be auto, sparse or dense. Default value is auto, which samples first ten rows to determine type.
- setMaxBinByFeature(value)[source]
- Parameters:
maxBinByFeature – Max number of bins for each feature
- setMaxCatThreshold(value)[source]
- Parameters:
maxCatThreshold – limit number of split points considered for categorical features
- setMaxCatToOnehot(value)[source]
- Parameters:
maxCatToOnehot – When the number of categories of a feature is smaller than or equal to this value, the one-vs-other split algorithm is used
- setMaxDeltaStep(value)[source]
- Parameters:
maxDeltaStep – Used to limit the max output of tree leaves
- setMaxDrop(value)[source]
- Parameters:
maxDrop – Max number of dropped trees during one boosting iteration
- setMaxNumClasses(value)[source]
- Parameters:
maxNumClasses – Number of max classes to infer numClass in multi-class classification.
- setMaxStreamingOMPThreads(value)[source]
- Parameters:
maxStreamingOMPThreads – Maximum number of OpenMP threads used by a LightGBM thread. Used only for thread-safe buffer allocation. Use -1 to use OpenMP default, but in a Spark environment it’s best to set a fixed value.
- setMetric(value)[source]
- Parameters:
metric – Metrics to be evaluated on the evaluation data. Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.
- setMicroBatchSize(value)[source]
- Parameters:
microBatchSize – Specify how many elements are sent in a streaming micro-batch.
- setMinDataInLeaf(value)[source]
- Parameters:
minDataInLeaf – Minimal number of data in one leaf. Can be used to deal with over-fitting.
- setMinDataPerGroup(value)[source]
- Parameters:
minDataPerGroup – minimal number of data per categorical group
- setMinSumHessianInLeaf(value)[source]
- Parameters:
minSumHessianInLeaf – Minimal sum hessian in one leaf
- setMonotoneConstraints(value)[source]
- Parameters:
monotoneConstraints – used for constraints of monotonic features. 1 means increasing, -1 means decreasing, 0 means non-constraint. Specify all features in order.
- setMonotoneConstraintsMethod(value)[source]
- Parameters:
monotoneConstraintsMethod – Monotone constraints method. basic, intermediate, or advanced.
- setMonotonePenalty(value)[source]
- Parameters:
monotonePenalty – A penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree.
- setNumBatches(value)[source]
- Parameters:
numBatches – If greater than 0, splits data into separate batches during training
- setNumIterations(value)[source]
- Parameters:
numIterations – Number of iterations, LightGBM constructs num_class * num_iterations trees
- setNumTasks(value)[source]
- Parameters:
numTasks – Advanced parameter to specify the number of tasks. SynapseML tries to guess this based on cluster configuration, but this parameter can be used to override.
- setNumThreads(value)[source]
- Parameters:
numThreads – Number of threads per executor for LightGBM. For the best speed, set this to the number of real CPU cores.
- setObjective(value)[source]
- Parameters:
objective – The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.
- setObjectiveSeed(value)[source]
- Parameters:
objectiveSeed – Random seed for objectives, if random process is needed. Currently used only for rank_xendcg objective.
- setOtherRate(value)[source]
- Parameters:
otherRate – The retain ratio of small gradient data. Only used in goss.
- setParallelism(value)[source]
- Parameters:
parallelism – Tree learner parallelism, can be set to data_parallel or voting_parallel
- setParams(alpha=0.9, baggingFraction=1.0, baggingFreq=0, baggingSeed=3, binSampleCount=200000, boostFromAverage=True, boostingType='gbdt', catSmooth=10.0, categoricalSlotIndexes=[], categoricalSlotNames=[], catl2=10.0, chunkSize=10000, dataRandomSeed=1, dataTransferMode='streaming', defaultListenPort=12400, deterministic=False, driverListenPort=0, dropRate=0.1, dropSeed=4, earlyStoppingRound=0, executionMode=None, extraSeed=6, featureFraction=1.0, featureFractionByNode=None, featureFractionSeed=2, featuresCol='features', featuresShapCol='', fobj=None, improvementTolerance=0.0, initScoreCol=None, isEnableSparse=True, isProvideTrainingMetric=False, labelCol='label', lambdaL1=0.0, lambdaL2=0.0, leafPredictionCol='', learningRate=0.1, matrixType='auto', maxBin=255, maxBinByFeature=[], maxCatThreshold=32, maxCatToOnehot=4, maxDeltaStep=0.0, maxDepth=- 1, maxDrop=50, maxNumClasses=100, maxStreamingOMPThreads=16, metric='', microBatchSize=100, minDataInLeaf=20, minDataPerBin=3, minDataPerGroup=100, minGainToSplit=0.0, minSumHessianInLeaf=0.001, modelString='', monotoneConstraints=[], monotoneConstraintsMethod='basic', monotonePenalty=0.0, negBaggingFraction=1.0, numBatches=0, numIterations=100, numLeaves=31, numTasks=0, numThreads=0, objective='regression', objectiveSeed=5, otherRate=0.1, parallelism='data_parallel', passThroughArgs='', posBaggingFraction=1.0, predictDisableShapeCheck=False, predictionCol='prediction', referenceDataset=None, repartitionByGroupingColumn=True, samplingMode='subset', samplingSubsetSize=1000000, seed=None, skipDrop=0.5, slotNames=[], timeout=1200.0, topK=20, topRate=0.2, tweedieVariancePower=1.5, uniformDrop=False, useBarrierExecutionMode=False, useMissing=True, useSingleDatasetMode=True, validationIndicatorCol=None, verbosity=- 1, weightCol=None, xGBoostDartMode=False, zeroAsMissing=False)[source]
Set the (keyword only) parameters
- setPassThroughArgs(value)[source]
- Parameters:
passThroughArgs – String passed directly to the LightGBM library, appended to the other explicitly set parameters. It overrides any parameters given through explicit setters and can include multiple parameters in one string, e.g., force_row_wise=true.
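As a minimal sketch of how such a string could be assembled, the helper below joins native LightGBM parameters into a single pass-through string. The helper name build_pass_through_args is hypothetical, and the space separator is an assumption based on LightGBM's native key=value parameter strings. force_row_wise and histogram_pool_size are native LightGBM parameter names.

```python
def build_pass_through_args(params):
    """Join native LightGBM parameters into one key=value string.

    Hypothetical helper; the space separator is an assumption.
    """
    parts = []
    for key, value in params.items():
        # Native LightGBM configuration uses lowercase booleans (true/false).
        if isinstance(value, bool):
            value = str(value).lower()
        parts.append(f"{key}={value}")
    return " ".join(parts)


args = build_pass_through_args(
    {"force_row_wise": True, "histogram_pool_size": 1024}
)
print(args)

# Possible usage (needs a Spark session with SynapseML installed):
# regressor = LightGBMRegressor().setPassThroughArgs(args)
```

Because pass-through values override explicit setters, it may be safer to keep each parameter in only one of the two places.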
- setPredictDisableShapeCheck(value)[source]
- Parameters:
predictDisableShapeCheck – Controls whether LightGBM raises an error when you predict on data with a different number of features than the training data. When false (the default), a mismatch raises an error; when true, the shape check is skipped.
- setReferenceDataset(value)[source]
- Parameters:
referenceDataset – The reference Dataset that was used for the fit. If using samplingMode=custom, this must be set before fit().
- setRepartitionByGroupingColumn(value)[source]
- Parameters:
repartitionByGroupingColumn – Repartition training data according to the grouping column; on by default.
- setSamplingMode(value)[source]
- Parameters:
samplingMode – Data sampling for streaming mode. Sampled data is used to define bins. Values can be ‘global’ (sample from all data), ‘subset’ (sample from the first N rows), or ‘fixed’ (take the first N rows as the sample). Default is subset.
- setSamplingSubsetSize(value)[source]
- Parameters:
samplingSubsetSize – Subset size N for the sampling mode ‘subset’. ‘binSampleCount’ rows are chosen from the first N rows of the dataset. ‘subset’ is suitable when rows are expected to be in random order and the data is huge.
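The three sampling modes can be illustrated on a plain Python list. This is only a sketch of the behavior described for samplingMode, samplingSubsetSize, and binSampleCount, not SynapseML's implementation. The function name, the seed, and the reading of N as bin_sample_count in ‘fixed’ mode are assumptions.

```python
import random


def sample_rows_for_binning(rows, mode="subset", bin_sample_count=5,
                            subset_size=20, seed=1):
    """Pick the rows used to define bins (illustrative only)."""
    rng = random.Random(seed)
    if mode == "fixed":
        # Assumption: N is bin_sample_count here, with no random draw.
        return rows[:bin_sample_count]
    if mode == "global":
        pool = rows                  # sample from all data
    elif mode == "subset":
        pool = rows[:subset_size]    # sample from the first N rows only
    else:
        raise ValueError(f"mode not covered by this sketch: {mode!r}")
    return rng.sample(pool, min(bin_sample_count, len(pool)))


rows = list(range(100))
subset_sample = sample_rows_for_binning(rows, mode="subset")
fixed_sample = sample_rows_for_binning(rows, mode="fixed")
global_sample = sample_rows_for_binning(rows, mode="global")
```

In this sketch ‘subset’ only ever reads the first subset_size rows, which is why it suits data whose row order is already random.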
- setSkipDrop(value)[source]
- Parameters:
skipDrop – Probability of skipping the dropout procedure during a boosting iteration.
- setTopK(value)[source]
- Parameters:
topK – The top_k value used in voting parallel. Set this to a larger value for more accurate results, at the cost of slower training. It must be greater than 0.
- setTopRate(value)[source]
- Parameters:
topRate – The retain ratio of large-gradient data. Only used in GOSS.
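To show what the retain ratio means, here is a minimal sketch of gradient-based one-side sampling (GOSS) on a plain list of gradients. It follows the published description of GOSS rather than LightGBM's source: the top_rate fraction of rows with the largest absolute gradients is always kept, an other_rate fraction of all rows is drawn at random from the remainder, and the sampled rows are up-weighted by (1 - top_rate) / other_rate. The function name and the seed are hypothetical.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by a GOSS-style sample."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(top_rate * n)
    other_n = int(other_rate * n)
    top_idx, rest = order[:top_n], order[top_n:]
    # Randomly sample from the small-gradient remainder.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(other_n, len(rest)))
    # Up-weight sampled rows to compensate for the rows that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top_idx}
    weights.update({i: amplify for i in sampled})
    return weights


gradients = [0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.4, -0.6, 0.7, 0.01]
weights = goss_sample(gradients)
print(sorted(weights.items()))
```

With ten rows and the default rates, rows 1 and 3 (the two largest gradients) are always kept, and one further row is drawn from the rest.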
- setTweedieVariancePower(value)[source]
- Parameters:
tweedieVariancePower – Controls the variance of the Tweedie distribution; must be between 1 and 2.
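For a Tweedie distribution the variance grows as a power of the mean, Var(Y) = phi * mu^p, where p is the variance power. The sketch below evaluates that relationship. The range check 1.0 <= p < 2.0 follows LightGBM's native parameter documentation and is an assumption about how "between 1 and 2" is enforced here. The function name and the dispersion argument are hypothetical.

```python
def tweedie_variance(mean, power=1.5, dispersion=1.0):
    """Variance implied by a Tweedie model: dispersion * mean ** power."""
    # Assumed constraint, taken from LightGBM's native parameter docs.
    if not 1.0 <= power < 2.0:
        raise ValueError("power must satisfy 1.0 <= power < 2.0")
    if mean <= 0:
        raise ValueError("mean must be positive")
    return dispersion * mean ** power


# Closer to 1 behaves like Poisson (variance proportional to the mean);
# closer to 2 behaves like Gamma (variance proportional to the mean squared).
for p in (1.0, 1.5, 1.9):
    print(p, tweedie_variance(4.0, power=p))
```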
- setUniformDrop(value)[source]
- Parameters:
uniformDrop – Set this to true to use uniform drop in DART mode.
- setUseBarrierExecutionMode(value)[source]
- Parameters:
useBarrierExecutionMode – Barrier execution mode, which uses a barrier stage; off by default.
- setUseMissing(value)[source]
- Parameters:
useMissing – Set this to false to disable the special handling of missing values.
- setUseSingleDatasetMode(value)[source]
- Parameters:
useSingleDatasetMode – Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.
- setValidationIndicatorCol(value)[source]
- Parameters:
validationIndicatorCol – Indicates whether the row is for training or validation.
- setVerbosity(value)[source]
- Parameters:
verbosity – Verbosity level: < 0 is Fatal, 0 is Error, 1 is Info, > 1 is Debug.
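The mapping from the integer value to a log level can be written out as a small lookup. This is only a restatement of the thresholds listed above; the function name is hypothetical and is not part of SynapseML.

```python
def describe_verbosity(level):
    """Map a verbosity value to the log level named in the parameter doc."""
    if level < 0:
        return "Fatal"
    if level == 0:
        return "Error"
    if level == 1:
        return "Info"
    return "Debug"


# The default shown in the setParams signature is -1.
for level in (-1, 0, 1, 2):
    print(level, describe_verbosity(level))
```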
- setXGBoostDartMode(value)[source]
- Parameters:
xGBoostDartMode – Set this to true to use XGBoost DART mode.
- setZeroAsMissing(value)[source]
- Parameters:
zeroAsMissing – Set to true to treat all zeros as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use NA to represent missing values.
- skipDrop = Param(parent='undefined', name='skipDrop', doc='Probability of skipping the dropout procedure during a boosting iteration')
- slotNames = Param(parent='undefined', name='slotNames', doc='List of slot names in the features column')
- timeout = Param(parent='undefined', name='timeout', doc='Timeout in seconds')
- topK = Param(parent='undefined', name='topK', doc='The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0')
- topRate = Param(parent='undefined', name='topRate', doc='The retain ratio of large gradient data. Only used in goss.')
- tweedieVariancePower = Param(parent='undefined', name='tweedieVariancePower', doc='control the variance of tweedie distribution, must be between 1 and 2')
- uniformDrop = Param(parent='undefined', name='uniformDrop', doc='Set this to true to use uniform drop in dart mode')
- useBarrierExecutionMode = Param(parent='undefined', name='useBarrierExecutionMode', doc='Barrier execution mode which uses a barrier stage, off by default.')
- useMissing = Param(parent='undefined', name='useMissing', doc='Set this to false to disable the special handle of missing value')
- useSingleDatasetMode = Param(parent='undefined', name='useSingleDatasetMode', doc='Use single dataset execution mode to create a single native dataset per executor (singleton) to reduce memory and communication overhead.')
- validationIndicatorCol = Param(parent='undefined', name='validationIndicatorCol', doc='Indicates whether the row is for training or validation')
- verbosity = Param(parent='undefined', name='verbosity', doc='Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug')
- weightCol = Param(parent='undefined', name='weightCol', doc='The name of the weight column')
- xGBoostDartMode = Param(parent='undefined', name='xGBoostDartMode', doc='Set this to true to use xgboost dart mode')
- zeroAsMissing = Param(parent='undefined', name='zeroAsMissing', doc='Set to true to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices). Set to false to use na for representing missing values')
synapse.ml.lightgbm.mixin module
- class synapse.ml.lightgbm.mixin.LightGBMModelMixin[source]
Bases: object
- getBoosterBestIteration()[source]
Get the best iteration from the booster.
- Returns:
The best iteration, if early stopping was triggered.
- getBoosterNumFeatures()[source]
Get the number of features from the booster.
- Returns:
The number of features.
- getBoosterNumTotalIterations()[source]
Get the total number of iterations trained.
- Returns:
The total number of iterations trained.
- getBoosterNumTotalModel()[source]
Get the total number of models trained.
- Returns:
The total number of models.
- getFeatureImportances(importance_type='split')[source]
Get the feature importances as a list. The importance_type can be “split” or “gain”.
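Since the importances come back as a plain list, one common follow-up is to pair them with feature names and sort them. The sketch below does that with stand-in values. The helper name rank_features, the feature names, and the numbers are hypothetical, and it assumes the list is ordered like the slots of the features column.

```python
def rank_features(names, importances, top_n=None):
    """Pair feature names with importances, sorted from most to least important."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1],
                    reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Stand-in for: model.getFeatureImportances(importance_type="gain")
importances = [3.0, 12.5, 7.0]
names = ["age", "income", "tenure"]

top_two = rank_features(names, importances, top_n=2)
print(top_two)
```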
Module contents
SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.
SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.