synapse.ml.stages package
Submodules
synapse.ml.stages.Cacher module
- class synapse.ml.stages.Cacher.Cacher(java_obj=None, disable=False)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
disable (bool) – Whether to disable caching (so that you can turn it off during evaluation)
- disable = Param(parent='undefined', name='disable', doc='Whether to disable caching (so that you can turn it off during evaluation)')
- getDisable()[source]
- Returns
Whether to disable caching (so that you can turn it off during evaluation)
- Return type
disable
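A minimal usage sketch for Cacher, assuming SynapseML is installed on the cluster and an active SparkSession is available; the sample DataFrame is illustrative:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.Cacher import Cacher

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Cache the incoming DataFrame; set disable=True to turn caching off,
# e.g. during evaluation runs.
cached = Cacher(disable=False).transform(df)
cached.show()
```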
synapse.ml.stages.ClassBalancer module
- class synapse.ml.stages.ClassBalancer.ClassBalancer(java_obj=None, broadcastJoin=True, inputCol=None, outputCol='weight')[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- broadcastJoin = Param(parent='undefined', name='broadcastJoin', doc='Whether to broadcast the class to weight mapping to the worker')
- getBroadcastJoin()[source]
- Returns
Whether to broadcast the class to weight mapping to the worker
- Return type
broadcastJoin
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
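A hedged sketch of fitting a ClassBalancer to append per-class weights; the DataFrame and column names are illustrative and assume SynapseML is available on the cluster:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.ClassBalancer import ClassBalancer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, "a"), (0, "b"), (0, "c"), (1, "d")], ["label", "text"]
)

# Fit per-class weights from the label column and append them as "weight".
model = ClassBalancer(inputCol="label", outputCol="weight").fit(df)
model.transform(df).show()
```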
synapse.ml.stages.ClassBalancerModel module
- class synapse.ml.stages.ClassBalancerModel.ClassBalancerModel(java_obj=None, broadcastJoin=None, inputCol=None, outputCol=None, weights=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- broadcastJoin = Param(parent='undefined', name='broadcastJoin', doc='whether to broadcast join')
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- setParams(broadcastJoin=None, inputCol=None, outputCol=None, weights=None)[source]
Set the (keyword only) parameters
- weights = Param(parent='undefined', name='weights', doc='the dataframe of weights')
synapse.ml.stages.DropColumns module
synapse.ml.stages.DynamicMiniBatchTransformer module
synapse.ml.stages.EnsembleByKey module
- class synapse.ml.stages.EnsembleByKey.EnsembleByKey(java_obj=None, colNames=None, collapseGroup=True, cols=None, keys=None, strategy='mean', vectorDims=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- colNames = Param(parent='undefined', name='colNames', doc='Names of the result of each col')
- collapseGroup = Param(parent='undefined', name='collapseGroup', doc='Whether to collapse all items in group to one entry')
- cols = Param(parent='undefined', name='cols', doc='Cols to ensemble')
- getCollapseGroup()[source]
- Returns
Whether to collapse all items in group to one entry
- Return type
collapseGroup
- getVectorDims()[source]
- Returns
the dimensions of any vector columns, used to avoid materialization
- Return type
vectorDims
- keys = Param(parent='undefined', name='keys', doc='Keys to group by')
- setCollapseGroup(value)[source]
- Parameters
collapseGroup – Whether to collapse all items in group to one entry
- setParams(colNames=None, collapseGroup=True, cols=None, keys=None, strategy='mean', vectorDims=None)[source]
Set the (keyword only) parameters
- setVectorDims(value)[source]
- Parameters
vectorDims – the dimensions of any vector columns, used to avoid materialization
- strategy = Param(parent='undefined', name='strategy', doc='How to ensemble the scores, ex: mean')
- vectorDims = Param(parent='undefined', name='vectorDims', doc='the dimensions of any vector columns, used to avoid materialization')
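An illustrative sketch of averaging a score column per key; the data, key, and result column names are assumptions for the example, not part of the API:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.EnsembleByKey import EnsembleByKey

spark = SparkSession.builder.getOrCreate()
scores = spark.createDataFrame(
    [("u1", 0.8), ("u1", 0.6), ("u2", 0.4)], ["user", "score"]
)

# Average the "score" column within each "user" group and emit it as "avg_score",
# collapsing each group to a single row.
ensembler = EnsembleByKey(
    keys=["user"], cols=["score"], colNames=["avg_score"],
    strategy="mean", collapseGroup=True,
)
ensembler.transform(scores).show()
```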
synapse.ml.stages.Explode module
- class synapse.ml.stages.Explode.Explode(java_obj=None, inputCol=None, outputCol='Explode_69f00561ca14_output')[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
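A minimal sketch, assuming an array-typed input column; the sample data is illustrative:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.Explode import Explode

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "tokens"])

# Expand each element of the array column into its own row.
Explode(inputCol="tokens", outputCol="token").transform(df).show()
```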
synapse.ml.stages.FixedMiniBatchTransformer module
- class synapse.ml.stages.FixedMiniBatchTransformer.FixedMiniBatchTransformer(java_obj=None, batchSize=None, buffered=False, maxBufferSize=2147483647)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- batchSize = Param(parent='undefined', name='batchSize', doc='The max size of the buffer')
- buffered = Param(parent='undefined', name='buffered', doc='Whether or not to buffer batches in memory')
- maxBufferSize = Param(parent='undefined', name='maxBufferSize', doc='The max size of the buffer')
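A small sketch that groups rows into fixed-size batches; the batch size and sample data are illustrative:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.FixedMiniBatchTransformer import FixedMiniBatchTransformer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(i,) for i in range(10)], ["value"])

# Group rows into batches of at most 3; each output row holds arrays of inputs.
FixedMiniBatchTransformer(batchSize=3).transform(df).show(truncate=False)
```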
synapse.ml.stages.FlattenBatch module
synapse.ml.stages.Lambda module
- class synapse.ml.stages.Lambda.Lambda(java_obj=None, transformFunc=None, transformSchemaFunc=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- getTransformSchemaFunc()[source]
- Returns
the output schema after the transformation
- Return type
transformSchemaFunc
- setTransformSchemaFunc(value)[source]
- Parameters
transformSchemaFunc – the output schema after the transformation
- transformFunc = Param(parent='undefined', name='transformFunc', doc='holder for dataframe function')
- transformSchemaFunc = Param(parent='undefined', name='transformSchemaFunc', doc='the output schema after the transformation')
synapse.ml.stages.MultiColumnAdapter module
- class synapse.ml.stages.MultiColumnAdapter.MultiColumnAdapter(java_obj=None, baseStage=None, inputCols=None, outputCols=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- baseStage = Param(parent='undefined', name='baseStage', doc='base pipeline stage to apply to every column')
- inputCols = Param(parent='undefined', name='inputCols', doc='list of column names encoded as a string')
- outputCols = Param(parent='undefined', name='outputCols', doc='list of column names encoded as a string')
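A sketch that applies the same base Tokenizer stage to two columns; wrapping the adapter in a Pipeline sidesteps whether it is fit or transformed directly, and the column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer
from synapse.ml.stages.MultiColumnAdapter import MultiColumnAdapter

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hello world", "good day")], ["col1", "col2"])

# Apply the same base stage (here a Tokenizer) to each input column,
# writing the results to the corresponding output columns.
adapter = MultiColumnAdapter(
    baseStage=Tokenizer(),
    inputCols=["col1", "col2"],
    outputCols=["col1_tokens", "col2_tokens"],
)
Pipeline(stages=[adapter]).fit(df).transform(df).show(truncate=False)
```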
synapse.ml.stages.PartitionConsolidator module
- class synapse.ml.stages.PartitionConsolidator.PartitionConsolidator(java_obj=None, concurrency=1, concurrentTimeout=None, inputCol=None, outputCol=None, timeout=60.0)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- concurrency = Param(parent='undefined', name='concurrency', doc='max number of concurrent calls')
- concurrentTimeout = Param(parent='undefined', name='concurrentTimeout', doc='max number of seconds to wait on futures if concurrency >= 1')
- getConcurrentTimeout()[source]
- Returns
max number of seconds to wait on futures if concurrency >= 1
- Return type
concurrentTimeout
- getTimeout()[source]
- Returns
number of seconds to wait before closing the connection
- Return type
timeout
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- setConcurrentTimeout(value)[source]
- Parameters
concurrentTimeout – max number of seconds to wait on futures if concurrency >= 1
- setParams(concurrency=1, concurrentTimeout=None, inputCol=None, outputCol=None, timeout=60.0)[source]
Set the (keyword only) parameters
- setTimeout(value)[source]
- Parameters
timeout – number of seconds to wait before closing the connection
- timeout = Param(parent='undefined', name='timeout', doc='number of seconds to wait before closing the connection')
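A rough sketch (SynapseML installed and column names illustrative); in practice this stage is typically placed in front of a rate-limited web service to funnel rows through fewer concurrent workers:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.PartitionConsolidator import PartitionConsolidator

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(i,) for i in range(100)], ["value"]).repartition(8)

# Funnel the "value" column through a single concurrent consolidator,
# waiting up to 60 seconds before closing a connection.
consolidator = PartitionConsolidator(
    inputCol="value", outputCol="consolidated", concurrency=1, timeout=60.0
)
consolidator.transform(df).show()
```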
synapse.ml.stages.RenameColumn module
- class synapse.ml.stages.RenameColumn.RenameColumn(java_obj=None, inputCol=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
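A minimal sketch renaming one column; the sample data is illustrative:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.RenameColumn import RenameColumn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "val"])

# Rename the "val" column to "value".
RenameColumn(inputCol="val", outputCol="value").transform(df).show()
```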
synapse.ml.stages.Repartition module
- class synapse.ml.stages.Repartition.Repartition(java_obj=None, disable=False, n=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- disable = Param(parent='undefined', name='disable', doc='Whether to disable repartitioning (so that one can turn it off for evaluation)')
- getDisable()[source]
- Returns
Whether to disable repartitioning (so that one can turn it off for evaluation)
- Return type
disable
- n = Param(parent='undefined', name='n', doc='Number of partitions')
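A minimal sketch; the partition count is illustrative:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.Repartition import Repartition

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(i,) for i in range(100)], ["value"])

# Repartition to 4 partitions inside a pipeline; disable=True turns it into a no-op.
repartitioned = Repartition(n=4).transform(df)
print(repartitioned.rdd.getNumPartitions())
```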
synapse.ml.stages.SelectColumns module
synapse.ml.stages.StratifiedRepartition module
- class synapse.ml.stages.StratifiedRepartition.StratifiedRepartition(java_obj=None, labelCol=None, mode='mixed', seed=1518410069)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- getMode()[source]
- Returns
Specify equal to repartition with replacement across all labels, specify original to keep the ratios in the original dataset, or specify mixed to use a heuristic
- Return type
mode
- labelCol = Param(parent='undefined', name='labelCol', doc='The name of the label column')
- mode = Param(parent='undefined', name='mode', doc='Specify equal to repartition with replacement across all labels, specify original to keep the ratios in the original dataset, or specify mixed to use a heuristic')
- seed = Param(parent='undefined', name='seed', doc='random seed')
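A sketch using the "equal" mode; the label values and data are illustrative:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.StratifiedRepartition import StratifiedRepartition

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, "a"), (0, "b"), (1, "c"), (1, "d")], ["label", "text"]
)

# Redistribute rows so each partition sees the labels according to the chosen mode:
# "equal", "original", or "mixed".
StratifiedRepartition(labelCol="label", mode="equal").transform(df).show()
```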
synapse.ml.stages.SummarizeData module
- class synapse.ml.stages.SummarizeData.SummarizeData(java_obj=None, basic=True, counts=True, errorThreshold=0.0, percentiles=True, sample=True)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- basic = Param(parent='undefined', name='basic', doc='Compute basic statistics')
- counts = Param(parent='undefined', name='counts', doc='Compute count statistics')
- errorThreshold = Param(parent='undefined', name='errorThreshold', doc='Threshold for quantiles - 0 is exact')
- getErrorThreshold()[source]
- Returns
Threshold for quantiles - 0 is exact
- Return type
errorThreshold
- percentiles = Param(parent='undefined', name='percentiles', doc='Compute percentiles')
- sample = Param(parent='undefined', name='sample', doc='Compute sample statistics')
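A minimal sketch producing a summary DataFrame; the input data is illustrative:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.SummarizeData import SummarizeData

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 4.5), (3, 9.0)], ["id", "score"])

# Produce a summary DataFrame with counts, basic statistics, and percentiles.
SummarizeData(basic=True, counts=True, percentiles=True).transform(df).show()
```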
synapse.ml.stages.TextPreprocessor module
- class synapse.ml.stages.TextPreprocessor.TextPreprocessor(java_obj=None, inputCol=None, map=None, normFunc=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- map = Param(parent='undefined', name='map', doc='Map of substring match to replacement')
- normFunc = Param(parent='undefined', name='normFunc', doc='Name of normalization function to apply')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
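A hedged sketch; the substring-replacement map is illustrative, and passing it as a plain Python dict is an assumption about the Python wrapper:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.TextPreprocessor import TextPreprocessor

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Hello WORLD",), ("hi there",)], ["text"])

# Replace substrings according to the map before downstream tokenization.
tp = TextPreprocessor(
    inputCol="text", outputCol="clean", map={"Hello": "greeting", "hi": "greeting"}
)
tp.transform(df).show(truncate=False)
```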
synapse.ml.stages.TimeIntervalMiniBatchTransformer module
- class synapse.ml.stages.TimeIntervalMiniBatchTransformer.TimeIntervalMiniBatchTransformer(java_obj=None, maxBatchSize=2147483647, millisToWait=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- getMillisToWait()[source]
- Returns
The time to wait before constructing a batch
- Return type
millisToWait
- maxBatchSize = Param(parent='undefined', name='maxBatchSize', doc='The max size of the buffer')
- millisToWait = Param(parent='undefined', name='millisToWait', doc='The time to wait before constructing a batch')
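A sketch batching rows by time interval; the wait time and batch-size cap are illustrative:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.TimeIntervalMiniBatchTransformer import (
    TimeIntervalMiniBatchTransformer,
)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(i,) for i in range(10)], ["value"])

# Collect rows into a batch after waiting 1000 ms, capping each batch at 100 rows.
batcher = TimeIntervalMiniBatchTransformer(millisToWait=1000, maxBatchSize=100)
batcher.transform(df).show(truncate=False)
```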
synapse.ml.stages.Timer module
- class synapse.ml.stages.Timer.Timer(java_obj=None, disableMaterialization=True, logToScala=True, stage=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- disableMaterialization = Param(parent='undefined', name='disableMaterialization', doc='Whether to disable timing (so that one can turn it off for evaluation)')
- getDisableMaterialization()[source]
- Returns
Whether to disable timing (so that one can turn it off for evaluation)
- Return type
disableMaterialization
- getLogToScala()[source]
- Returns
Whether to output the time to the scala console
- Return type
logToScala
- logToScala = Param(parent='undefined', name='logToScala', doc='Whether to output the time to the scala console')
- setDisableMaterialization(value)[source]
- Parameters
disableMaterialization – Whether to disable timing (so that one can turn it off for evaluation)
- setLogToScala(value)[source]
- Parameters
logToScala – Whether to output the time to the scala console
- setParams(disableMaterialization=True, logToScala=True, stage=None)[source]
Set the (keyword only) parameters
- stage = Param(parent='undefined', name='stage', doc='The stage to time')
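A sketch timing a wrapped Tokenizer; fitting Timer like other estimators (to obtain a TimerModel) is the assumed usage, and the sample data is illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer
from synapse.ml.stages.Timer import Timer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hello world",)], ["text"])

# Wrap a pipeline stage so that fitting and transforming it are timed;
# timings are logged on the Scala side when logToScala=True.
timer = Timer(stage=Tokenizer(inputCol="text", outputCol="tokens"))
timed_model = timer.fit(df)
timed_model.transform(df).show(truncate=False)
```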
synapse.ml.stages.TimerModel module
- class synapse.ml.stages.TimerModel.TimerModel(java_obj=None, disableMaterialization=True, logToScala=True, stage=None, transformer=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- disableMaterialization = Param(parent='undefined', name='disableMaterialization', doc='Whether to disable timing (so that one can turn it off for evaluation)')
- getDisableMaterialization()[source]
- Returns
Whether to disable timing (so that one can turn it off for evaluation)
- Return type
disableMaterialization
- getLogToScala()[source]
- Returns
Whether to output the time to the scala console
- Return type
logToScala
- logToScala = Param(parent='undefined', name='logToScala', doc='Whether to output the time to the scala console')
- setDisableMaterialization(value)[source]
- Parameters
disableMaterialization – Whether to disable timing (so that one can turn it off for evaluation)
- setLogToScala(value)[source]
- Parameters
logToScala – Whether to output the time to the scala console
- setParams(disableMaterialization=True, logToScala=True, stage=None, transformer=None)[source]
Set the (keyword only) parameters
- stage = Param(parent='undefined', name='stage', doc='The stage to time')
- transformer = Param(parent='undefined', name='transformer', doc='inner model to time')
synapse.ml.stages.UDFTransformer module
synapse.ml.stages.UnicodeNormalize module
- class synapse.ml.stages.UnicodeNormalize.UnicodeNormalize(java_obj=None, form=None, inputCol=None, lower=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- form = Param(parent='undefined', name='form', doc='Unicode normalization form: NFC, NFD, NFKC, NFKD')
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- lower = Param(parent='undefined', name='lower', doc='Lowercase text')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
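A minimal sketch applying NFKC normalization and lowercasing; the sample text is illustrative:

```python
from pyspark.sql import SparkSession
from synapse.ml.stages.UnicodeNormalize import UnicodeNormalize

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Ｃａｆé",)], ["text"])

# Apply NFKC normalization and lowercase the result.
normalizer = UnicodeNormalize(form="NFKC", inputCol="text", outputCol="norm", lower=True)
normalizer.transform(df).show(truncate=False)
```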
Module contents
SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.
SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.