synapse.ml.stages package

Submodules

synapse.ml.stages.Cacher module

class synapse.ml.stages.Cacher.Cacher(java_obj=None, disable=False)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters: disable (bool) – Whether to disable caching (so that you can turn it off during evaluation)

disable = Param(parent='undefined', name='disable', doc='Whether or disable caching (so that you can turn it off during evaluation)')

getDisable()[source]

Returns: Whether to disable caching (so that you can turn it off during evaluation)
Return type: bool

static getJavaPackage()[source]: Returns package name String.

classmethod read()[source]: Returns an MLReader instance for this class.

setDisable(value)[source]

Parameters: disable – Whether to disable caching (so that you can turn it off during evaluation)

setParams(disable=False)[source]: Set the (keyword only) parameters

synapse.ml.stages.ClassBalancer module

class synapse.ml.stages.ClassBalancer.ClassBalancer(java_obj=None, broadcastJoin=True, inputCol=None, outputCol='weight')[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters

broadcastJoin (bool) – Whether to broadcast the class-to-weight mapping to the workers
inputCol (str) – The name of the input column
outputCol (str) – The name of the output column

broadcastJoin = Param(parent='undefined', name='broadcastJoin', doc='Whether to broadcast the class to weight mapping to the worker')

getBroadcastJoin()[source]

Returns: Whether to broadcast the class to weight mapping to the worker
Return type: broadcastJoin

getInputCol()[source]

Returns: The name of the input column
Return type: inputCol

static getJavaPackage()[source]: Returns package name String.

getOutputCol()[source]

Returns: The name of the output column
Return type: outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')

outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')

classmethod read()[source]: Returns an MLReader instance for this class.

setBroadcastJoin(value)[source]

Parameters: broadcastJoin¶ – Whether to broadcast the class to weight mapping to the worker

setInputCol(value)[source]

Parameters: inputCol¶ – The name of the input column

setOutputCol(value)[source]

Parameters: outputCol¶ – The name of the output column

setParams(broadcastJoin=True, inputCol=None, outputCol='weight')[source]: Set the (keyword only) parameters
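
To make the idea of a class-to-weight mapping concrete, here is a minimal plain-Python sketch. It assumes each class is weighted by the largest class count divided by its own count, so the most frequent class gets 1.0. That formula and the helper name class_weights are assumptions for illustration; this page does not state how ClassBalancer derives its weights.

```python
from collections import Counter


def class_weights(labels):
    """Map each class to max_count / class_count (an assumed weighting scheme)."""
    counts = Counter(labels)
    max_count = max(counts.values())
    return {cls: max_count / n for cls, n in counts.items()}


labels = ["a", "a", "a", "b"]
weights = class_weights(labels)

# Attach the weight to every row, mirroring the inputCol -> outputCol ("weight") idea.
rows = [{"label": y, "weight": weights[y]} for y in labels]
```

The rare class "b" ends up with a larger weight than the common class "a", which is the usual purpose of a balancing weight column.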

synapse.ml.stages.ClassBalancerModel module

class synapse.ml.stages.ClassBalancerModel.ClassBalancerModel(java_obj=None, broadcastJoin=None, inputCol=None, outputCol=None, weights=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaModel

Parameters

broadcastJoin (bool) – Whether to use a broadcast join
inputCol (str) – The name of the input column
outputCol (str) – The name of the output column
weights (object) – The DataFrame of weights

broadcastJoin = Param(parent='undefined', name='broadcastJoin', doc='whether to broadcast join')

getBroadcastJoin()[source]

Returns: whether to broadcast join
Return type: broadcastJoin

getInputCol()[source]

Returns: The name of the input column
Return type: inputCol

static getJavaPackage()[source]: Returns package name String.

getOutputCol()[source]

Returns: The name of the output column
Return type: outputCol

getWeights()[source]

Returns: the dataframe of weights
Return type: weights

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')

outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')

classmethod read()[source]: Returns an MLReader instance for this class.

setBroadcastJoin(value)[source]

Parameters: broadcastJoin¶ – whether to broadcast join

setInputCol(value)[source]

Parameters: inputCol¶ – The name of the input column

setOutputCol(value)[source]

Parameters: outputCol¶ – The name of the output column

setParams(broadcastJoin=None, inputCol=None, outputCol=None, weights=None)[source]: Set the (keyword only) parameters

setWeights(value)[source]

Parameters: weights¶ – the dataframe of weights

weights = Param(parent='undefined', name='weights', doc='the dataframe of weights')

synapse.ml.stages.DropColumns module

class synapse.ml.stages.DropColumns.DropColumns(java_obj=None, cols=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters: cols (list) – Comma-separated list of column names

cols = Param(parent='undefined', name='cols', doc='Comma separated list of column names')

getCols()[source]

Returns: Comma separated list of column names
Return type: cols

static getJavaPackage()[source]: Returns package name String.

classmethod read()[source]: Returns an MLReader instance for this class.

setCols(value)[source]

Parameters: cols¶ – Comma separated list of column names

setParams(cols=None)[source]: Set the (keyword only) parameters

synapse.ml.stages.DynamicMiniBatchTransformer module

class synapse.ml.stages.DynamicMiniBatchTransformer.DynamicMiniBatchTransformer(java_obj=None, maxBatchSize=2147483647)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters: maxBatchSize (int) – The max size of the buffer

static getJavaPackage()[source]: Returns package name String.

getMaxBatchSize()[source]

Returns: The max size of the buffer
Return type: maxBatchSize

maxBatchSize = Param(parent='undefined', name='maxBatchSize', doc='The max size of the buffer')

classmethod read()[source]: Returns an MLReader instance for this class.

setMaxBatchSize(value)[source]

Parameters: maxBatchSize¶ – The max size of the buffer

setParams(maxBatchSize=2147483647)[source]: Set the (keyword only) parameters

synapse.ml.stages.EnsembleByKey module

class synapse.ml.stages.EnsembleByKey.EnsembleByKey(java_obj=None, colNames=None, collapseGroup=True, cols=None, keys=None, strategy='mean', vectorDims=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

colNames (list) – Names of the result of each column
collapseGroup (bool) – Whether to collapse all items in a group to one entry
cols (list) – Columns to ensemble
keys (list) – Keys to group by
strategy (str) – How to ensemble the scores, e.g. mean
vectorDims (dict) – The dimensions of any vector columns, used to avoid materialization

colNames = Param(parent='undefined', name='colNames', doc='Names of the result of each col')

collapseGroup = Param(parent='undefined', name='collapseGroup', doc='Whether to collapse all items in group to one entry')

cols = Param(parent='undefined', name='cols', doc='Cols to ensemble')

getColNames()[source]

Returns: Names of the result of each col
Return type: colNames

getCollapseGroup()[source]

Returns: Whether to collapse all items in group to one entry
Return type: collapseGroup

getCols()[source]

Returns: Cols to ensemble
Return type: cols

static getJavaPackage()[source]: Returns package name String.

getKeys()[source]

Returns: Keys to group by
Return type: keys

getStrategy()[source]

Returns: How to ensemble the scores, ex: mean
Return type: strategy

getVectorDims()[source]

Returns: the dimensions of any vector columns, used to avoid materialization
Return type: vectorDims

keys = Param(parent='undefined', name='keys', doc='Keys to group by')

classmethod read()[source]: Returns an MLReader instance for this class.

setColNames(value)[source]

Parameters: colNames¶ – Names of the result of each col

setCollapseGroup(value)[source]

Parameters: collapseGroup¶ – Whether to collapse all items in group to one entry

setCols(value)[source]

Parameters: cols¶ – Cols to ensemble

setKeys(value)[source]

Parameters: keys¶ – Keys to group by

setParams(colNames=None, collapseGroup=True, cols=None, keys=None, strategy='mean', vectorDims=None)[source]: Set the (keyword only) parameters

setStrategy(value)[source]

Parameters: strategy¶ – How to ensemble the scores, ex: mean

setVectorDims(value)[source]

Parameters: vectorDims¶ – the dimensions of any vector columns, used to avoid materialization

strategy = Param(parent='undefined', name='strategy', doc='How to ensemble the scores, ex: mean')

vectorDims = Param(parent='undefined', name='vectorDims', doc='the dimensions of any vector columns, used to avoid materialization')
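
The parameters above describe grouping rows by keys and combining the cols with a strategy such as mean. The following plain-Python sketch shows that idea on a list of dicts. The function ensemble_by_key is hypothetical and only covers scalar columns with the mean strategy and one collapsed entry per group; it is not the library's implementation.

```python
from collections import defaultdict
from statistics import mean


def ensemble_by_key(rows, keys, cols, col_names):
    """Group rows by `keys` and average each column in `cols` within a group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    out = []
    for key, members in groups.items():
        merged = dict(zip(keys, key))  # one collapsed entry per group
        for col, name in zip(cols, col_names):
            merged[name] = mean(r[col] for r in members)
        out.append(merged)
    return out


rows = [
    {"id": 1, "score": 0.25},
    {"id": 1, "score": 0.75},
    {"id": 2, "score": 1.0},
]
result = ensemble_by_key(rows, keys=["id"], cols=["score"], col_names=["avg_score"])
```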

synapse.ml.stages.Explode module

class synapse.ml.stages.Explode.Explode(java_obj=None, inputCol=None, outputCol='Explode_8102a0bb2741_output')[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

inputCol (str) – The name of the input column
outputCol (str) – The name of the output column

getInputCol()[source]

Returns: The name of the input column
Return type: inputCol

static getJavaPackage()[source]: Returns package name String.

getOutputCol()[source]

Returns: The name of the output column
Return type: outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')

outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')

classmethod read()[source]: Returns an MLReader instance for this class.

setInputCol(value)[source]

Parameters: inputCol¶ – The name of the input column

setOutputCol(value)[source]

Parameters: outputCol¶ – The name of the output column

setParams(inputCol=None, outputCol='Explode_8102a0bb2741_output')[source]: Set the (keyword only) parameters
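
As a rough illustration of exploding a column, the sketch below turns a row whose input column holds a list into one row per element. It is plain Python with a hypothetical explode helper. Keeping the original list column and dropping rows with empty lists are assumptions modeled on the usual explode semantics, not statements about this class.

```python
def explode(rows, input_col, output_col):
    """Yield one output row per element of the list stored in `input_col`."""
    for row in rows:
        for item in row[input_col]:
            yield {**row, output_col: item}


rows = [{"id": 1, "tags": ["x", "y"]}, {"id": 2, "tags": []}]
exploded = list(explode(rows, "tags", "tag"))
```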

synapse.ml.stages.FixedMiniBatchTransformer module

class synapse.ml.stages.FixedMiniBatchTransformer.FixedMiniBatchTransformer(java_obj=None, batchSize=None, buffered=False, maxBufferSize=2147483647)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

batchSize (int) – The max size of the buffer
buffered (bool) – Whether or not to buffer batches in memory
maxBufferSize (int) – The max size of the buffer

batchSize = Param(parent='undefined', name='batchSize', doc='The max size of the buffer')

buffered = Param(parent='undefined', name='buffered', doc='Whether or not to buffer batches in memory')

getBatchSize()[source]

Returns: The max size of the buffer
Return type: batchSize

getBuffered()[source]

Returns: Whether or not to buffer batches in memory
Return type: buffered

static getJavaPackage()[source]: Returns package name String.

getMaxBufferSize()[source]

Returns: The max size of the buffer
Return type: maxBufferSize

maxBufferSize = Param(parent='undefined', name='maxBufferSize', doc='The max size of the buffer')

classmethod read()[source]: Returns an MLReader instance for this class.

setBatchSize(value)[source]

Parameters: batchSize¶ – The max size of the buffer

setBuffered(value)[source]

Parameters: buffered¶ – Whether or not to buffer batches in memory

setMaxBufferSize(value)[source]

Parameters: maxBufferSize¶ – The max size of the buffer

setParams(batchSize=None, buffered=False, maxBufferSize=2147483647)[source]: Set the (keyword only) parameters
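
A minimal sketch of fixed-size mini-batching, assuming that a batch is a single row in which every column holds a list of the original values. The helper fixed_mini_batches is hypothetical and ignores the buffered and maxBufferSize options.

```python
from itertools import islice


def fixed_mini_batches(rows, batch_size):
    """Group consecutive rows into batches of at most `batch_size` rows,
    turning each column into a list of values."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield {col: [r[col] for r in chunk] for col in chunk[0]}


rows = [{"id": i, "text": f"row{i}"} for i in range(5)]
batches = list(fixed_mini_batches(rows, batch_size=2))
```

With five rows and a batch size of two, the last batch holds the single leftover row.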

synapse.ml.stages.FlattenBatch module

class synapse.ml.stages.FlattenBatch.FlattenBatch(java_obj=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

static getJavaPackage()[source]: Returns package name String.

classmethod read()[source]: Returns an MLReader instance for this class.

setParams()[source]: Set the (keyword only) parameters
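
Flattening is the inverse of mini-batching. The sketch below assumes a batch is a dict of equal-length lists and expands it back into one dict per row. The helper flatten_batch is hypothetical and only illustrates the concept.

```python
def flatten_batch(batches):
    """Expand each batch (a dict of equal-length lists) back into single rows."""
    for batch in batches:
        cols = list(batch)
        for values in zip(*(batch[c] for c in cols)):
            yield dict(zip(cols, values))


batches = [{"id": [0, 1], "text": ["a", "b"]}, {"id": [2], "text": ["c"]}]
rows = list(flatten_batch(batches))
```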

synapse.ml.stages.Lambda module

class synapse.ml.stages.Lambda.Lambda(java_obj=None, transformFunc=None, transformSchemaFunc=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

transformFunc (object) – Holder for the DataFrame function
transformSchemaFunc (object) – The output schema after the transformation

static getJavaPackage()[source]: Returns package name String.

getTransformFunc()[source]

Returns: holder for dataframe function
Return type: transformFunc

getTransformSchemaFunc()[source]

Returns: the output schema after the transformation
Return type: transformSchemaFunc

classmethod read()[source]: Returns an MLReader instance for this class.

setParams(transformFunc=None, transformSchemaFunc=None)[source]: Set the (keyword only) parameters

setTransformFunc(value)[source]

Parameters: transformFunc¶ – holder for dataframe function

setTransformSchemaFunc(value)[source]

Parameters: transformSchemaFunc¶ – the output schema after the transformation

transformFunc = Param(parent='undefined', name='transformFunc', doc='holder for dataframe function')

transformSchemaFunc = Param(parent='undefined', name='transformSchemaFunc', doc='the output schema after the transformation')
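
The two parameters suggest a stage built from a pair of functions: one that transforms the data and one that describes the resulting schema. The class LambdaStage below is a hypothetical plain-Python stand-in for that pattern, using lists of dicts for data and lists of field names for schemas. It does not reproduce the real class or its types.

```python
class LambdaStage:
    """Hypothetical stand-in: holds a data function and a schema function."""

    def __init__(self, transform_func, transform_schema_func):
        self.transform_func = transform_func
        self.transform_schema_func = transform_schema_func

    def transform(self, rows):
        return self.transform_func(rows)

    def transform_schema(self, schema):
        return self.transform_schema_func(schema)


# Keep only the "id" field, in both the data and the schema.
stage = LambdaStage(
    transform_func=lambda rows: [{"id": r["id"]} for r in rows],
    transform_schema_func=lambda schema: [f for f in schema if f == "id"],
)
out_rows = stage.transform([{"id": 1, "tmp": "x"}])
out_schema = stage.transform_schema(["id", "tmp"])
```

Keeping the schema function separate lets a pipeline check the output columns without running the data transformation.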

synapse.ml.stages.MultiColumnAdapter module

class synapse.ml.stages.MultiColumnAdapter.MultiColumnAdapter(java_obj=None, baseStage=None, inputCols=None, outputCols=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters

baseStage (object) – Base pipeline stage to apply to every column
inputCols (list) – List of column names encoded as a string
outputCols (list) – List of column names encoded as a string

baseStage = Param(parent='undefined', name='baseStage', doc='base pipeline stage to apply to every column')

getBaseStage()[source]

Returns: base pipeline stage to apply to every column
Return type: baseStage

getInputCols()[source]

Returns: list of column names encoded as a string
Return type: inputCols

static getJavaPackage()[source]: Returns package name String.

getOutputCols()[source]

Returns: list of column names encoded as a string
Return type: outputCols

inputCols = Param(parent='undefined', name='inputCols', doc='list of column names encoded as a string')

outputCols = Param(parent='undefined', name='outputCols', doc='list of column names encoded as a string')

classmethod read()[source]: Returns an MLReader instance for this class.

setBaseStage(value)[source]

Parameters: baseStage¶ – base pipeline stage to apply to every column

setInputCols(value)[source]

Parameters: inputCols¶ – list of column names encoded as a string

setOutputCols(value)[source]

Parameters: outputCols¶ – list of column names encoded as a string

setParams(baseStage=None, inputCols=None, outputCols=None)[source]: Set the (keyword only) parameters
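
The adapter pattern described here applies one base stage to several columns, pairing each input column with an output column. The sketch below shows that pairing in plain Python, with an ordinary function standing in for the base stage. The helper apply_to_columns and the length check are assumptions for illustration.

```python
def apply_to_columns(rows, base_fn, input_cols, output_cols):
    """Apply `base_fn` to each input column, writing to the paired output column."""
    if len(input_cols) != len(output_cols):
        raise ValueError("input_cols and output_cols must have the same length")
    return [
        {**row, **{out: base_fn(row[inp]) for inp, out in zip(input_cols, output_cols)}}
        for row in rows
    ]


rows = [{"a": "Hello", "b": "World"}]
result = apply_to_columns(rows, str.lower, ["a", "b"], ["a_lower", "b_lower"])
```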

synapse.ml.stages.PartitionConsolidator module

class synapse.ml.stages.PartitionConsolidator.PartitionConsolidator(java_obj=None, concurrency=1, concurrentTimeout=None, inputCol=None, outputCol=None, timeout=60.0)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

concurrency (int) – Max number of concurrent calls
concurrentTimeout (float) – Max number of seconds to wait on futures if concurrency >= 1
inputCol (str) – The name of the input column
outputCol (str) – The name of the output column
timeout (float) – Number of seconds to wait before closing the connection

concurrency = Param(parent='undefined', name='concurrency', doc='max number of concurrent calls')

concurrentTimeout = Param(parent='undefined', name='concurrentTimeout', doc='max number seconds to wait on futures if concurrency >= 1')

getConcurrency()[source]

Returns: max number of concurrent calls
Return type: concurrency

getConcurrentTimeout()[source]

Returns: Max number of seconds to wait on futures if concurrency >= 1
Return type: float

getInputCol()[source]

Returns: The name of the input column
Return type: inputCol

static getJavaPackage()[source]: Returns package name String.

getOutputCol()[source]

Returns: The name of the output column
Return type: outputCol

getTimeout()[source]

Returns: number of seconds to wait before closing the connection
Return type: timeout

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')

outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')

classmethod read()[source]: Returns an MLReader instance for this class.

setConcurrency(value)[source]

Parameters: concurrency¶ – max number of concurrent calls

setConcurrentTimeout(value)[source]

Parameters: concurrentTimeout – Max number of seconds to wait on futures if concurrency >= 1

setInputCol(value)[source]

Parameters: inputCol¶ – The name of the input column

setOutputCol(value)[source]

Parameters: outputCol¶ – The name of the output column

setParams(concurrency=1, concurrentTimeout=None, inputCol=None, outputCol=None, timeout=60.0)[source]: Set the (keyword only) parameters

setTimeout(value)[source]

Parameters: timeout¶ – number of seconds to wait before closing the connection

timeout = Param(parent='undefined', name='timeout', doc='number of seconds to wait before closing the connection')

synapse.ml.stages.RenameColumn module

class synapse.ml.stages.RenameColumn.RenameColumn(java_obj=None, inputCol=None, outputCol=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

inputCol (str) – The name of the input column
outputCol (str) – The name of the output column

getInputCol()[source]

Returns: The name of the input column
Return type: inputCol

static getJavaPackage()[source]: Returns package name String.

getOutputCol()[source]

Returns: The name of the output column
Return type: outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')

outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')

classmethod read()[source]: Returns an MLReader instance for this class.

setInputCol(value)[source]

Parameters: inputCol¶ – The name of the input column

setOutputCol(value)[source]

Parameters: outputCol¶ – The name of the output column

setParams(inputCol=None, outputCol=None)[source]: Set the (keyword only) parameters

synapse.ml.stages.Repartition module

class synapse.ml.stages.Repartition.Repartition(java_obj=None, disable=False, n=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

disable (bool) – Whether to disable repartitioning (so that one can turn it off for evaluation)
n (int) – Number of partitions

disable = Param(parent='undefined', name='disable', doc='Whether to disable repartitioning (so that one can turn it off for evaluation)')

getDisable()[source]

Returns: Whether to disable repartitioning (so that one can turn it off for evaluation)
Return type: disable

static getJavaPackage()[source]: Returns package name String.

getN()[source]

Returns: Number of partitions
Return type: n

n = Param(parent='undefined', name='n', doc='Number of partitions')

classmethod read()[source]: Returns an MLReader instance for this class.

setDisable(value)[source]

Parameters: disable¶ – Whether to disable repartitioning (so that one can turn it off for evaluation)

setN(value)[source]

Parameters: n¶ – Number of partitions

setParams(disable=False, n=None)[source]: Set the (keyword only) parameters

synapse.ml.stages.SelectColumns module

class synapse.ml.stages.SelectColumns.SelectColumns(java_obj=None, cols=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters: cols (list) – Comma-separated list of selected column names

cols = Param(parent='undefined', name='cols', doc='Comma separated list of selected column names')

getCols()[source]

Returns: Comma separated list of selected column names
Return type: cols

static getJavaPackage()[source]: Returns package name String.

classmethod read()[source]: Returns an MLReader instance for this class.

setCols(value)[source]

Parameters: cols¶ – Comma separated list of selected column names

setParams(cols=None)[source]: Set the (keyword only) parameters

synapse.ml.stages.StratifiedRepartition module

class synapse.ml.stages.StratifiedRepartition.StratifiedRepartition(java_obj=None, labelCol=None, mode='mixed', seed=1518410069)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

labelCol (str) – The name of the label column
mode (str) – Specify 'equal' to repartition with replacement across all labels, 'original' to keep the ratios in the original dataset, or 'mixed' to use a heuristic
seed (long) – Random seed

static getJavaPackage()[source]: Returns package name String.

getLabelCol()[source]

Returns: The name of the label column
Return type: labelCol

getMode()[source]

Returns: Specify equal to repartition with replacement across all labels, specify original to keep the ratios in the original dataset, or specify mixed to use a heuristic
Return type: mode

getSeed()[source]

Returns: random seed
Return type: seed

labelCol = Param(parent='undefined', name='labelCol', doc='The name of the label column')

mode = Param(parent='undefined', name='mode', doc='Specify equal to repartition with replacement across all labels, specify original to keep the ratios in the original dataset, or specify mixed to use a heuristic')

classmethod read()[source]: Returns an MLReader instance for this class.

seed = Param(parent='undefined', name='seed', doc='random seed')

setLabelCol(value)[source]

Parameters: labelCol¶ – The name of the label column

setMode(value)[source]

Parameters: mode¶ – Specify equal to repartition with replacement across all labels, specify original to keep the ratios in the original dataset, or specify mixed to use a heuristic

setParams(labelCol=None, mode='mixed', seed=1518410069)[source]: Set the (keyword only) parameters

setSeed(value)[source]

Parameters: seed¶ – random seed

synapse.ml.stages.SummarizeData module

class synapse.ml.stages.SummarizeData.SummarizeData(java_obj=None, basic=True, counts=True, errorThreshold=0.0, percentiles=True, sample=True)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

basic (bool) – Compute basic statistics
counts (bool) – Compute count statistics
errorThreshold (float) – Error threshold for quantiles; 0 is exact
percentiles (bool) – Compute percentiles
sample (bool) – Compute sample statistics

basic = Param(parent='undefined', name='basic', doc='Compute basic statistics')

counts = Param(parent='undefined', name='counts', doc='Compute count statistics')

errorThreshold = Param(parent='undefined', name='errorThreshold', doc='Threshold for quantiles - 0 is exact')

getBasic()[source]

Returns: Compute basic statistics
Return type: basic

getCounts()[source]

Returns: Compute count statistics
Return type: counts

getErrorThreshold()[source]

Returns: Threshold for quantiles - 0 is exact
Return type: errorThreshold

static getJavaPackage()[source]: Returns package name String.

getPercentiles()[source]

Returns: Compute percentiles
Return type: percentiles

getSample()[source]

Returns: Compute sample statistics
Return type: sample

percentiles = Param(parent='undefined', name='percentiles', doc='Compute percentiles')

classmethod read()[source]: Returns an MLReader instance for this class.

sample = Param(parent='undefined', name='sample', doc='Compute sample statistics')

setBasic(value)[source]

Parameters: basic¶ – Compute basic statistics

setCounts(value)[source]

Parameters: counts¶ – Compute count statistics

setErrorThreshold(value)[source]

Parameters: errorThreshold¶ – Threshold for quantiles - 0 is exact

setParams(basic=True, counts=True, errorThreshold=0.0, percentiles=True, sample=True)[source]: Set the (keyword only) parameters

setPercentiles(value)[source]

Parameters: percentiles¶ – Compute percentiles

setSample(value)[source]

Parameters: sample¶ – Compute sample statistics
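
The flags above switch groups of statistics on and off. The following plain-Python sketch illustrates that design for a single numeric column. Which statistics belong to which flag is an assumption made for this example, the sample group is omitted, and the helper summarize is hypothetical.

```python
import statistics


def summarize(values, basic=True, counts=True, percentiles=True):
    """Return a dict of summary statistics for one numeric column."""
    present = [v for v in values if v is not None]
    summary = {}
    if counts:
        summary["count"] = len(present)
        summary["missing"] = len(values) - len(present)
        summary["unique"] = len(set(present))
    if basic:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.mean(present)
    if percentiles:
        # Exact median, analogous to an error threshold of 0.
        summary["median"] = statistics.median(present)
    return summary


stats = summarize([1, 2, 2, None, 5])
counts_only = summarize([1, 2, 2, None, 5], basic=False, percentiles=False)
```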

synapse.ml.stages.TextPreprocessor module

class synapse.ml.stages.TextPreprocessor.TextPreprocessor(java_obj=None, inputCol=None, map=None, normFunc=None, outputCol=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

inputCol (str) – The name of the input column
map (dict) – Map of substring match to replacement
normFunc (str) – Name of normalization function to apply
outputCol (str) – The name of the output column

getInputCol()[source]

Returns: The name of the input column
Return type: inputCol

static getJavaPackage()[source]: Returns package name String.

getMap()[source]

Returns: Map of substring match to replacement
Return type: map

getNormFunc()[source]

Returns: Name of normalization function to apply
Return type: normFunc

getOutputCol()[source]

Returns: The name of the output column
Return type: outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')

map = Param(parent='undefined', name='map', doc='Map of substring match to replacement')

normFunc = Param(parent='undefined', name='normFunc', doc='Name of normalization function to apply')

outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')

classmethod read()[source]: Returns an MLReader instance for this class.

setInputCol(value)[source]

Parameters: inputCol¶ – The name of the input column

setMap(value)[source]

Parameters: map¶ – Map of substring match to replacement

setNormFunc(value)[source]

Parameters: normFunc¶ – Name of normalization function to apply

setOutputCol(value)[source]

Parameters: outputCol¶ – The name of the output column

setParams(inputCol=None, map=None, normFunc=None, outputCol=None)[source]: Set the (keyword only) parameters
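
To illustrate a map of substring matches to replacements combined with a normalization step, the sketch below normalizes the text first and then scans it left to right, preferring the longest matching key at each position. Both the ordering and the longest-match rule are assumptions. The helper preprocess is hypothetical and takes a callable, whereas the normFunc parameter above is a function name given as a string.

```python
def preprocess(text, mapping, norm_func=str.lower):
    """Normalize `text`, then replace substrings using longest-match-first."""
    text = norm_func(text)
    keys = sorted(mapping, key=len, reverse=True)  # prefer longer matches
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No key matches at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


result = preprocess("The Happy cat", {"happy": "sad", "hap": "X"})
```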

synapse.ml.stages.TimeIntervalMiniBatchTransformer module

class synapse.ml.stages.TimeIntervalMiniBatchTransformer.TimeIntervalMiniBatchTransformer(java_obj=None, maxBatchSize=2147483647, millisToWait=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

maxBatchSize (int) – The max size of the buffer
millisToWait (int) – The time to wait before constructing a batch

static getJavaPackage()[source]: Returns package name String.

getMaxBatchSize()[source]

Returns: The max size of the buffer
Return type: maxBatchSize

getMillisToWait()[source]

Returns: The time to wait before constructing a batch
Return type: millisToWait

maxBatchSize = Param(parent='undefined', name='maxBatchSize', doc='The max size of the buffer')

millisToWait = Param(parent='undefined', name='millisToWait', doc='The time to wait before constructing a batch')

classmethod read()[source]: Returns an MLReader instance for this class.

setMaxBatchSize(value)[source]

Parameters: maxBatchSize¶ – The max size of the buffer

setMillisToWait(value)[source]

Parameters: millisToWait¶ – The time to wait before constructing a batch

setParams(maxBatchSize=2147483647, millisToWait=None)[source]: Set the (keyword only) parameters

synapse.ml.stages.Timer module

class synapse.ml.stages.Timer.Timer(java_obj=None, disableMaterialization=True, logToScala=True, stage=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters

disableMaterialization (bool) – Whether to disable timing (so that one can turn it off for evaluation)
logToScala (bool) – Whether to output the time to the Scala console
stage (object) – The stage to time

disableMaterialization = Param(parent='undefined', name='disableMaterialization', doc='Whether to disable timing (so that one can turn it off for evaluation)')

getDisableMaterialization()[source]

Returns: Whether to disable timing (so that one can turn it off for evaluation)
Return type: disableMaterialization

static getJavaPackage()[source]: Returns package name String.

getLogToScala()[source]

Returns: Whether to output the time to the scala console
Return type: logToScala

getStage()[source]

Returns: The stage to time
Return type: stage

logToScala = Param(parent='undefined', name='logToScala', doc='Whether to output the time to the scala console')

classmethod read()[source]: Returns an MLReader instance for this class.

setDisableMaterialization(value)[source]

Parameters: disableMaterialization¶ – Whether to disable timing (so that one can turn it off for evaluation)

setLogToScala(value)[source]

Parameters: logToScala¶ – Whether to output the time to the scala console

setParams(disableMaterialization=True, logToScala=True, stage=None)[source]: Set the (keyword only) parameters

setStage(value)[source]

Parameters: stage¶ – The stage to time

stage = Param(parent='undefined', name='stage', doc='The stage to time')
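
The pattern described here is a wrapper that times another stage. A minimal plain-Python sketch of that pattern follows. The class TimedStage, its log flag, and the last_elapsed attribute are hypothetical, and a plain function stands in for the wrapped stage.

```python
import time


class TimedStage:
    """Hypothetical wrapper that records how long the wrapped callable takes."""

    def __init__(self, stage, log=True):
        self.stage = stage
        self.log = log
        self.last_elapsed = None

    def transform(self, rows):
        start = time.perf_counter()
        result = self.stage(rows)
        self.last_elapsed = time.perf_counter() - start
        if self.log:
            name = getattr(self.stage, "__name__", "stage")
            print(f"{name} took {self.last_elapsed:.6f}s")
        return result


def double(rows):
    return [r * 2 for r in rows]


timed = TimedStage(double)
output = timed.transform([1, 2, 3])
```

Because the wrapper returns the wrapped result unchanged, it can be placed in or removed from a pipeline without affecting the data.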

synapse.ml.stages.TimerModel module

class synapse.ml.stages.TimerModel.TimerModel(java_obj=None, disableMaterialization=True, logToScala=True, stage=None, transformer=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaModel

Parameters

disableMaterialization (bool) – Whether to disable timing (so that one can turn it off for evaluation)
logToScala (bool) – Whether to output the time to the Scala console
stage (object) – The stage to time
transformer (object) – Inner model to time

disableMaterialization = Param(parent='undefined', name='disableMaterialization', doc='Whether to disable timing (so that one can turn it off for evaluation)')

getDisableMaterialization()[source]

Returns: Whether to disable timing (so that one can turn it off for evaluation)
Return type: disableMaterialization

static getJavaPackage()[source]: Returns package name String.

getLogToScala()[source]

Returns: Whether to output the time to the scala console
Return type: logToScala

getStage()[source]

Returns: The stage to time
Return type: stage

getTransformer()[source]

Returns: inner model to time
Return type: transformer

logToScala = Param(parent='undefined', name='logToScala', doc='Whether to output the time to the scala console')

classmethod read()[source]: Returns an MLReader instance for this class.

setDisableMaterialization(value)[source]

Parameters: disableMaterialization¶ – Whether to disable timing (so that one can turn it off for evaluation)

setLogToScala(value)[source]

Parameters: logToScala¶ – Whether to output the time to the scala console

setParams(disableMaterialization=True, logToScala=True, stage=None, transformer=None)[source]: Set the (keyword only) parameters

setStage(value)[source]

Parameters: stage¶ – The stage to time

setTransformer(value)[source]

Parameters: transformer¶ – The inner model to time

stage = Param(parent='undefined', name='stage', doc='The stage to time')

transformer = Param(parent='undefined', name='transformer', doc='inner model to time')

synapse.ml.stages.UDFTransformer module

class synapse.ml.stages.UDFTransformer.UDFTransformer(inputCol=None, inputCols=None, outputCol=None, udf=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

inputCol¶ (str) – The name of the input column
inputCols¶ (list) – The names of the input columns
outputCol¶ (str) – The name of the output column
udf¶ (object) – User-defined Python function to be applied to the DataFrame input column
udfScala¶ (object) – User-defined function to be applied to the DataFrame input column

getInputCol()[source]

Returns: The name of the input column
Return type: str

getInputCols()[source]

Returns: The names of the input columns
Return type: list

static getJavaPackage()[source]: Returns package name String.

getOutputCol()[source]

Returns: The name of the output column
Return type: str

getUDF()[source]

classmethod read()[source]: Returns an MLReader instance for this class.

setInputCol(value)[source]

Parameters: inputCol¶ (str) – The name of the input column

setInputCols(value)[source]

Parameters: inputCols¶ (list) – The names of the input columns

setOutputCol(value)[source]

Parameters: outputCol¶ (str) – The name of the output column

setUDF(udf)[source]
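
The constructor signature above suggests the following usage pattern, shown here as a minimal sketch rather than a definitive recipe. The helper names to_upper and build_transformer and the column names "text" and "text_upper" are hypothetical. The sketch assumes that the udf argument expects a PySpark user-defined function created with pyspark.sql.functions.udf, and that an active Spark session with the SynapseML package is available when build_transformer is called.

```python
def to_upper(s):
    """Plain Python function applied to each value of the input column."""
    return None if s is None else s.upper()


def build_transformer(input_col="text", output_col="text_upper"):
    """Wrap to_upper in a UDFTransformer (hypothetical helper).

    Imports are deferred so that to_upper can be exercised without Spark.
    Assumption: `udf` takes a PySpark user-defined function.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    from synapse.ml.stages.UDFTransformer import UDFTransformer

    return UDFTransformer(
        inputCol=input_col,
        outputCol=output_col,
        udf=udf(to_upper, StringType()),
    )
```

With a DataFrame df that has a string column named "text", build_transformer().transform(df) would be expected to add the "text_upper" column. Keeping the function itself free of Spark imports makes it easy to unit test before it is wrapped.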

synapse.ml.stages.UnicodeNormalize module

class synapse.ml.stages.UnicodeNormalize.UnicodeNormalize(java_obj=None, form=None, inputCol=None, lower=None, outputCol=None)[source]

Bases: synapse.ml.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

form¶ (str) – Unicode normalization form: NFC, NFD, NFKC, NFKD
inputCol¶ (str) – The name of the input column
lower¶ (bool) – Whether to lowercase the text
outputCol¶ (str) – The name of the output column

form = Param(parent='undefined', name='form', doc='Unicode normalization form: NFC, NFD, NFKC, NFKD')

getForm()[source]

Returns: Unicode normalization form: NFC, NFD, NFKC, NFKD
Return type: str

getInputCol()[source]

Returns: The name of the input column
Return type: str

static getJavaPackage()[source]: Returns package name String.

getLower()[source]

Returns: Whether to lowercase the text
Return type: bool

getOutputCol()[source]

Returns: The name of the output column
Return type: str

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')

lower = Param(parent='undefined', name='lower', doc='Lowercase text')

outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')

classmethod read()[source]: Returns an MLReader instance for this class.

setForm(value)[source]

Parameters: form¶ – Unicode normalization form: NFC, NFD, NFKC, NFKD

setInputCol(value)[source]

Parameters: inputCol¶ – The name of the input column

setLower(value)[source]

Parameters: lower¶ – Lowercase text

setOutputCol(value)[source]

Parameters: outputCol¶ – The name of the output column

setParams(form=None, inputCol=None, lower=None, outputCol=None)[source]: Set the (keyword only) parameters
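
The four values accepted by form are the standard Unicode normalization forms. The sketch below illustrates how they differ using Python's standard-library unicodedata module; it does not call SynapseML. The helper name normalize is hypothetical, and applying lowercasing after normalization is an assumption made for illustration, not a statement about how the transformer orders the two steps.

```python
import unicodedata


def normalize(text, form="NFKD", lower=True):
    """Plain-Python analogue (hypothetical): normalize, then optionally lowercase."""
    out = unicodedata.normalize(form, text)
    return out.lower() if lower else out


composed = "\u00e9"      # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# NFD decomposes characters; NFC recomposes them.
assert normalize(composed, "NFD", lower=False) == decomposed
assert normalize(decomposed, "NFC", lower=False) == composed

# The compatibility forms (NFKC, NFKD) also fold characters such as the
# ligature U+FB01 into their plain equivalents; NFC leaves it untouched.
assert normalize("\ufb01", "NFKC", lower=False) == "fi"
assert normalize("\ufb01", "NFC", lower=False) == "\ufb01"
```

Normalizing text before comparison or tokenization ensures that strings which render identically but use different code point sequences are treated as equal.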

Module contents

SynapseML is an ecosystem of tools that expands the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.