synapse.ml.stages package

Submodules

synapse.ml.stages.Cacher module

class synapse.ml.stages.Cacher.Cacher(java_obj=None, disable=False)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters

disable (bool) – Whether or not to disable caching (so that you can turn it off during evaluation)

disable = Param(parent='undefined', name='disable', doc='Whether or not to disable caching (so that you can turn it off during evaluation)')
getDisable()[source]
Returns

Whether or not to disable caching (so that you can turn it off during evaluation)

Return type

disable

static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.

setDisable(value)[source]
Parameters

disable – Whether or not to disable caching (so that you can turn it off during evaluation)

setParams(disable=False)[source]

Set the (keyword only) parameters
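Conceptually, Cacher persists the DataFrame flowing through the pipeline so that downstream stages reuse the materialized result, unless disable is set. The gating can be sketched in plain Python (a hypothetical stand-in; the real transformer caches on the Spark side):

```python
class CacherSketch:
    """Hypothetical plain-Python analogue of Cacher's disable gate."""

    def __init__(self, disable=False):
        self.disable = disable
        self._cache = None

    def transform(self, rows):
        if self.disable:
            return rows               # pass the data through untouched
        if self._cache is None:
            self._cache = list(rows)  # materialize once, reuse afterwards
        return self._cache

cacher = CacherSketch()
out1 = cacher.transform(iter([1, 2, 3]))
out2 = cacher.transform(iter([4, 5, 6]))  # served from the cache: [1, 2, 3]
```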

synapse.ml.stages.ClassBalancer module

class synapse.ml.stages.ClassBalancer.ClassBalancer(java_obj=None, broadcastJoin=True, inputCol=None, outputCol='weight')[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • broadcastJoin (bool) – Whether to broadcast the class-to-weight mapping to the workers

  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

broadcastJoin = Param(parent='undefined', name='broadcastJoin', doc='Whether to broadcast the class-to-weight mapping to the workers')
getBroadcastJoin()[source]
Returns

Whether to broadcast the class-to-weight mapping to the workers

Return type

broadcastJoin

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setBroadcastJoin(value)[source]
Parameters

broadcastJoin – Whether to broadcast the class-to-weight mapping to the workers

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(broadcastJoin=True, inputCol=None, outputCol='weight')[source]

Set the (keyword only) parameters
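ClassBalancer appends a weight column (outputCol, 'weight' by default) derived from how often each value of inputCol occurs. One plausible inverse-frequency weighting, an assumption since the exact formula is not given here, can be sketched as:

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by max_count / count (assumed inverse-frequency scheme)."""
    counts = Counter(labels)
    biggest = max(counts.values())
    return {label: biggest / n for label, n in counts.items()}

weights = class_weights(["a", "a", "a", "b"])  # {"a": 1.0, "b": 3.0}
```

Under this scheme, rare classes receive larger weights, so a downstream learner treats the classes more evenly.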

synapse.ml.stages.ClassBalancerModel module

class synapse.ml.stages.ClassBalancerModel.ClassBalancerModel(java_obj=None, broadcastJoin=None, inputCol=None, outputCol=None, weights=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • broadcastJoin (bool) – whether to broadcast join

  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

  • weights (object) – the dataframe of weights

broadcastJoin = Param(parent='undefined', name='broadcastJoin', doc='whether to broadcast join')
getBroadcastJoin()[source]
Returns

whether to broadcast join

Return type

broadcastJoin

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

getWeights()[source]
Returns

the dataframe of weights

Return type

weights

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setBroadcastJoin(value)[source]
Parameters

broadcastJoin – whether to broadcast join

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(broadcastJoin=None, inputCol=None, outputCol=None, weights=None)[source]

Set the (keyword only) parameters

setWeights(value)[source]
Parameters

weights – the dataframe of weights

weights = Param(parent='undefined', name='weights', doc='the dataframe of weights')

synapse.ml.stages.DropColumns module

class synapse.ml.stages.DropColumns.DropColumns(java_obj=None, cols=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters

cols (list) – Comma separated list of column names

cols = Param(parent='undefined', name='cols', doc='Comma separated list of column names')
getCols()[source]
Returns

Comma separated list of column names

Return type

cols

static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.

setCols(value)[source]
Parameters

cols – Comma separated list of column names

setParams(cols=None)[source]

Set the (keyword only) parameters

synapse.ml.stages.DynamicMiniBatchTransformer module

class synapse.ml.stages.DynamicMiniBatchTransformer.DynamicMiniBatchTransformer(java_obj=None, maxBatchSize=2147483647)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters

maxBatchSize (int) – The max size of the buffer

static getJavaPackage()[source]

Returns package name String.

getMaxBatchSize()[source]
Returns

The max size of the buffer

Return type

maxBatchSize

maxBatchSize = Param(parent='undefined', name='maxBatchSize', doc='The max size of the buffer')
classmethod read()[source]

Returns an MLReader instance for this class.

setMaxBatchSize(value)[source]
Parameters

maxBatchSize – The max size of the buffer

setParams(maxBatchSize=2147483647)[source]

Set the (keyword only) parameters

synapse.ml.stages.EnsembleByKey module

class synapse.ml.stages.EnsembleByKey.EnsembleByKey(java_obj=None, colNames=None, collapseGroup=True, cols=None, keys=None, strategy='mean', vectorDims=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • colNames (list) – Names of the result of each col

  • collapseGroup (bool) – Whether to collapse all items in group to one entry

  • cols (list) – Cols to ensemble

  • keys (list) – Keys to group by

  • strategy (str) – How to ensemble the scores, ex: mean

  • vectorDims (dict) – the dimensions of any vector columns, used to avoid materialization

colNames = Param(parent='undefined', name='colNames', doc='Names of the result of each col')
collapseGroup = Param(parent='undefined', name='collapseGroup', doc='Whether to collapse all items in group to one entry')
cols = Param(parent='undefined', name='cols', doc='Cols to ensemble')
getColNames()[source]
Returns

Names of the result of each col

Return type

colNames

getCollapseGroup()[source]
Returns

Whether to collapse all items in group to one entry

Return type

collapseGroup

getCols()[source]
Returns

Cols to ensemble

Return type

cols

static getJavaPackage()[source]

Returns package name String.

getKeys()[source]
Returns

Keys to group by

Return type

keys

getStrategy()[source]
Returns

How to ensemble the scores, ex: mean

Return type

strategy

getVectorDims()[source]
Returns

the dimensions of any vector columns, used to avoid materialization

Return type

vectorDims

keys = Param(parent='undefined', name='keys', doc='Keys to group by')
classmethod read()[source]

Returns an MLReader instance for this class.

setColNames(value)[source]
Parameters

colNames – Names of the result of each col

setCollapseGroup(value)[source]
Parameters

collapseGroup – Whether to collapse all items in group to one entry

setCols(value)[source]
Parameters

cols – Cols to ensemble

setKeys(value)[source]
Parameters

keys – Keys to group by

setParams(colNames=None, collapseGroup=True, cols=None, keys=None, strategy='mean', vectorDims=None)[source]

Set the (keyword only) parameters

setStrategy(value)[source]
Parameters

strategy – How to ensemble the scores, ex: mean

setVectorDims(value)[source]
Parameters

vectorDims – the dimensions of any vector columns, used to avoid materialization

strategy = Param(parent='undefined', name='strategy', doc='How to ensemble the scores, ex: mean')
vectorDims = Param(parent='undefined', name='vectorDims', doc='the dimensions of any vector columns, used to avoid materialization')
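EnsembleByKey groups rows by the key columns and combines each score column with the chosen strategy (mean by default). The grouping logic, sketched in plain Python with dict rows:

```python
from collections import defaultdict
from statistics import mean

def ensemble_by_key(rows, key, col, strategy=mean):
    """Group dict-rows by `key` and reduce `col` with `strategy` (mean here)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[col])
    return {k: strategy(v) for k, v in groups.items()}

rows = [{"id": 1, "score": 0.2}, {"id": 1, "score": 0.4}, {"id": 2, "score": 0.9}]
scores = ensemble_by_key(rows, key="id", col="score")  # id 1 averages to 0.3
```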

synapse.ml.stages.Explode module

class synapse.ml.stages.Explode.Explode(java_obj=None, inputCol=None, outputCol='Explode_413b65fe12a2_output')[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(inputCol=None, outputCol='Explode_413b65fe12a2_output')[source]

Set the (keyword only) parameters
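Explode emits one output row per element of an array-valued input column, in the spirit of Spark SQL's explode function. A plain-Python sketch (whether the input column is retained in the output is an assumption here):

```python
def explode(rows, input_col, output_col):
    """Yield one row per element of the list in input_col."""
    for row in rows:
        for item in row[input_col]:
            out = dict(row)          # keep the other columns of the row
            out[output_col] = item
            yield out

rows = [{"id": 1, "words": ["hi", "there"]}]
exploded = list(explode(rows, "words", "word"))  # two rows, one per word
```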

synapse.ml.stages.FixedMiniBatchTransformer module

class synapse.ml.stages.FixedMiniBatchTransformer.FixedMiniBatchTransformer(java_obj=None, batchSize=None, buffered=False, maxBufferSize=2147483647)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • batchSize (int) – The max size of the buffer

  • buffered (bool) – Whether or not to buffer batches in memory

  • maxBufferSize (int) – The max size of the buffer

batchSize = Param(parent='undefined', name='batchSize', doc='The max size of the buffer')
buffered = Param(parent='undefined', name='buffered', doc='Whether or not to buffer batches in memory')
getBatchSize()[source]
Returns

The max size of the buffer

Return type

batchSize

getBuffered()[source]
Returns

Whether or not to buffer batches in memory

Return type

buffered

static getJavaPackage()[source]

Returns package name String.

getMaxBufferSize()[source]
Returns

The max size of the buffer

Return type

maxBufferSize

maxBufferSize = Param(parent='undefined', name='maxBufferSize', doc='The max size of the buffer')
classmethod read()[source]

Returns an MLReader instance for this class.

setBatchSize(value)[source]
Parameters

batchSize – The max size of the buffer

setBuffered(value)[source]
Parameters

buffered – Whether or not to buffer batches in memory

setMaxBufferSize(value)[source]
Parameters

maxBufferSize – The max size of the buffer

setParams(batchSize=None, buffered=False, maxBufferSize=2147483647)[source]

Set the (keyword only) parameters
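FixedMiniBatchTransformer groups consecutive rows into batches of a fixed size, which helps when a downstream stage is more efficient on chunks than on single rows. The batching itself can be sketched as:

```python
def fixed_mini_batches(rows, batch_size):
    """Group an iterable into lists of at most batch_size elements."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # emit the final, possibly short, batch
        yield batch

batches = list(fixed_mini_batches(range(5), batch_size=2))
# [[0, 1], [2, 3], [4]]
```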

synapse.ml.stages.FlattenBatch module

class synapse.ml.stages.FlattenBatch.FlattenBatch(java_obj=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]


static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.

setParams()[source]

Set the (keyword only) parameters
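FlattenBatch performs the inverse of the mini-batch transformers: it unrolls array-valued columns back into one scalar row per element. A plain-Python sketch:

```python
def flatten_batches(batched_rows):
    """Turn rows whose values are equal-length lists back into scalar rows."""
    for row in batched_rows:
        columns = list(row)
        for values in zip(*(row[c] for c in columns)):
            yield dict(zip(columns, values))

batched = [{"id": [1, 2], "score": [0.1, 0.2]}]
flat = list(flatten_batches(batched))
# [{"id": 1, "score": 0.1}, {"id": 2, "score": 0.2}]
```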

synapse.ml.stages.Lambda module

class synapse.ml.stages.Lambda.Lambda(java_obj=None, transformFunc=None, transformSchemaFunc=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • transformFunc (object) – holder for dataframe function

  • transformSchemaFunc (object) – the output schema after the transformation

static getJavaPackage()[source]

Returns package name String.

getTransformFunc()[source]
Returns

holder for dataframe function

Return type

transformFunc

getTransformSchemaFunc()[source]
Returns

the output schema after the transformation

Return type

transformSchemaFunc

classmethod read()[source]

Returns an MLReader instance for this class.

setParams(transformFunc=None, transformSchemaFunc=None)[source]

Set the (keyword only) parameters

setTransformFunc(value)[source]
Parameters

transformFunc – holder for dataframe function

setTransformSchemaFunc(value)[source]
Parameters

transformSchemaFunc – the output schema after the transformation

transformFunc = Param(parent='undefined', name='transformFunc', doc='holder for dataframe function')
transformSchemaFunc = Param(parent='undefined', name='transformSchemaFunc', doc='the output schema after the transformation')

synapse.ml.stages.MultiColumnAdapter module

class synapse.ml.stages.MultiColumnAdapter.MultiColumnAdapter(java_obj=None, baseStage=None, inputCols=None, outputCols=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • baseStage (object) – base pipeline stage to apply to every column

  • inputCols (list) – list of column names encoded as a string

  • outputCols (list) – list of column names encoded as a string

baseStage = Param(parent='undefined', name='baseStage', doc='base pipeline stage to apply to every column')
getBaseStage()[source]
Returns

base pipeline stage to apply to every column

Return type

baseStage

getInputCols()[source]
Returns

list of column names encoded as a string

Return type

inputCols

static getJavaPackage()[source]

Returns package name String.

getOutputCols()[source]
Returns

list of column names encoded as a string

Return type

outputCols

inputCols = Param(parent='undefined', name='inputCols', doc='list of column names encoded as a string')
outputCols = Param(parent='undefined', name='outputCols', doc='list of column names encoded as a string')
classmethod read()[source]

Returns an MLReader instance for this class.

setBaseStage(value)[source]
Parameters

baseStage – base pipeline stage to apply to every column

setInputCols(value)[source]
Parameters

inputCols – list of column names encoded as a string

setOutputCols(value)[source]
Parameters

outputCols – list of column names encoded as a string

setParams(baseStage=None, inputCols=None, outputCols=None)[source]

Set the (keyword only) parameters
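MultiColumnAdapter runs one base stage once per column, pairing each entry of inputCols with the corresponding entry of outputCols. With a plain function standing in for a pipeline stage (an assumption made for brevity):

```python
def multi_column_adapter(rows, base_stage, input_cols, output_cols):
    """Apply base_stage (a plain function here, standing in for a pipeline
    stage) to every input column, writing the paired output column."""
    out = []
    for row in rows:
        row = dict(row)
        for src, dst in zip(input_cols, output_cols):
            row[dst] = base_stage(row[src])
        out.append(row)
    return out

rows = [{"a": "Hi", "b": "There"}]
result = multi_column_adapter(rows, str.lower, ["a", "b"], ["a_lc", "b_lc"])
```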

synapse.ml.stages.PartitionConsolidator module

class synapse.ml.stages.PartitionConsolidator.PartitionConsolidator(java_obj=None, concurrency=1, concurrentTimeout=None, inputCol=None, outputCol=None, timeout=60.0)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • concurrency (int) – max number of concurrent calls

  • concurrentTimeout (float) – max number of seconds to wait on futures if concurrency >= 1

  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

  • timeout (float) – number of seconds to wait before closing the connection

concurrency = Param(parent='undefined', name='concurrency', doc='max number of concurrent calls')
concurrentTimeout = Param(parent='undefined', name='concurrentTimeout', doc='max number of seconds to wait on futures if concurrency >= 1')
getConcurrency()[source]
Returns

max number of concurrent calls

Return type

concurrency

getConcurrentTimeout()[source]
Returns

max number of seconds to wait on futures if concurrency >= 1

Return type

concurrentTimeout

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

getTimeout()[source]
Returns

number of seconds to wait before closing the connection

Return type

timeout

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setConcurrency(value)[source]
Parameters

concurrency – max number of concurrent calls

setConcurrentTimeout(value)[source]
Parameters

concurrentTimeout – max number of seconds to wait on futures if concurrency >= 1

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(concurrency=1, concurrentTimeout=None, inputCol=None, outputCol=None, timeout=60.0)[source]

Set the (keyword only) parameters

setTimeout(value)[source]
Parameters

timeout – number of seconds to wait before closing the connection

timeout = Param(parent='undefined', name='timeout', doc='number of seconds to wait before closing the connection')

synapse.ml.stages.RenameColumn module

class synapse.ml.stages.RenameColumn.RenameColumn(java_obj=None, inputCol=None, outputCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(inputCol=None, outputCol=None)[source]

Set the (keyword only) parameters

synapse.ml.stages.Repartition module

class synapse.ml.stages.Repartition.Repartition(java_obj=None, disable=False, n=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • disable (bool) – Whether to disable repartitioning (so that one can turn it off for evaluation)

  • n (int) – Number of partitions

disable = Param(parent='undefined', name='disable', doc='Whether to disable repartitioning (so that one can turn it off for evaluation)')
getDisable()[source]
Returns

Whether to disable repartitioning (so that one can turn it off for evaluation)

Return type

disable

static getJavaPackage()[source]

Returns package name String.

getN()[source]
Returns

Number of partitions

Return type

n

n = Param(parent='undefined', name='n', doc='Number of partitions')
classmethod read()[source]

Returns an MLReader instance for this class.

setDisable(value)[source]
Parameters

disable – Whether to disable repartitioning (so that one can turn it off for evaluation)

setN(value)[source]
Parameters

n – Number of partitions

setParams(disable=False, n=None)[source]

Set the (keyword only) parameters

synapse.ml.stages.SelectColumns module

class synapse.ml.stages.SelectColumns.SelectColumns(java_obj=None, cols=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters

cols (list) – Comma separated list of selected column names

cols = Param(parent='undefined', name='cols', doc='Comma separated list of selected column names')
getCols()[source]
Returns

Comma separated list of selected column names

Return type

cols

static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.

setCols(value)[source]
Parameters

cols – Comma separated list of selected column names

setParams(cols=None)[source]

Set the (keyword only) parameters

synapse.ml.stages.StratifiedRepartition module

class synapse.ml.stages.StratifiedRepartition.StratifiedRepartition(java_obj=None, labelCol=None, mode='mixed', seed=1518410069)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • labelCol (str) – The name of the label column

  • mode (str) – Specify equal to repartition with replacement across all labels, specify original to keep the ratios in the original dataset, or specify mixed to use a heuristic

  • seed (long) – random seed

static getJavaPackage()[source]

Returns package name String.

getLabelCol()[source]
Returns

The name of the label column

Return type

labelCol

getMode()[source]
Returns

Specify equal to repartition with replacement across all labels, specify original to keep the ratios in the original dataset, or specify mixed to use a heuristic

Return type

mode

getSeed()[source]
Returns

random seed

Return type

seed

labelCol = Param(parent='undefined', name='labelCol', doc='The name of the label column')
mode = Param(parent='undefined', name='mode', doc='Specify equal to repartition with replacement across all labels, specify original to keep the ratios in the original dataset, or specify mixed to use a heuristic')
classmethod read()[source]

Returns an MLReader instance for this class.

seed = Param(parent='undefined', name='seed', doc='random seed')
setLabelCol(value)[source]
Parameters

labelCol – The name of the label column

setMode(value)[source]
Parameters

mode – Specify equal to repartition with replacement across all labels, specify original to keep the ratios in the original dataset, or specify mixed to use a heuristic

setParams(labelCol=None, mode='mixed', seed=1518410069)[source]

Set the (keyword only) parameters

setSeed(value)[source]
Parameters

seed – random seed
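StratifiedRepartition redistributes rows so that each partition sees a controlled mix of labels. One way to picture the 'original' mode, which keeps the dataset's label ratios per partition, is round-robin assignment within each label; this is a simplification, not the library's actual heuristic:

```python
from collections import defaultdict

def stratified_partitions(rows, label_col, n_partitions):
    """Round-robin the rows of each label across partitions so every
    partition keeps roughly the original label ratios (simplified sketch)."""
    partitions = [[] for _ in range(n_partitions)]
    next_slot = defaultdict(int)
    for row in rows:
        label = row[label_col]
        partitions[next_slot[label] % n_partitions].append(row)
        next_slot[label] += 1
    return partitions

rows = [{"y": y} for y in ["a", "a", "b", "a", "b", "a"]]
parts = stratified_partitions(rows, "y", 2)  # each part gets 2 "a" and 1 "b"
```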

synapse.ml.stages.SummarizeData module

class synapse.ml.stages.SummarizeData.SummarizeData(java_obj=None, basic=True, counts=True, errorThreshold=0.0, percentiles=True, sample=True)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • basic (bool) – Compute basic statistics

  • counts (bool) – Compute count statistics

  • errorThreshold (float) – Threshold for quantiles - 0 is exact

  • percentiles (bool) – Compute percentiles

  • sample (bool) – Compute sample statistics

basic = Param(parent='undefined', name='basic', doc='Compute basic statistics')
counts = Param(parent='undefined', name='counts', doc='Compute count statistics')
errorThreshold = Param(parent='undefined', name='errorThreshold', doc='Threshold for quantiles - 0 is exact')
getBasic()[source]
Returns

Compute basic statistics

Return type

basic

getCounts()[source]
Returns

Compute count statistics

Return type

counts

getErrorThreshold()[source]
Returns

Threshold for quantiles - 0 is exact

Return type

errorThreshold

static getJavaPackage()[source]

Returns package name String.

getPercentiles()[source]
Returns

Compute percentiles

Return type

percentiles

getSample()[source]
Returns

Compute sample statistics

Return type

sample

percentiles = Param(parent='undefined', name='percentiles', doc='Compute percentiles')
classmethod read()[source]

Returns an MLReader instance for this class.

sample = Param(parent='undefined', name='sample', doc='Compute sample statistics')
setBasic(value)[source]
Parameters

basic – Compute basic statistics

setCounts(value)[source]
Parameters

counts – Compute count statistics

setErrorThreshold(value)[source]
Parameters

errorThreshold – Threshold for quantiles - 0 is exact

setParams(basic=True, counts=True, errorThreshold=0.0, percentiles=True, sample=True)[source]

Set the (keyword only) parameters

setPercentiles(value)[source]
Parameters

percentiles – Compute percentiles

setSample(value)[source]
Parameters

sample – Compute sample statistics
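SummarizeData computes groups of per-column statistics (basic, counts, percentiles, sample). The kind of 'basic' block it emits can be sketched with the standard library (the exact output schema is an assumption):

```python
from statistics import mean, stdev

def basic_stats(values):
    """Count, mean, stdev, min, max: a sketch of SummarizeData's basic block."""
    return {
        "count": len(values),
        "mean": mean(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }

stats = basic_stats([1.0, 2.0, 3.0, 4.0])
```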

synapse.ml.stages.TextPreprocessor module

class synapse.ml.stages.TextPreprocessor.TextPreprocessor(java_obj=None, inputCol=None, map=None, normFunc=None, outputCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • inputCol (str) – The name of the input column

  • map (dict) – Map of substring match to replacement

  • normFunc (str) – Name of normalization function to apply

  • outputCol (str) – The name of the output column

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getMap()[source]
Returns

Map of substring match to replacement

Return type

map

getNormFunc()[source]
Returns

Name of normalization function to apply

Return type

normFunc

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
map = Param(parent='undefined', name='map', doc='Map of substring match to replacement')
normFunc = Param(parent='undefined', name='normFunc', doc='Name of normalization function to apply')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setMap(value)[source]
Parameters

map – Map of substring match to replacement

setNormFunc(value)[source]
Parameters

normFunc – Name of normalization function to apply

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(inputCol=None, map=None, normFunc=None, outputCol=None)[source]

Set the (keyword only) parameters
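TextPreprocessor normalizes text and applies the substring-to-replacement map. The two steps, sketched in plain Python (the order of normalization versus replacement is an assumption, and the real transformer selects its norm function by name):

```python
def preprocess(text, replacements, norm_func=str.lower):
    """Normalize text, then apply each substring -> replacement mapping."""
    text = norm_func(text)
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

cleaned = preprocess("Call me ASAP!", {"asap": "as soon as possible"})
# "call me as soon as possible!"
```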

synapse.ml.stages.TimeIntervalMiniBatchTransformer module

class synapse.ml.stages.TimeIntervalMiniBatchTransformer.TimeIntervalMiniBatchTransformer(java_obj=None, maxBatchSize=2147483647, millisToWait=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • maxBatchSize (int) – The max size of the buffer

  • millisToWait (int) – The time to wait before constructing a batch

static getJavaPackage()[source]

Returns package name String.

getMaxBatchSize()[source]
Returns

The max size of the buffer

Return type

maxBatchSize

getMillisToWait()[source]
Returns

The time to wait before constructing a batch

Return type

millisToWait

maxBatchSize = Param(parent='undefined', name='maxBatchSize', doc='The max size of the buffer')
millisToWait = Param(parent='undefined', name='millisToWait', doc='The time to wait before constructing a batch')
classmethod read()[source]

Returns an MLReader instance for this class.

setMaxBatchSize(value)[source]
Parameters

maxBatchSize – The max size of the buffer

setMillisToWait(value)[source]
Parameters

millisToWait – The time to wait before constructing a batch

setParams(maxBatchSize=2147483647, millisToWait=None)[source]

Set the (keyword only) parameters

synapse.ml.stages.Timer module

class synapse.ml.stages.Timer.Timer(java_obj=None, disableMaterialization=True, logToScala=True, stage=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • disableMaterialization (bool) – Whether to disable timing (so that one can turn it off for evaluation)

  • logToScala (bool) – Whether to output the time to the scala console

  • stage (object) – The stage to time

disableMaterialization = Param(parent='undefined', name='disableMaterialization', doc='Whether to disable timing (so that one can turn it off for evaluation)')
getDisableMaterialization()[source]
Returns

Whether to disable timing (so that one can turn it off for evaluation)

Return type

disableMaterialization

static getJavaPackage()[source]

Returns package name String.

getLogToScala()[source]
Returns

Whether to output the time to the scala console

Return type

logToScala

getStage()[source]
Returns

The stage to time

Return type

stage

logToScala = Param(parent='undefined', name='logToScala', doc='Whether to output the time to the scala console')
classmethod read()[source]

Returns an MLReader instance for this class.

setDisableMaterialization(value)[source]
Parameters

disableMaterialization – Whether to disable timing (so that one can turn it off for evaluation)

setLogToScala(value)[source]
Parameters

logToScala – Whether to output the time to the scala console

setParams(disableMaterialization=True, logToScala=True, stage=None)[source]

Set the (keyword only) parameters

setStage(value)[source]
Parameters

stage – The stage to time

stage = Param(parent='undefined', name='stage', doc='The stage to time')
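Timer wraps another stage and reports how long it takes. Because Spark evaluates lazily, a meaningful measurement has to force (materialize) the result, which is what the disableMaterialization switch appears to control; the sketch below encodes that interpretation, which is an assumption:

```python
import time

def time_stage(stage, rows, disable_materialization=True):
    """Run `stage` over rows and report elapsed seconds. When materialization
    is enabled, force the lazy result so the timing is meaningful."""
    start = time.perf_counter()
    result = stage(rows)
    if not disable_materialization:
        result = list(result)   # force evaluation, as a Spark action would
    elapsed = time.perf_counter() - start
    return result, elapsed

result, elapsed = time_stage(lambda rows: (r * 2 for r in rows),
                             [1, 2, 3], disable_materialization=False)
```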

synapse.ml.stages.TimerModel module

class synapse.ml.stages.TimerModel.TimerModel(java_obj=None, disableMaterialization=True, logToScala=True, stage=None, transformer=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • disableMaterialization (bool) – Whether to disable timing (so that one can turn it off for evaluation)

  • logToScala (bool) – Whether to output the time to the scala console

  • stage (object) – The stage to time

  • transformer (object) – inner model to time

disableMaterialization = Param(parent='undefined', name='disableMaterialization', doc='Whether to disable timing (so that one can turn it off for evaluation)')
getDisableMaterialization()[source]
Returns

Whether to disable timing (so that one can turn it off for evaluation)

Return type

disableMaterialization

static getJavaPackage()[source]

Returns package name String.

getLogToScala()[source]
Returns

Whether to output the time to the scala console

Return type

logToScala

getStage()[source]
Returns

The stage to time

Return type

stage

getTransformer()[source]
Returns

inner model to time

Return type

transformer

logToScala = Param(parent='undefined', name='logToScala', doc='Whether to output the time to the scala console')
classmethod read()[source]

Returns an MLReader instance for this class.

setDisableMaterialization(value)[source]
Parameters

disableMaterialization – Whether to disable timing (so that one can turn it off for evaluation)

setLogToScala(value)[source]
Parameters

logToScala – Whether to output the time to the scala console

setParams(disableMaterialization=True, logToScala=True, stage=None, transformer=None)[source]

Set the (keyword only) parameters

setStage(value)[source]
Parameters

stage – The stage to time

setTransformer(value)[source]
Parameters

transformer – inner model to time

stage = Param(parent='undefined', name='stage', doc='The stage to time')
transformer = Param(parent='undefined', name='transformer', doc='inner model to time')

synapse.ml.stages.UDFTransformer module

class synapse.ml.stages.UDFTransformer.UDFTransformer(inputCol=None, inputCols=None, outputCol=None, udf=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • inputCol (str) – The name of the input column (default: )

  • outputCol (str) – The name of the output column

  • udf (object) – User Defined Python Function to be applied to the DF input col

  • udfScala (object) – User Defined Function to be applied to the DF input col

getInputCol()[source]
Returns

The name of the input column (default: )

Return type

str

getInputCols()[source]
Returns

The names of the input columns (default: )

Return type

str

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

str

getUDF()[source]
classmethod read()[source]

Returns an MLReader instance for this class.

setInputCol(value)[source]
Parameters

inputCol (str) – The name of the input column (default: )

setInputCols(value)[source]
Parameters

inputCols (list) – The names of the input columns (default: )

setOutputCol(value)[source]
Parameters

outputCol (str) – The name of the output column

setUDF(udf)[source]
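UDFTransformer applies a user-defined Python function to the input column; in Spark this is wrapped as a pyspark.sql.functions.udf. The core row-wise behavior, sketched without Spark:

```python
def udf_transform(rows, input_col, output_col, udf):
    """Apply `udf` to input_col of every row, writing output_col."""
    return [dict(row, **{output_col: udf(row[input_col])}) for row in rows]

rows = [{"text": "hello"}, {"text": "spark"}]
out = udf_transform(rows, "text", "n_chars", len)
# [{"text": "hello", "n_chars": 5}, {"text": "spark", "n_chars": 5}]
```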

synapse.ml.stages.UnicodeNormalize module

class synapse.ml.stages.UnicodeNormalize.UnicodeNormalize(java_obj=None, form=None, inputCol=None, lower=None, outputCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • form (str) – Unicode normalization form: NFC, NFD, NFKC, NFKD

  • inputCol (str) – The name of the input column

  • lower (bool) – Lowercase text

  • outputCol (str) – The name of the output column

form = Param(parent='undefined', name='form', doc='Unicode normalization form: NFC, NFD, NFKC, NFKD')
getForm()[source]
Returns

Unicode normalization form: NFC, NFD, NFKC, NFKD

Return type

form

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getLower()[source]
Returns

Lowercase text

Return type

lower

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
lower = Param(parent='undefined', name='lower', doc='Lowercase text')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setForm(value)[source]
Parameters

form – Unicode normalization form: NFC, NFD, NFKC, NFKD

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setLower(value)[source]
Parameters

lower – Lowercase text

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(form=None, inputCol=None, lower=None, outputCol=None)[source]

Set the (keyword only) parameters
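UnicodeNormalize applies one of the four Unicode normalization forms and can optionally lowercase the result. The same operation is available in Python's standard unicodedata module (whether lowercasing happens before or after normalization is an assumption):

```python
import unicodedata

def unicode_normalize(text, form="NFC", lower=False):
    """Normalize to one of NFC/NFD/NFKC/NFKD, optionally lowercasing."""
    out = unicodedata.normalize(form, text)
    return out.lower() if lower else out

# "e" followed by a combining acute accent composes to one code point under NFC
composed = unicode_normalize("e\u0301", form="NFC")  # "\u00e9"
```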

Module contents

SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.