synapse.ml.featurize package

Submodules

synapse.ml.featurize.CleanMissingData module

class synapse.ml.featurize.CleanMissingData.CleanMissingData(java_obj=None, cleaningMode='Mean', customValue=None, inputCols=None, outputCols=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • cleaningMode (str) – Cleaning mode

  • customValue (str) – Custom value for replacement

  • inputCols (list) – The names of the input columns

  • outputCols (list) – The names of the output columns
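
As an illustration of what the default cleaningMode='Mean' implies, the following plain-Python sketch computes a per-column mean over the non-missing entries and substitutes it for the missing ones. It does not call SynapseML: the helper names (is_missing, fit_fill_values, apply_fill) and the sample rows are hypothetical, and the treatment of None and NaN as "missing" is an assumption rather than documented library behavior.

```python
import math
from statistics import mean


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_fill_values(rows, input_cols):
    """Compute one replacement value per column: the mean of its non-missing entries."""
    fill = {}
    for col in input_cols:
        present = [row[col] for row in rows if not is_missing(row[col])]
        fill[col] = mean(present)
    return fill


def apply_fill(rows, fill, input_cols, output_cols):
    """Write each input column to its paired output column, replacing missing entries."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for src, dst in zip(input_cols, output_cols):
            new_row[dst] = fill[src] if is_missing(row[src]) else row[src]
        cleaned.append(new_row)
    return cleaned


# Hypothetical sample data: one missing age.
rows = [{"age": 30.0}, {"age": None}, {"age": 50.0}]
fill_values = fit_fill_values(rows, ["age"])                      # {'age': 40.0}
cleaned = apply_fill(rows, fill_values, ["age"], ["age_clean"])
print(cleaned[1]["age_clean"])                                     # 40.0
```

With SynapseML and a Spark DataFrame df available, the corresponding call would presumably be CleanMissingData(cleaningMode='Mean', inputCols=['age'], outputCols=['age_clean']).fit(df).transform(df), using the constructor arguments listed above.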

cleaningMode = Param(parent='undefined', name='cleaningMode', doc='Cleaning mode')
customValue = Param(parent='undefined', name='customValue', doc='Custom value for replacement')
getCleaningMode()[source]
Returns

Cleaning mode

Return type

cleaningMode

getCustomValue()[source]
Returns

Custom value for replacement

Return type

customValue

getInputCols()[source]
Returns

The names of the input columns

Return type

inputCols

static getJavaPackage()[source]

Returns package name String.

getOutputCols()[source]
Returns

The names of the output columns

Return type

outputCols

inputCols = Param(parent='undefined', name='inputCols', doc='The names of the input columns')
outputCols = Param(parent='undefined', name='outputCols', doc='The names of the output columns')
classmethod read()[source]

Returns an MLReader instance for this class.

setCleaningMode(value)[source]
Parameters

cleaningMode – Cleaning mode

setCustomValue(value)[source]
Parameters

customValue – Custom value for replacement

setInputCols(value)[source]
Parameters

inputCols – The names of the input columns

setOutputCols(value)[source]
Parameters

outputCols – The names of the output columns

setParams(cleaningMode='Mean', customValue=None, inputCols=None, outputCols=None)[source]

Set the (keyword only) parameters

synapse.ml.featurize.CleanMissingDataModel module

class synapse.ml.featurize.CleanMissingDataModel.CleanMissingDataModel(java_obj=None, colsToFill=None, fillValues=None, inputCols=None, outputCols=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • colsToFill (list) – The columns to fill

  • fillValues (object) – The values used to replace missing entries in those columns

  • inputCols (list) – The names of the input columns

  • outputCols (list) – The names of the output columns

colsToFill = Param(parent='undefined', name='colsToFill', doc='The columns to fill with')
fillValues = Param(parent='undefined', name='fillValues', doc='what to replace in the columns')
getColsToFill()[source]
Returns

The columns to fill with

Return type

colsToFill

getFillValues()[source]
Returns

what to replace in the columns

Return type

fillValues

getInputCols()[source]
Returns

The names of the input columns

Return type

inputCols

static getJavaPackage()[source]

Returns package name String.

getOutputCols()[source]
Returns

The names of the output columns

Return type

outputCols

inputCols = Param(parent='undefined', name='inputCols', doc='The names of the input columns')
outputCols = Param(parent='undefined', name='outputCols', doc='The names of the output columns')
classmethod read()[source]

Returns an MLReader instance for this class.

setColsToFill(value)[source]
Parameters

colsToFill – The columns to fill with

setFillValues(value)[source]
Parameters

fillValues – what to replace in the columns

setInputCols(value)[source]
Parameters

inputCols – The names of the input columns

setOutputCols(value)[source]
Parameters

outputCols – The names of the output columns

setParams(colsToFill=None, fillValues=None, inputCols=None, outputCols=None)[source]

Set the (keyword only) parameters

synapse.ml.featurize.CountSelector module

class synapse.ml.featurize.CountSelector.CountSelector(java_obj=None, inputCol=None, outputCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(inputCol=None, outputCol=None)[source]

Set the (keyword only) parameters

synapse.ml.featurize.CountSelectorModel module

class synapse.ml.featurize.CountSelectorModel.CountSelectorModel(java_obj=None, indices=None, inputCol=None, outputCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • indices (list) – An array of indices to select features from a vector column. There can be no overlap with names.

  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column
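
The indices parameter can be read as a list of vector slots to keep. The sketch below is plain Python, not SynapseML code. It assumes, without confirmation from this reference, that the fitted indices are the slots that are nonzero in at least one row; fit_indices and select are hypothetical helper names.

```python
def fit_indices(vectors):
    """Return the slot indices that are nonzero in at least one vector.

    Assumption: this is how a count-based selector could choose its indices.
    """
    width = len(vectors[0])
    return [i for i in range(width) if any(vec[i] != 0 for vec in vectors)]


def select(vector, indices):
    """Keep only the chosen slots, in index order."""
    return [vector[i] for i in indices]


# Hypothetical feature vectors: slots 0 and 2 are zero in every row.
vectors = [
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 3.0],
]
indices = fit_indices(vectors)                    # [1, 3]
reduced = [select(vec, indices) for vec in vectors]
print(reduced)                                    # [[1.0, 2.0], [0.0, 3.0]]
```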

getIndices()[source]
Returns

An array of indices to select features from a vector column. There can be no overlap with names.

Return type

indices

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

indices = Param(parent='undefined', name='indices', doc='An array of indices to select features from a vector column. There can be no overlap with names.')
inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setIndices(value)[source]
Parameters

indices – An array of indices to select features from a vector column. There can be no overlap with names.

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(indices=None, inputCol=None, outputCol=None)[source]

Set the (keyword only) parameters

synapse.ml.featurize.DataConversion module

class synapse.ml.featurize.DataConversion.DataConversion(java_obj=None, cols=None, convertTo='', dateTimeFormat='yyyy-MM-dd HH:mm:ss')[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • cols (list) – Comma separated list of columns whose type will be converted

  • convertTo (str) – The result type

  • dateTimeFormat (str) – Format for DateTime when making DateTime:String conversions
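
The default dateTimeFormat, 'yyyy-MM-dd HH:mm:ss', looks like a Java-style date pattern, which would be consistent with SynapseML running on the JVM; that reading is an assumption. The sketch below shows the same layout expressed with Python's strptime/strftime directives, so the expected string shape is concrete. The sample timestamp is hypothetical.

```python
from datetime import datetime

# Default pattern from the DataConversion signature (assumed to be Java-style).
JAVA_STYLE_PATTERN = "yyyy-MM-dd HH:mm:ss"
# Equivalent layout in Python's own directive syntax.
PYTHON_EQUIVALENT = "%Y-%m-%d %H:%M:%S"

text = "2021-03-04 05:06:07"                       # hypothetical input string
parsed = datetime.strptime(text, PYTHON_EQUIVALENT)
round_tripped = parsed.strftime(PYTHON_EQUIVALENT)

print(parsed.year, parsed.month, parsed.day)       # 2021 3 4
print(round_tripped == text)                       # True
```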

cols = Param(parent='undefined', name='cols', doc='Comma separated list of columns whose type will be converted')
convertTo = Param(parent='undefined', name='convertTo', doc='The result type')
dateTimeFormat = Param(parent='undefined', name='dateTimeFormat', doc='Format for DateTime when making DateTime:String conversions')
getCols()[source]
Returns

Comma separated list of columns whose type will be converted

Return type

cols

getConvertTo()[source]
Returns

The result type

Return type

convertTo

getDateTimeFormat()[source]
Returns

Format for DateTime when making DateTime:String conversions

Return type

dateTimeFormat

static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.

setCols(value)[source]
Parameters

cols – Comma separated list of columns whose type will be converted

setConvertTo(value)[source]
Parameters

convertTo – The result type

setDateTimeFormat(value)[source]
Parameters

dateTimeFormat – Format for DateTime when making DateTime:String conversions

setParams(cols=None, convertTo='', dateTimeFormat='yyyy-MM-dd HH:mm:ss')[source]

Set the (keyword only) parameters

synapse.ml.featurize.Featurize module

class synapse.ml.featurize.Featurize.Featurize(java_obj=None, imputeMissing=True, inputCols=None, numFeatures=262144, oneHotEncodeCategoricals=True, outputCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • imputeMissing (bool) – Whether to impute missing values

  • inputCols (list) – The names of the input columns

  • numFeatures (int) – Number of features to hash string columns to

  • oneHotEncodeCategoricals (bool) – One-hot encode categorical columns

  • outputCol (str) – The name of the output column
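
The numFeatures default of 262144 equals 2**18, the size of the index space that string columns are hashed into. The plain-Python sketch below illustrates the general hashing trick: each token is mapped to a slot in [0, numFeatures) and the slot counts form a sparse vector. It is not SynapseML's implementation; zlib.crc32 stands in for whatever hash function the library uses, and hashed_index and hash_tokens are hypothetical names.

```python
import zlib

NUM_FEATURES = 262144  # default from the Featurize signature; equals 2**18


def hashed_index(token, num_features=NUM_FEATURES):
    """Map a token to a slot in [0, num_features).

    crc32 is a stand-in hash chosen because it is stable across runs.
    """
    return zlib.crc32(token.encode("utf-8")) % num_features


def hash_tokens(tokens, num_features=NUM_FEATURES):
    """Build a sparse count vector as {slot index: count}."""
    counts = {}
    for token in tokens:
        idx = hashed_index(token, num_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts


# Hypothetical text column value, split on whitespace.
sparse = hash_tokens("the quick brown fox the".split())
print(sum(sparse.values()))   # 5 tokens counted in total
```

A larger numFeatures lowers the chance that two distinct tokens collide in the same slot, at the cost of a wider (though still sparse) output vector.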

getImputeMissing()[source]
Returns

Whether to impute missing values

Return type

imputeMissing

getInputCols()[source]
Returns

The names of the input columns

Return type

inputCols

static getJavaPackage()[source]

Returns package name String.

getNumFeatures()[source]
Returns

Number of features to hash string columns to

Return type

numFeatures

getOneHotEncodeCategoricals()[source]
Returns

One-hot encode categorical columns

Return type

oneHotEncodeCategoricals

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

imputeMissing = Param(parent='undefined', name='imputeMissing', doc='Whether to impute missing values')
inputCols = Param(parent='undefined', name='inputCols', doc='The names of the input columns')
numFeatures = Param(parent='undefined', name='numFeatures', doc='Number of features to hash string columns to')
oneHotEncodeCategoricals = Param(parent='undefined', name='oneHotEncodeCategoricals', doc='One-hot encode categorical columns')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setImputeMissing(value)[source]
Parameters

imputeMissing – Whether to impute missing values

setInputCols(value)[source]
Parameters

inputCols – The names of the input columns

setNumFeatures(value)[source]
Parameters

numFeatures – Number of features to hash string columns to

setOneHotEncodeCategoricals(value)[source]
Parameters

oneHotEncodeCategoricals – One-hot encode categorical columns

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(imputeMissing=True, inputCols=None, numFeatures=262144, oneHotEncodeCategoricals=True, outputCol=None)[source]

Set the (keyword only) parameters

synapse.ml.featurize.IndexToValue module

class synapse.ml.featurize.IndexToValue.IndexToValue(java_obj=None, inputCol=None, outputCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(inputCol=None, outputCol=None)[source]

Set the (keyword only) parameters

synapse.ml.featurize.ValueIndexer module

class synapse.ml.featurize.ValueIndexer.ValueIndexer(java_obj=None, inputCol=None, outputCol=None)[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(inputCol=None, outputCol=None)[source]

Set the (keyword only) parameters

synapse.ml.featurize.ValueIndexerModel module

class synapse.ml.featurize.ValueIndexerModel.ValueIndexerModel(java_obj=None, dataType='string', inputCol='input', levels=None, outputCol='ValueIndexerModel_b153a4cbe70d_output')[source]

Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]

Parameters
  • dataType (str) – The data type of the levels, as a JSON string

  • inputCol (str) – The name of the input column

  • levels (object) – Levels in categorical array

  • outputCol (str) – The name of the output column
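
The levels parameter holds the distinct categorical values, and indexing replaces each value with its position in that array. The sketch below shows the idea in plain Python, together with the inverse lookup that a transformer such as IndexToValue would need. It does not call SynapseML; the sorted ordering of levels and the helper names (fit_levels, to_index, to_value) are assumptions made for illustration.

```python
def fit_levels(values):
    """Collect the distinct non-null values, sorted (ordering is an assumption)."""
    return sorted({v for v in values if v is not None})


def to_index(value, levels):
    """Replace a categorical value with its position in levels."""
    return levels.index(value)


def to_value(index, levels):
    """Inverse lookup: recover the original value from its index."""
    return levels[index]


# Hypothetical categorical column.
colors = ["red", "green", "red", "blue"]
levels = fit_levels(colors)                          # ['blue', 'green', 'red']
indexed = [to_index(c, levels) for c in colors]      # [2, 1, 2, 0]
restored = [to_value(i, levels) for i in indexed]
print(restored == colors)                            # True
```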

dataType = Param(parent='undefined', name='dataType', doc='The datatype of the levels as a Json string')
getDataType()[source]
Returns

The datatype of the levels as a Json string

Return type

dataType

getInputCol()[source]
Returns

The name of the input column

Return type

inputCol

static getJavaPackage()[source]

Returns package name String.

getLevels()[source]
Returns

Levels in categorical array

Return type

levels

getOutputCol()[source]
Returns

The name of the output column

Return type

outputCol

inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
levels = Param(parent='undefined', name='levels', doc='Levels in categorical array')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
classmethod read()[source]

Returns an MLReader instance for this class.

setDataType(value)[source]
Parameters

dataType – The datatype of the levels as a Json string

setInputCol(value)[source]
Parameters

inputCol – The name of the input column

setLevels(value)[source]
Parameters

levels – Levels in categorical array

setOutputCol(value)[source]
Parameters

outputCol – The name of the output column

setParams(dataType='string', inputCol='input', levels=None, outputCol='ValueIndexerModel_b153a4cbe70d_output')[source]

Set the (keyword only) parameters

Module contents

SynapseML is an ecosystem of tools that extends the distributed computing framework Apache Spark in several new directions. It adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful, highly scalable predictive and analytical models for a variety of data sources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.