synapse.ml.featurize package
Subpackages
Submodules
synapse.ml.featurize.CleanMissingData module
- class synapse.ml.featurize.CleanMissingData.CleanMissingData(java_obj=None, cleaningMode='Mean', customValue=None, inputCols=None, outputCols=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- cleaningMode = Param(parent='undefined', name='cleaningMode', doc='Cleaning mode')
- customValue = Param(parent='undefined', name='customValue', doc='Custom value for replacement')
- inputCols = Param(parent='undefined', name='inputCols', doc='The names of the input columns')
- outputCols = Param(parent='undefined', name='outputCols', doc='The names of the output columns')
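The `cleaningMode='Mean'` default replaces missing values in each input column with that column's mean. A minimal plain-Python sketch of the idea (illustrative only, not the SynapseML API; `clean_missing_mean` is a hypothetical helper):

```python
def clean_missing_mean(rows, col):
    """Replace None values in `col` with the mean of the non-missing values."""
    values = [r[col] for r in rows if r[col] is not None]
    mean = sum(values) / len(values)
    return [{**r, col: mean if r[col] is None else r[col]} for r in rows]

rows = [{"age": 30.0}, {"age": None}, {"age": 50.0}]
print(clean_missing_mean(rows, "age"))  # the None is filled with 40.0
```

In the actual transformer this happens distributed over a Spark DataFrame, with `customValue` available as an alternative fill strategy.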
synapse.ml.featurize.CleanMissingDataModel module
- class synapse.ml.featurize.CleanMissingDataModel.CleanMissingDataModel(java_obj=None, colsToFill=None, fillValues=None, inputCols=None, outputCols=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- colsToFill = Param(parent='undefined', name='colsToFill', doc='The columns to fill with')
- fillValues = Param(parent='undefined', name='fillValues', doc='what to replace in the columns')
- inputCols = Param(parent='undefined', name='inputCols', doc='The names of the input columns')
- outputCols = Param(parent='undefined', name='outputCols', doc='The names of the output columns')
synapse.ml.featurize.CountSelector module
- class synapse.ml.featurize.CountSelector.CountSelector(java_obj=None, inputCol=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
synapse.ml.featurize.CountSelectorModel module
- class synapse.ml.featurize.CountSelectorModel.CountSelectorModel(java_obj=None, indices=None, inputCol=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- getIndices()[source]
- Returns
An array of indices to select features from a vector column. There can be no overlap with names.
- Return type
indices
- indices = Param(parent='undefined', name='indices', doc='An array of indices to select features from a vector column. There can be no overlap with names.')
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
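The `indices` parameter selects a subset of positions from a feature vector. A plain-Python sketch of what that selection does (illustrative of the concept, not the SynapseML implementation; `select_indices` is a hypothetical helper):

```python
def select_indices(vector, indices):
    """Keep only the entries of a dense feature vector at the given indices."""
    return [vector[i] for i in indices]

print(select_indices([0.0, 3.5, 0.0, 1.2], [1, 3]))  # [3.5, 1.2]
```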
synapse.ml.featurize.DataConversion module
- class synapse.ml.featurize.DataConversion.DataConversion(java_obj=None, cols=None, convertTo='', dateTimeFormat='yyyy-MM-dd HH:mm:ss')[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- cols = Param(parent='undefined', name='cols', doc='Comma separated list of columns whose type will be converted')
- convertTo = Param(parent='undefined', name='convertTo', doc='The result type')
- dateTimeFormat = Param(parent='undefined', name='dateTimeFormat', doc='Format for DateTime when making DateTime:String conversions')
- getCols()[source]
- Returns
Comma separated list of columns whose type will be converted
- Return type
cols
- getDateTimeFormat()[source]
- Returns
Format for DateTime when making DateTime:String conversions
- Return type
dateTimeFormat
- setCols(value)[source]
- Parameters
cols – Comma separated list of columns whose type will be converted
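The `dateTimeFormat` default `'yyyy-MM-dd HH:mm:ss'` uses Spark/Java datetime pattern letters. As a point of reference (plain Python, not the SynapseML API), the equivalent `strptime` pattern is `'%Y-%m-%d %H:%M:%S'`:

```python
from datetime import datetime

# Spark's pattern 'yyyy-MM-dd HH:mm:ss' corresponds to
# '%Y-%m-%d %H:%M:%S' in Python's strptime notation.
def string_to_datetime(s, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string in the Spark default layout."""
    return datetime.strptime(s, fmt)

dt = string_to_datetime("2023-01-15 09:30:00")
print(dt.year, dt.hour)  # 2023 9
```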
synapse.ml.featurize.Featurize module
- class synapse.ml.featurize.Featurize.Featurize(java_obj=None, imputeMissing=True, inputCols=None, numFeatures=262144, oneHotEncodeCategoricals=True, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- getNumFeatures()[source]
- Returns
Number of features to hash string columns to
- Return type
numFeatures
- getOneHotEncodeCategoricals()[source]
- Returns
One-hot encode categorical columns
- Return type
oneHotEncodeCategoricals
- imputeMissing = Param(parent='undefined', name='imputeMissing', doc='Whether to impute missing values')
- inputCols = Param(parent='undefined', name='inputCols', doc='The names of the input columns')
- numFeatures = Param(parent='undefined', name='numFeatures', doc='Number of features to hash string columns to')
- oneHotEncodeCategoricals = Param(parent='undefined', name='oneHotEncodeCategoricals', doc='One-hot encode categorical columns')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- setNumFeatures(value)[source]
- Parameters
numFeatures – Number of features to hash string columns to
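Hashing string columns to `numFeatures` buckets (default 262144, i.e. 2^18) is the feature-hashing trick: each token is mapped to a fixed-size index space instead of building an explicit vocabulary. A self-contained sketch of the idea in plain Python (the hash function here is an illustrative stand-in, not the one SynapseML uses):

```python
import hashlib

def hash_feature(token, num_features=262144):
    """Map a string token deterministically to a bucket in [0, num_features)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

idx = hash_feature("hello")
assert 0 <= idx < 262144
```

Larger `numFeatures` values reduce hash collisions at the cost of a wider feature vector.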
synapse.ml.featurize.IndexToValue module
- class synapse.ml.featurize.IndexToValue.IndexToValue(java_obj=None, inputCol=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
synapse.ml.featurize.ValueIndexer module
- class synapse.ml.featurize.ValueIndexer.ValueIndexer(java_obj=None, inputCol=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
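Conceptually, `ValueIndexer` fits a mapping from distinct categorical values to integer indices, and `IndexToValue` applies the inverse mapping. A plain-Python sketch of both directions (illustrative only; `fit_value_indexer` is a hypothetical helper, and the real transformers may order levels differently):

```python
def fit_value_indexer(values):
    """Assign each distinct value a contiguous integer index (sorted order assumed)."""
    levels = sorted(set(values))
    return {v: i for i, v in enumerate(levels)}

mapping = fit_value_indexer(["cat", "dog", "cat", "bird"])
print([mapping[v] for v in ["cat", "dog", "bird"]])  # [1, 2, 0]

# IndexToValue corresponds to the inverse lookup:
inverse = {i: v for v, i in mapping.items()}
print(inverse[0])  # bird
```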
synapse.ml.featurize.ValueIndexerModel module
- class synapse.ml.featurize.ValueIndexerModel.ValueIndexerModel(java_obj=None, dataType='string', inputCol='input', levels=None, outputCol='ValueIndexerModel_357e8736f46b_output')[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- dataType = Param(parent='undefined', name='dataType', doc='The datatype of the levels as a Json string')
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- levels = Param(parent='undefined', name='levels', doc='Levels in categorical array')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
Module contents
SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.
SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.
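To make these classes importable, the SynapseML package must be on the Spark session's classpath. A hedged configuration sketch (the Maven coordinate pattern is standard, but the version shown is an assumption; check the SynapseML releases for the current one):

```python
from pyspark.sql import SparkSession

# The version number below is illustrative; substitute the current release.
spark = (
    SparkSession.builder
    .appName("synapseml-example")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.2")
    .getOrCreate()
)
```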