synapse.ml.featurize package
Subpackages
Submodules
synapse.ml.featurize.CleanMissingData module
- class synapse.ml.featurize.CleanMissingData.CleanMissingData(java_obj=None, cleaningMode='Mean', customValue=None, inputCols=None, outputCols=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- cleaningMode = Param(parent='undefined', name='cleaningMode', doc='Cleaning mode')
- customValue = Param(parent='undefined', name='customValue', doc='Custom value for replacement')
- inputCols = Param(parent='undefined', name='inputCols', doc='The names of the input columns')
- outputCols = Param(parent='undefined', name='outputCols', doc='The names of the output columns')
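The `cleaningMode='Mean'` default replaces missing values in each input column with that column's mean. A minimal plain-Python sketch of the idea (illustrative only, not the SynapseML API; `clean_missing_mean` is a hypothetical helper):

```python
def clean_missing_mean(rows, col):
    """Replace None values in `col` with the mean of the non-missing values."""
    values = [r[col] for r in rows if r[col] is not None]
    mean = sum(values) / len(values)
    return [{**r, col: mean if r[col] is None else r[col]} for r in rows]

rows = [{"age": 30.0}, {"age": None}, {"age": 50.0}]
print(clean_missing_mean(rows, "age"))  # the None is filled with 40.0
```

In the actual transformer this happens distributed over a Spark DataFrame, with `customValue` available as an alternative fill strategy.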
synapse.ml.featurize.CleanMissingDataModel module
- class synapse.ml.featurize.CleanMissingDataModel.CleanMissingDataModel(java_obj=None, colsToFill=None, fillValues=None, inputCols=None, outputCols=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- colsToFill = Param(parent='undefined', name='colsToFill', doc='The columns to fill with')
- fillValues = Param(parent='undefined', name='fillValues', doc='what to replace in the columns')
- inputCols = Param(parent='undefined', name='inputCols', doc='The names of the input columns')
- outputCols = Param(parent='undefined', name='outputCols', doc='The names of the output columns')
synapse.ml.featurize.CountSelector module
- class synapse.ml.featurize.CountSelector.CountSelector(java_obj=None, inputCol=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
synapse.ml.featurize.CountSelectorModel module
- class synapse.ml.featurize.CountSelectorModel.CountSelectorModel(java_obj=None, indices=None, inputCol=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- getIndices()[source]
- Returns
An array of indices to select features from a vector column. There can be no overlap with names.
- Return type
indices
- indices = Param(parent='undefined', name='indices', doc='An array of indices to select features from a vector column. There can be no overlap with names.')
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
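The `indices` parameter selects a subset of positions from a feature vector. A plain-Python sketch of what that selection does (illustrative of the concept, not the SynapseML implementation; `select_indices` is a hypothetical helper):

```python
def select_indices(vector, indices):
    """Keep only the entries of a dense feature vector at the given indices."""
    return [vector[i] for i in indices]

print(select_indices([0.0, 3.5, 0.0, 1.2], [1, 3]))  # [3.5, 1.2]
```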
synapse.ml.featurize.DataConversion module
- class synapse.ml.featurize.DataConversion.DataConversion(java_obj=None, cols=None, convertTo='', dateTimeFormat='yyyy-MM-dd HH:mm:ss')[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- cols = Param(parent='undefined', name='cols', doc='Comma separated list of columns whose type will be converted')
- convertTo = Param(parent='undefined', name='convertTo', doc='The result type')
- dateTimeFormat = Param(parent='undefined', name='dateTimeFormat', doc='Format for DateTime when making DateTime:String conversions')
- getCols()[source]
- Returns
Comma separated list of columns whose type will be converted
- Return type
cols
- getDateTimeFormat()[source]
- Returns
Format for DateTime when making DateTime:String conversions
- Return type
dateTimeFormat
- setCols(value)[source]
- Parameters
cols – Comma separated list of columns whose type will be converted
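The `dateTimeFormat` default `'yyyy-MM-dd HH:mm:ss'` uses Spark/Java datetime pattern letters. As a point of reference (plain Python, not the SynapseML API), the equivalent `strptime` pattern is `'%Y-%m-%d %H:%M:%S'`:

```python
from datetime import datetime

# Spark's pattern 'yyyy-MM-dd HH:mm:ss' corresponds to
# '%Y-%m-%d %H:%M:%S' in Python's strptime notation.
def string_to_datetime(s, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string in the Spark default layout."""
    return datetime.strptime(s, fmt)

dt = string_to_datetime("2023-01-15 09:30:00")
print(dt.year, dt.hour)  # 2023 9
```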
synapse.ml.featurize.Featurize module
- class synapse.ml.featurize.Featurize.Featurize(java_obj=None, imputeMissing=True, inputCols=None, numFeatures=262144, oneHotEncodeCategoricals=True, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- getNumFeatures()[source]
- Returns
Number of features to hash string columns to
- Return type
numFeatures
- getOneHotEncodeCategoricals()[source]
- Returns
One-hot encode categorical columns
- Return type
oneHotEncodeCategoricals
- imputeMissing = Param(parent='undefined', name='imputeMissing', doc='Whether to impute missing values')
- inputCols = Param(parent='undefined', name='inputCols', doc='The names of the input columns')
- numFeatures = Param(parent='undefined', name='numFeatures', doc='Number of features to hash string columns to')
- oneHotEncodeCategoricals = Param(parent='undefined', name='oneHotEncodeCategoricals', doc='One-hot encode categorical columns')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
- setNumFeatures(value)[source]
- Parameters
numFeatures – Number of features to hash string columns to
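Hashing string columns to `numFeatures` buckets (default 262144, i.e. 2^18) is the feature-hashing trick: each token is mapped to a fixed-size index space instead of building an explicit vocabulary. A self-contained sketch of the idea in plain Python (the hash function here is an illustrative stand-in, not the one SynapseML uses):

```python
import hashlib

def hash_feature(token, num_features=262144):
    """Map a string token deterministically to a bucket in [0, num_features)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

idx = hash_feature("hello")
assert 0 <= idx < 262144
```

Larger `numFeatures` values reduce hash collisions at the cost of a wider feature vector.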
synapse.ml.featurize.IndexToValue module
- class synapse.ml.featurize.IndexToValue.IndexToValue(java_obj=None, inputCol=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
synapse.ml.featurize.ValueIndexer module
- class synapse.ml.featurize.ValueIndexer.ValueIndexer(java_obj=None, inputCol=None, outputCol=None)[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
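Conceptually, `ValueIndexer` fits a mapping from distinct categorical values to integer indices, and `IndexToValue` applies the inverse mapping. A plain-Python sketch of both directions (illustrative only; `fit_value_indexer` is a hypothetical helper, and the real transformers may order levels differently):

```python
def fit_value_indexer(values):
    """Assign each distinct value a contiguous integer index (sorted order assumed)."""
    levels = sorted(set(values))
    return {v: i for i, v in enumerate(levels)}

mapping = fit_value_indexer(["cat", "dog", "cat", "bird"])
print([mapping[v] for v in ["cat", "dog", "bird"]])  # [1, 2, 0]

# IndexToValue corresponds to the inverse lookup:
inverse = {i: v for v, i in mapping.items()}
print(inverse[0])  # bird
```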
synapse.ml.featurize.ValueIndexerModel module
- class synapse.ml.featurize.ValueIndexerModel.ValueIndexerModel(java_obj=None, dataType='string', inputCol='input', levels=None, outputCol='ValueIndexerModel_357e8736f46b_output')[source]
Bases: pyspark.ml.util.MLReadable[pyspark.ml.util.RL]
- Parameters
- dataType = Param(parent='undefined', name='dataType', doc='The datatype of the levels as a Json string')
- inputCol = Param(parent='undefined', name='inputCol', doc='The name of the input column')
- levels = Param(parent='undefined', name='levels', doc='Levels in categorical array')
- outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column')
Module contents
SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.
SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.
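To make these classes importable, the SynapseML package must be on the Spark session's classpath. A hedged configuration sketch (the Maven coordinate pattern is standard, but the version shown is an assumption; check the SynapseML releases for the current one):

```python
from pyspark.sql import SparkSession

# The version number below is illustrative; substitute the current release.
spark = (
    SparkSession.builder
    .appName("synapseml-example")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.2")
    .getOrCreate()
)
```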