mmlspark.featurize package

Subpackages

mmlspark.featurize.text package

Submodules

mmlspark.featurize.AssembleFeatures module

class mmlspark.featurize.AssembleFeatures.AssembleFeatures(allowImages=False, columnsToFeaturize=None, featuresCol='features', numberOfFeatures=None, oneHotEncodeCategoricals=True)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters

allowImages (bool) – Allow featurization of images (default: false)
columnsToFeaturize (list) – Columns to featurize
featuresCol (str) – The name of the features column (default: features)
numberOfFeatures (int) – Number of features to hash string columns to
oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)
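
The two encoding ideas named by `numberOfFeatures` and `oneHotEncodeCategoricals` can be illustrated without Spark. The following is a minimal pure-Python sketch of the general techniques (feature hashing and one-hot encoding), not the library's implementation. The helper names `hash_tokens` and `one_hot` are hypothetical, and `zlib.crc32` is a stand-in hash, since this page does not say which hash function AssembleFeatures uses.

```python
import zlib


def hash_tokens(tokens, number_of_features):
    """Map string tokens to a fixed-length count vector (the hashing trick)."""
    vector = [0] * number_of_features
    for token in tokens:
        # crc32 is stable across runs; the built-in hash() is salted per process.
        # Assumption: any stable hash works for the illustration.
        index = zlib.crc32(token.encode("utf-8")) % number_of_features
        vector[index] += 1
    return vector


def one_hot(value, levels):
    """Encode a categorical value as a 0/1 vector with one slot per level."""
    return [1 if value == level else 0 for level in levels]


# Three tokens land in a vector of length 8; identical tokens share a bucket.
hashed = hash_tokens(["red", "green", "red"], number_of_features=8)

# "green" is the second of three hypothetical levels.
encoded = one_hot("green", ["blue", "green", "red"])
```

A larger `number_of_features` lowers the chance that two different strings collide in the same bucket, at the cost of a wider feature vector.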

getAllowImages()[source]¶

Returns: Allow featurization of images (default: false)
Return type: bool

getColumnsToFeaturize()[source]¶

Returns: Columns to featurize
Return type: list

getFeaturesCol()[source]¶

Returns: The name of the features column (default: features)
Return type: str

static getJavaPackage()[source]¶: Returns the package name as a string.

getNumberOfFeatures()[source]¶

Returns: Number of features to hash string columns to
Return type: int

getOneHotEncodeCategoricals()[source]¶

Returns: One-hot encode categoricals (default: true)
Return type: bool

classmethod read()[source]¶: Returns an MLReader instance for this class.

setAllowImages(value)[source]¶

Parameters: allowImages (bool) – Allow featurization of images (default: false)

setColumnsToFeaturize(value)[source]¶

Parameters: columnsToFeaturize (list) – Columns to featurize

setFeaturesCol(value)[source]¶

Parameters: featuresCol (str) – The name of the features column (default: features)

setNumberOfFeatures(value)[source]¶

Parameters: numberOfFeatures (int) – Number of features to hash string columns to

setOneHotEncodeCategoricals(value)[source]¶

Parameters: oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

setParams(allowImages=False, columnsToFeaturize=None, featuresCol='features', numberOfFeatures=None, oneHotEncodeCategoricals=True)[source]¶

Set the (keyword only) parameters

Parameters

allowImages (bool) – Allow featurization of images (default: false)
columnsToFeaturize (list) – Columns to featurize
featuresCol (str) – The name of the features column (default: features)
numberOfFeatures (int) – Number of features to hash string columns to
oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

class mmlspark.featurize.AssembleFeatures.AssembleFeaturesModel(java_model=None)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by AssembleFeatures.

This class is left empty on purpose. All necessary methods are exposed through inheritance.

static getJavaPackage()[source]¶: Returns the package name as a string.

classmethod read()[source]¶: Returns an MLReader instance for this class.

mmlspark.featurize.CleanMissingData module

class mmlspark.featurize.CleanMissingData.CleanMissingData(cleaningMode='Mean', customValue=None, inputCols=None, outputCols=None)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters

cleaningMode (str) – Cleaning mode (default: Mean)
customValue (str) – Custom value for replacement
inputCols (list) – The names of the input columns
outputCols (list) – The names of the output columns
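
Because CleanMissingData is an estimator, the replacement value is presumably computed when fitting and applied when transforming. The sketch below shows that two-step pattern in plain Python for the default `Mean` mode and for a custom replacement value. It is an illustration under stated assumptions, not the library's code: `fit_mean` and `fill_missing` are hypothetical helpers, and missing entries are modeled as `None`.

```python
def fit_mean(values):
    """'Fit' step: compute the mean of the entries that are present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def fill_missing(values, replacement):
    """'Transform' step: substitute the replacement for every missing entry."""
    return [replacement if v is None else v for v in values]


column = [1.0, None, 3.0, None]

# Mean mode: the replacement is learned from the data.
mean_value = fit_mean(column)
cleaned_mean = fill_missing(column, mean_value)

# Custom mode: the replacement is supplied by the caller (compare customValue).
cleaned_custom = fill_missing(column, 0.0)
```

Separating the two steps means the value learned from one dataset can be reused to clean another.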

getCleaningMode()[source]¶

Returns: Cleaning mode (default: Mean)
Return type: str

getCustomValue()[source]¶

Returns: Custom value for replacement
Return type: str

getInputCols()[source]¶

Returns: The names of the input columns
Return type: list

static getJavaPackage()[source]¶: Returns the package name as a string.

getOutputCols()[source]¶

Returns: The names of the output columns
Return type: list

classmethod read()[source]¶: Returns an MLReader instance for this class.

setCleaningMode(value)[source]¶

Parameters: cleaningMode (str) – Cleaning mode (default: Mean)

setCustomValue(value)[source]¶

Parameters: customValue (str) – Custom value for replacement

setInputCols(value)[source]¶

Parameters: inputCols (list) – The names of the input columns

setOutputCols(value)[source]¶

Parameters: outputCols (list) – The names of the output columns

setParams(cleaningMode='Mean', customValue=None, inputCols=None, outputCols=None)[source]¶

Set the (keyword only) parameters

Parameters

cleaningMode (str) – Cleaning mode (default: Mean)
customValue (str) – Custom value for replacement
inputCols (list) – The names of the input columns
outputCols (list) – The names of the output columns

class mmlspark.featurize.CleanMissingData.CleanMissingDataModel(java_model=None)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by CleanMissingData.

This class is left empty on purpose. All necessary methods are exposed through inheritance.

static getJavaPackage()[source]¶: Returns the package name as a string.

classmethod read()[source]¶: Returns an MLReader instance for this class.

mmlspark.featurize.DataConversion module

class mmlspark.featurize.DataConversion.DataConversion(cols=None, convertTo='', dateTimeFormat='yyyy-MM-dd HH:mm:ss')[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

cols (list) – List of columns whose type will be converted
convertTo (str) – The result type (default: empty string)
dateTimeFormat (str) – Format for DateTime when making DateTime:String conversions (default: yyyy-MM-dd HH:mm:ss)
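
The default `dateTimeFormat` looks like a Java-style date pattern (an assumption; this page does not name the pattern syntax). To show what a DateTime:String conversion with that layout produces, here is a small standard-library sketch using the equivalent Python `strftime` directives. `PY_FORMAT` is a hypothetical name for the translated pattern and is not part of mmlspark.

```python
from datetime import datetime

# Python counterpart of the pattern 'yyyy-MM-dd HH:mm:ss':
# four-digit year, month, day, then 24-hour time with seconds.
PY_FORMAT = "%Y-%m-%d %H:%M:%S"

stamp = datetime(2017, 3, 9, 14, 5, 30)

# DateTime -> String
as_text = stamp.strftime(PY_FORMAT)

# String -> DateTime, using the same pattern in the other direction
round_trip = datetime.strptime(as_text, PY_FORMAT)
```

Using one pattern for both directions keeps the conversion lossless down to whole seconds.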

getCols()[source]¶

Returns: List of columns whose type will be converted
Return type: list

getConvertTo()[source]¶

Returns: The result type (default: empty string)
Return type: str

getDateTimeFormat()[source]¶

Returns: Format for DateTime when making DateTime:String conversions (default: yyyy-MM-dd HH:mm:ss)
Return type: str

static getJavaPackage()[source]¶: Returns the package name as a string.

classmethod read()[source]¶: Returns an MLReader instance for this class.

setCols(value)[source]¶

Parameters: cols (list) – List of columns whose type will be converted

setConvertTo(value)[source]¶

Parameters: convertTo (str) – The result type (default: empty string)

setDateTimeFormat(value)[source]¶

Parameters: dateTimeFormat (str) – Format for DateTime when making DateTime:String conversions (default: yyyy-MM-dd HH:mm:ss)

setParams(cols=None, convertTo='', dateTimeFormat='yyyy-MM-dd HH:mm:ss')[source]¶

Set the (keyword only) parameters

Parameters

cols (list) – List of columns whose type will be converted
convertTo (str) – The result type (default: empty string)
dateTimeFormat (str) – Format for DateTime when making DateTime:String conversions (default: yyyy-MM-dd HH:mm:ss)

mmlspark.featurize.Featurize module

class mmlspark.featurize.Featurize.Featurize(allowImages=False, featureColumns=None, numberOfFeatures=262144, oneHotEncodeCategoricals=True)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters

allowImages (bool) – Allow featurization of images (default: false)
featureColumns (dict) – Feature columns
numberOfFeatures (int) – Number of features to hash string columns to (default: 262144)
oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

getAllowImages()[source]¶

Returns: Allow featurization of images (default: false)
Return type: bool

getFeatureColumns()[source]¶

Returns: Feature columns
Return type: dict

static getJavaPackage()[source]¶: Returns the package name as a string.

getNumberOfFeatures()[source]¶

Returns: Number of features to hash string columns to (default: 262144)
Return type: int

getOneHotEncodeCategoricals()[source]¶

Returns: One-hot encode categoricals (default: true)
Return type: bool

classmethod read()[source]¶: Returns an MLReader instance for this class.

setAllowImages(value)[source]¶

Parameters: allowImages (bool) – Allow featurization of images (default: false)

setFeatureColumns(value)[source]¶

Parameters: featureColumns (dict) – Feature columns

setNumberOfFeatures(value)[source]¶

Parameters: numberOfFeatures (int) – Number of features to hash string columns to (default: 262144)

setOneHotEncodeCategoricals(value)[source]¶

Parameters: oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

setParams(allowImages=False, featureColumns=None, numberOfFeatures=262144, oneHotEncodeCategoricals=True)[source]¶

Set the (keyword only) parameters

Parameters

allowImages (bool) – Allow featurization of images (default: false)
featureColumns (dict) – Feature columns
numberOfFeatures (int) – Number of features to hash string columns to (default: 262144)
oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

class mmlspark.featurize.Featurize.PipelineModel(java_model=None)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by Featurize.

This class is left empty on purpose. All necessary methods are exposed through inheritance.

static getJavaPackage()[source]¶: Returns the package name as a string.

classmethod read()[source]¶: Returns an MLReader instance for this class.

mmlspark.featurize.IndexToValue module

class mmlspark.featurize.IndexToValue.IndexToValue(inputCol=None, outputCol=None)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

inputCol (str) – The name of the input column
outputCol (str) – The name of the output column
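
As the class name suggests, IndexToValue maps category indices back to their original values. A minimal plain-Python sketch of that inverse lookup follows. It assumes the ordered list of levels is already known, and it models that list as an ordinary Python list; how the library obtains the levels is not described on this page.

```python
# Hypothetical levels: position i holds the value that was encoded as index i.
levels = ["cat", "dog", "fish"]

# A column of indices, as an indexing step might have produced it.
indexed_column = [2, 0, 1, 0]

# The inverse mapping is a positional lookup into the level list.
restored = [levels[i] for i in indexed_column]
```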

getInputCol()[source]¶

Returns: The name of the input column
Return type: str

static getJavaPackage()[source]¶: Returns the package name as a string.

getOutputCol()[source]¶

Returns: The name of the output column
Return type: str

classmethod read()[source]¶: Returns an MLReader instance for this class.

setInputCol(value)[source]¶

Parameters: inputCol (str) – The name of the input column

setOutputCol(value)[source]¶

Parameters: outputCol (str) – The name of the output column

setParams(inputCol=None, outputCol=None)[source]¶

Set the (keyword only) parameters

Parameters

inputCol (str) – The name of the input column
outputCol (str) – The name of the output column

mmlspark.featurize.ValueIndexer module

class mmlspark.featurize.ValueIndexer.ValueIndexer(inputCol=None, outputCol=None)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters

inputCol (str) – The name of the input column
outputCol (str) – The name of the output column
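
ValueIndexer is an estimator, so fitting presumably collects the distinct values (the levels) of the input column, and the fitted model then replaces each value with its index. The sketch below shows that idea in plain Python. The helpers `fit_levels` and `index_values` are hypothetical, and sorting the levels is an assumption made for determinism; this page does not state the ordering the library uses.

```python
def fit_levels(values):
    """'Fit' step: collect the distinct values in a deterministic order."""
    return sorted(set(values))


def index_values(values, levels):
    """'Transform' step: replace each value with its position in the levels."""
    lookup = {level: i for i, level in enumerate(levels)}
    return [lookup[v] for v in values]


column = ["dog", "cat", "dog", "fish"]

levels = fit_levels(column)
indices = index_values(column, levels)
```

Keeping the level list alongside the indices is what makes the encoding reversible later.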

getInputCol()[source]¶

Returns: The name of the input column
Return type: str

static getJavaPackage()[source]¶: Returns the package name as a string.

getOutputCol()[source]¶

Returns: The name of the output column
Return type: str

classmethod read()[source]¶: Returns an MLReader instance for this class.

setInputCol(value)[source]¶

Parameters: inputCol (str) – The name of the input column

setOutputCol(value)[source]¶

Parameters: outputCol (str) – The name of the output column

setParams(inputCol=None, outputCol=None)[source]¶

Set the (keyword only) parameters

Parameters

inputCol (str) – The name of the input column
outputCol (str) – The name of the output column

class mmlspark.featurize.ValueIndexer.ValueIndexerModel(java_model=None)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by ValueIndexer.

This class is left empty on purpose. All necessary methods are exposed through inheritance.

static getJavaPackage()[source]¶: Returns the package name as a string.

classmethod read()[source]¶: Returns an MLReader instance for this class.

mmlspark.featurize.ValueIndexerModel module

class mmlspark.featurize.ValueIndexerModel.ValueIndexerModel(dataType='string', inputCol='input', levels=None, outputCol=None)[source]¶

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters

dataType (str) – The datatype of the levels as a Json string (default: string)
inputCol (str) – The name of the input column (default: input)
levels (object) – Levels in categorical array
outputCol (str) – The name of the output column (default: [self.uid]_output)

getDataType()[source]¶

Returns: The datatype of the levels as a Json string (default: string)
Return type: str

getInputCol()[source]¶

Returns: The name of the input column (default: input)
Return type: str

static getJavaPackage()[source]¶: Returns the package name as a string.

getLevels()[source]¶

Returns: Levels in categorical array
Return type: object

getOutputCol()[source]¶

Returns: The name of the output column (default: [self.uid]_output)
Return type: str

classmethod read()[source]¶: Returns an MLReader instance for this class.

setDataType(value)[source]¶

Parameters: dataType (str) – The datatype of the levels as a Json string (default: string)

setInputCol(value)[source]¶

Parameters: inputCol (str) – The name of the input column (default: input)

setLevels(value)[source]¶

Parameters: levels (object) – Levels in categorical array

setOutputCol(value)[source]¶

Parameters: outputCol (str) – The name of the output column (default: [self.uid]_output)

setParams(dataType='string', inputCol='input', levels=None, outputCol=None)[source]¶

Set the (keyword only) parameters

Parameters

dataType (str) – The datatype of the levels as a Json string (default: string)
inputCol (str) – The name of the input column (default: input)
levels (object) – Levels in categorical array
outputCol (str) – The name of the output column (default: [self.uid]_output)

Module contents

MicrosoftML is a library of Python classes that interface with the Microsoft Scala APIs, which use Apache Spark to create distributed machine learning models.

MicrosoftML simplifies training and scoring classifiers and regressors, and facilitates creating models that use the CNTK library, images, and text.