mmlspark.featurize package

Submodules

mmlspark.featurize.AssembleFeatures module

class mmlspark.featurize.AssembleFeatures.AssembleFeatures(allowImages=False, columnsToFeaturize=None, featuresCol='features', numberOfFeatures=None, oneHotEncodeCategoricals=True)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters
  • allowImages (bool) – Allow featurization of images (default: false)

  • columnsToFeaturize (list) – Columns to featurize

  • featuresCol (str) – The name of the features column (default: features)

  • numberOfFeatures (int) – Number of features to hash string columns to

  • oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

getAllowImages()[source]
Returns

Allow featurization of images (default: false)

Return type

bool

getColumnsToFeaturize()[source]
Returns

Columns to featurize

Return type

list

getFeaturesCol()[source]
Returns

The name of the features column (default: features)

Return type

str

static getJavaPackage()[source]

Returns package name String.

getNumberOfFeatures()[source]
Returns

Number of features to hash string columns to

Return type

int

getOneHotEncodeCategoricals()[source]
Returns

One-hot encode categoricals (default: true)

Return type

bool

classmethod read()[source]

Returns an MLReader instance for this class.

setAllowImages(value)[source]
Parameters

allowImages (bool) – Allow featurization of images (default: false)

setColumnsToFeaturize(value)[source]
Parameters

columnsToFeaturize (list) – Columns to featurize

setFeaturesCol(value)[source]
Parameters

featuresCol (str) – The name of the features column (default: features)

setNumberOfFeatures(value)[source]
Parameters

numberOfFeatures (int) – Number of features to hash string columns to

setOneHotEncodeCategoricals(value)[source]
Parameters

oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

setParams(allowImages=False, columnsToFeaturize=None, featuresCol='features', numberOfFeatures=None, oneHotEncodeCategoricals=True)[source]

Set the (keyword only) parameters

Parameters
  • allowImages (bool) – Allow featurization of images (default: false)

  • columnsToFeaturize (list) – Columns to featurize

  • featuresCol (str) – The name of the features column (default: features)

  • numberOfFeatures (int) – Number of features to hash string columns to

  • oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

class mmlspark.featurize.AssembleFeatures.AssembleFeaturesModel(java_model=None)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by AssembleFeatures.

This class is left empty on purpose. All necessary methods are exposed through inheritance.

static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.
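
A minimal usage sketch for AssembleFeatures. The import path mirrors the module path above; the SparkSession and sample DataFrame are illustrative, and the session is assumed to already have the MMLSpark package on its classpath.

    from pyspark.sql import SparkSession
    from mmlspark.featurize.AssembleFeatures import AssembleFeatures

    # Assumes an existing SparkSession with the MMLSpark jar available.
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(25, "blue", 1.0), (40, "red", 0.0), (33, "blue", 1.0)],
        ["age", "color", "label"],
    )

    # Featurize the listed columns into a single vector column named "features".
    assembler = (AssembleFeatures()
                 .setColumnsToFeaturize(["age", "color"])
                 .setFeaturesCol("features")
                 .setOneHotEncodeCategoricals(True))

    model = assembler.fit(df)                # returns an AssembleFeaturesModel
    model.transform(df).show(truncate=False)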

mmlspark.featurize.CleanMissingData module

class mmlspark.featurize.CleanMissingData.CleanMissingData(cleaningMode='Mean', customValue=None, inputCols=None, outputCols=None)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters
  • cleaningMode (str) – Cleaning mode (default: Mean)

  • customValue (str) – Custom value for replacement

  • inputCols (list) – The names of the input columns

  • outputCols (list) – The names of the output columns

getCleaningMode()[source]
Returns

Cleaning mode (default: Mean)

Return type

str

getCustomValue()[source]
Returns

Custom value for replacement

Return type

str

getInputCols()[source]
Returns

The names of the input columns

Return type

list

static getJavaPackage()[source]

Returns package name String.

getOutputCols()[source]
Returns

The names of the output columns

Return type

list

classmethod read()[source]

Returns an MLReader instance for this class.

setCleaningMode(value)[source]
Parameters

cleaningMode (str) – Cleaning mode (default: Mean)

setCustomValue(value)[source]
Parameters

customValue (str) – Custom value for replacement

setInputCols(value)[source]
Parameters

inputCols (list) – The names of the input columns

setOutputCols(value)[source]
Parameters

outputCols (list) – The names of the output columns

setParams(cleaningMode='Mean', customValue=None, inputCols=None, outputCols=None)[source]

Set the (keyword only) parameters

Parameters
  • cleaningMode (str) – Cleaning mode (default: Mean)

  • customValue (str) – Custom value for replacement

  • inputCols (list) – The names of the input columns

  • outputCols (list) – The names of the output columns

class mmlspark.featurize.CleanMissingData.CleanMissingDataModel(java_model=None)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by CleanMissingData.

This class is left empty on purpose. All necessary methods are exposed through inheritance.

static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.
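
A minimal sketch of CleanMissingData, replacing nulls with the per-column mean (the documented default mode). The DataFrame is illustrative, and reusing the input column names as outputCols is an assumption.

    from pyspark.sql import SparkSession
    from mmlspark.featurize.CleanMissingData import CleanMissingData

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1.0, 2.0), (None, 4.0), (3.0, None)],
        ["a", "b"],
    )

    # Fill missing values in "a" and "b" with each column's mean.
    cleaner = (CleanMissingData()
               .setInputCols(["a", "b"])
               .setOutputCols(["a", "b"])
               .setCleaningMode("Mean"))

    cleaner.fit(df).transform(df).show()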

mmlspark.featurize.DataConversion module

class mmlspark.featurize.DataConversion.DataConversion(cols=None, convertTo='', dateTimeFormat='yyyy-MM-dd HH:mm:ss')[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters
  • cols (list) – Comma-separated list of columns whose type will be converted

  • convertTo (str) – The result type (default: empty string)

  • dateTimeFormat (str) – Format for DateTime when making DateTime:String conversions (default: yyyy-MM-dd HH:mm:ss)

getCols()[source]
Returns

Comma-separated list of columns whose type will be converted

Return type

list

getConvertTo()[source]
Returns

The result type (default: empty string)

Return type

str

getDateTimeFormat()[source]
Returns

Format for DateTime when making DateTime:String conversions (default: yyyy-MM-dd HH:mm:ss)

Return type

str

static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.

setCols(value)[source]
Parameters

cols (list) – Comma-separated list of columns whose type will be converted

setConvertTo(value)[source]
Parameters

convertTo (str) – The result type (default: empty string)

setDateTimeFormat(value)[source]
Parameters

dateTimeFormat (str) – Format for DateTime when making DateTime:String conversions (default: yyyy-MM-dd HH:mm:ss)

setParams(cols=None, convertTo='', dateTimeFormat='yyyy-MM-dd HH:mm:ss')[source]

Set the (keyword only) parameters

Parameters
  • cols (list) – Comma-separated list of columns whose type will be converted

  • convertTo (str) – The result type (default: empty string)

  • dateTimeFormat (str) – Format for DateTime when making DateTime:String conversions (default: yyyy-MM-dd HH:mm:ss)
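
A minimal DataConversion sketch. The set of values accepted by convertTo is not listed on this page, so the "double" target below is an assumption; the DataFrame is illustrative.

    from pyspark.sql import SparkSession
    from mmlspark.featurize.DataConversion import DataConversion

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("1", "7"), ("2", "11")], ["id", "count"])

    # Convert the string "count" column in place to a numeric type.
    converter = (DataConversion()
                 .setCols(["count"])
                 .setConvertTo("double"))   # assumed type name

    converter.transform(df).printSchema()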

mmlspark.featurize.Featurize module

class mmlspark.featurize.Featurize.Featurize(allowImages=False, featureColumns=None, numberOfFeatures=262144, oneHotEncodeCategoricals=True)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters
  • allowImages (bool) – Allow featurization of images (default: false)

  • featureColumns (dict) – Feature columns

  • numberOfFeatures (int) – Number of features to hash string columns to (default: 262144)

  • oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

getAllowImages()[source]
Returns

Allow featurization of images (default: false)

Return type

bool

getFeatureColumns()[source]
Returns

Feature columns

Return type

dict

static getJavaPackage()[source]

Returns package name String.

getNumberOfFeatures()[source]
Returns

Number of features to hash string columns to (default: 262144)

Return type

int

getOneHotEncodeCategoricals()[source]
Returns

One-hot encode categoricals (default: true)

Return type

bool

classmethod read()[source]

Returns an MLReader instance for this class.

setAllowImages(value)[source]
Parameters

allowImages (bool) – Allow featurization of images (default: false)

setFeatureColumns(value)[source]
Parameters

featureColumns (dict) – Feature columns

setNumberOfFeatures(value)[source]
Parameters

numberOfFeatures (int) – Number of features to hash string columns to (default: 262144)

setOneHotEncodeCategoricals(value)[source]
Parameters

oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

setParams(allowImages=False, featureColumns=None, numberOfFeatures=262144, oneHotEncodeCategoricals=True)[source]

Set the (keyword only) parameters

Parameters
  • allowImages (bool) – Allow featurization of images (default: false)

  • featureColumns (dict) – Feature columns

  • numberOfFeatures (int) – Number of features to hash string columns to (default: 262144)

  • oneHotEncodeCategoricals (bool) – One-hot encode categoricals (default: true)

class mmlspark.featurize.Featurize.PipelineModel(java_model=None)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by Featurize.

This class is left empty on purpose. All necessary methods are exposed through inheritance.

static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.
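
A minimal Featurize sketch. The featureColumns dictionary is assumed to map an output feature column name to the list of input columns to combine; the DataFrame is illustrative.

    from pyspark.sql import SparkSession
    from mmlspark.featurize.Featurize import Featurize

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(25, "blue", 1.0), (40, "red", 0.0)],
        ["age", "color", "label"],
    )

    # Assumed dict shape: output column name -> input columns to featurize.
    featurizer = (Featurize()
                  .setFeatureColumns({"features": ["age", "color"]})
                  .setNumberOfFeatures(262144)
                  .setOneHotEncodeCategoricals(True))

    featurized = featurizer.fit(df).transform(df)
    featurized.select("features", "label").show(truncate=False)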

mmlspark.featurize.IndexToValue module

class mmlspark.featurize.IndexToValue.IndexToValue(inputCol=None, outputCol=None)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters
  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

getInputCol()[source]
Returns

The name of the input column

Return type

str

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

str

classmethod read()[source]

Returns an MLReader instance for this class.

setInputCol(value)[source]
Parameters

inputCol (str) – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol (str) – The name of the output column

setParams(inputCol=None, outputCol=None)[source]

Set the (keyword only) parameters

Parameters
  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column
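
IndexToValue is typically paired with ValueIndexer (documented below): the indexer maps values to category indices at fit time, and IndexToValue maps those indices back to the original values. A round-trip sketch with an illustrative DataFrame:

    from pyspark.sql import SparkSession
    from mmlspark.featurize.IndexToValue import IndexToValue
    from mmlspark.featurize.ValueIndexer import ValueIndexer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("red",), ("blue",), ("red",)], ["color"])

    # Index the string column, then map the indices back to the original values.
    indexed = (ValueIndexer()
               .setInputCol("color")
               .setOutputCol("colorIndex")
               .fit(df)
               .transform(df))

    restored = (IndexToValue()
                .setInputCol("colorIndex")
                .setOutputCol("colorRestored")
                .transform(indexed))
    restored.show()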

mmlspark.featurize.ValueIndexer module

class mmlspark.featurize.ValueIndexer.ValueIndexer(inputCol=None, outputCol=None)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Parameters
  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

getInputCol()[source]
Returns

The name of the input column

Return type

str

static getJavaPackage()[source]

Returns package name String.

getOutputCol()[source]
Returns

The name of the output column

Return type

str

classmethod read()[source]

Returns an MLReader instance for this class.

setInputCol(value)[source]
Parameters

inputCol (str) – The name of the input column

setOutputCol(value)[source]
Parameters

outputCol (str) – The name of the output column

setParams(inputCol=None, outputCol=None)[source]

Set the (keyword only) parameters

Parameters
  • inputCol (str) – The name of the input column

  • outputCol (str) – The name of the output column

class mmlspark.featurize.ValueIndexer.ValueIndexerModel(java_model=None)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by ValueIndexer.

This class is left empty on purpose. All necessary methods are exposed through inheritance.

static getJavaPackage()[source]

Returns package name String.

classmethod read()[source]

Returns an MLReader instance for this class.
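
A minimal ValueIndexer sketch: fitting produces a ValueIndexerModel that records the observed levels and can be saved through the inherited MLWritable support. The DataFrame and output path are illustrative.

    from pyspark.sql import SparkSession
    from mmlspark.featurize.ValueIndexer import ValueIndexer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])

    indexer = ValueIndexer().setInputCol("color").setOutputCol("colorIndex")
    model = indexer.fit(df)        # ValueIndexerModel
    model.transform(df).show()

    # Persist the fitted model (path is illustrative).
    model.write().overwrite().save("/tmp/valueIndexerModel")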

mmlspark.featurize.ValueIndexerModel module

class mmlspark.featurize.ValueIndexerModel.ValueIndexerModel(dataType='string', inputCol='input', levels=None, outputCol=None)[source]

Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters
  • dataType (str) – The datatype of the levels as a JSON string (default: string)

  • inputCol (str) – The name of the input column (default: input)

  • levels (object) – Levels in categorical array

  • outputCol (str) – The name of the output column (default: [self.uid]_output)

getDataType()[source]
Returns

The datatype of the levels as a JSON string (default: string)

Return type

str

getInputCol()[source]
Returns

The name of the input column (default: input)

Return type

str

static getJavaPackage()[source]

Returns package name String.

getLevels()[source]
Returns

Levels in categorical array

Return type

object

getOutputCol()[source]
Returns

The name of the output column (default: [self.uid]_output)

Return type

str

classmethod read()[source]

Returns an MLReader instance for this class.

setDataType(value)[source]
Parameters

dataType (str) – The datatype of the levels as a JSON string (default: string)

setInputCol(value)[source]
Parameters

inputCol (str) – The name of the input column (default: input)

setLevels(value)[source]
Parameters

levels (object) – Levels in categorical array

setOutputCol(value)[source]
Parameters

outputCol (str) – The name of the output column (default: [self.uid]_output)

setParams(dataType='string', inputCol='input', levels=None, outputCol=None)[source]

Set the (keyword only) parameters

Parameters
  • dataType (str) – The datatype of the levels as a JSON string (default: string)

  • inputCol (str) – The name of the input column (default: input)

  • levels (object) – Levels in categorical array

  • outputCol (str) – The name of the output column (default: [self.uid]_output)
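
Unlike the model returned by ValueIndexer.fit, this class exposes a Python constructor, so a model can be assembled directly from known levels. A sketch; whether a plain Python list is accepted for the complex levels parameter is an assumption, and the column names and level values are illustrative.

    from mmlspark.featurize.ValueIndexerModel import ValueIndexerModel

    # Build the transformer directly from known categorical levels instead of
    # fitting a ValueIndexer.
    model = (ValueIndexerModel()
             .setInputCol("color")
             .setOutputCol("colorIndex")
             .setDataType("string")
             .setLevels(["blue", "green", "red"]))   # assumed to accept a Python list

    # model.transform(df) would then map each "color" value to its index
    # among the configured levels.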

Module contents

MicrosoftML is a library of Python classes that interface with the Microsoft Scala APIs and use Apache Spark to create distributed machine learning models.

MicrosoftML simplifies the training and scoring of classifiers and regressors, and facilitates creating models using the CNTK library, images, and text.