Creates a vector column of features from a collection of feature columns.
Model produced by AssembleFeatures.
Removes missing values from input dataset.
Model produced by CleanMissingData.
Class containing the list of column names to perform special featurization steps for.
colNamesToHash - List of column names to hash.
colNamesToDuplicateForMissings - List of column names containing doubles to duplicate so we can remove missing values from them.
colNamesToTypes - Map of column names to their types.
colNamesToCleanMissings - List of column names to clean missing values from (ignore).
colNamesToVectorize - List of column names to vectorize using FastVectorAssembler.
categoricalColumns - List of categorical columns to pass through or turn into indicator array.
conversionColumnNamesMap - Map from old column names to new.
addedColumnNamesMap - Map from old columns to newly generated columns for featurization.
Converts the specified list of columns to the specified type. Returns a new DataFrame with the converted columns.
Featurizes a dataset. Converts the specified columns to feature columns.
Takes a categorical column with MML-style attributes and transforms it back to the original values. Extends Spark ML's IndexToString by allowing the transformation back to values of any type.
Fits a dictionary of values from the input column. The resulting model then transforms a column into a categorical column over the given array of values. Similar to StringIndexer, except that it can be used on values of any type.
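The fit-then-transform idea, together with the reverse mapping back to original values, can be sketched in plain Python. This is only an illustration of the concept, not the library's API: the function names fit_levels, to_indices and to_values are hypothetical, and the choice to order the dictionary by sorting the distinct values is an assumption made for the example.

```python
from typing import Hashable, List, Sequence


def fit_levels(values: Sequence[Hashable]) -> List[Hashable]:
    """Build the dictionary: the distinct non-null values of a column.

    Sorting is an assumption of this sketch; it just makes the
    value-to-index assignment deterministic.
    """
    return sorted({v for v in values if v is not None})


def to_indices(values: Sequence[Hashable], levels: List[Hashable]) -> List[int]:
    """Map each value to its position in the fitted dictionary.

    Values that were not seen during fitting (including None) raise KeyError.
    """
    lookup = {level: i for i, level in enumerate(levels)}
    return [lookup[v] for v in values]


def to_values(indices: Sequence[int], levels: List[Hashable]) -> List[Hashable]:
    """Reverse mapping: turn indices back into the original values."""
    return [levels[i] for i in indices]


# Works for any hashable, orderable value type, not only strings.
column = [3.5, 1.0, 3.5, 2.0]
levels = fit_levels(column)            # [1.0, 2.0, 3.5]
indices = to_indices(column, levels)   # [2, 0, 2, 1]
restored = to_values(indices, levels)  # [3.5, 1.0, 3.5, 2.0]
```

Keeping the fitted dictionary separate from the transformation mirrors the estimator and model split described above: the dictionary is computed once and can then be applied to other columns.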
Model produced by ValueIndexer.
Removes missing values from input dataset. The following modes are supported:
Mean - replaces missing values with the mean of the fit column
Median - replaces missing values with the approximate median of the fit column
Custom - replaces missing values with a custom value specified by the user
For mean and median modes, only numeric column types are supported, specifically: Int, Long, Float, Double
For custom mode, the types above are supported and additionally: String, Boolean
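The three modes can be illustrated with a small plain-Python sketch. It is not the library's implementation or API: the function name fill_missing and its parameters are hypothetical, missing values are assumed to be None or NaN, and the median here is exact rather than approximate.

```python
import math
import statistics
from typing import Any, List, Optional, Sequence


def is_missing(value: Any) -> bool:
    """Assumption of this sketch: None and float NaN count as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(column: Sequence[Any], mode: str,
                 custom_value: Optional[Any] = None) -> List[Any]:
    """Replace missing entries according to the chosen mode."""
    present = [v for v in column if not is_missing(v)]
    if mode == "Mean":
        # Numeric columns only; raises StatisticsError if nothing is present.
        fill = statistics.mean(present)
    elif mode == "Median":
        # Exact median here; the description above refers to an approximate one.
        fill = statistics.median(present)
    elif mode == "Custom":
        # Any column type, since the caller supplies the replacement value.
        fill = custom_value
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return [fill if is_missing(v) else v for v in column]


numbers = [1.0, None, 2.0, 9.0]
mean_filled = fill_missing(numbers, "Mean")      # [1.0, 4.0, 2.0, 9.0]
median_filled = fill_missing(numbers, "Median")  # [1.0, 2.0, 2.0, 9.0]
custom_filled = fill_missing(["a", None], "Custom", custom_value="unknown")
# custom_filled == ["a", "unknown"]
```

The replacement value is computed from the non-missing entries of the fit column, which is why mean and median are restricted to numeric types while custom mode also accepts strings and booleans.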