mmlspark.cyber.feature package

Submodules

mmlspark.cyber.feature.indexers module
class mmlspark.cyber.feature.indexers.IdIndexer(input_col: str, partition_key: str, output_col: str, reset_per_partition: bool)

    Bases: pyspark.ml.base.Estimator, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol

    partitionKey = Param(parent='undefined', name='partitionKey', doc='The name of the column to partition by, i.e., make sure the indexing takes the partition into account. This is exemplified in reset_per_partition.')

    resetPerPartition = Param(parent='undefined', name='resetPerPartition', doc='When set to True, indexing is consecutive from [1..n] within each value of the partition column. When set to False, indexing is consecutive across all partition and column values.')
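A minimal usage sketch follows. IdIndexer is a standard pyspark.ml Estimator, so fit returns a fitted model (an IdIndexerModel) whose transform adds the output column. The DataFrame and column names below are illustrative assumptions, not part of the API:

    from pyspark.sql import SparkSession
    from mmlspark.cyber.feature.indexers import IdIndexer

    spark = SparkSession.builder.getOrCreate()

    # Illustrative data: (tenant, user) pairs.
    df = spark.createDataFrame(
        [("tenant_a", "alice"), ("tenant_a", "bob"), ("tenant_b", "alice")],
        ["tenant", "user"],
    )

    # Map each user to an integer id; with reset_per_partition=True the
    # ids run consecutively from 1 within every tenant partition.
    indexer = IdIndexer(
        input_col="user",
        partition_key="tenant",
        output_col="user_id",
        reset_per_partition=True,
    )

    model = indexer.fit(df)           # standard pyspark.ml Estimator API
    indexed_df = model.transform(df)  # adds the user_id column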
class mmlspark.cyber.feature.indexers.IdIndexerModel(input_col: str, partition_key: str, output_col: str, vocab_df: pyspark.sql.dataframe.DataFrame)

    Bases: pyspark.ml.base.Transformer, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol

    partitionKey = Param(parent='undefined', name='partitionKey', doc='The name of the column to partition by, i.e., make sure the indexing takes the partition into account. This is exemplified in reset_per_partition.')
class mmlspark.cyber.feature.indexers.MultiIndexer(indexers: List[mmlspark.cyber.feature.indexers.IdIndexer])

    Bases: pyspark.ml.base.Estimator

class mmlspark.cyber.feature.indexers.MultiIndexerModel(models: List[mmlspark.cyber.feature.indexers.IdIndexerModel])

    Bases: pyspark.ml.base.Transformer
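MultiIndexer bundles several IdIndexer instances so that multiple columns can be indexed in one pass. A hedged sketch, reusing the illustrative df above (and assuming it also carries a resource column) and assuming fit returns a MultiIndexerModel that applies each underlying model:

    from mmlspark.cyber.feature.indexers import IdIndexer, MultiIndexer

    user_indexer = IdIndexer(input_col="user", partition_key="tenant",
                             output_col="user_id", reset_per_partition=True)
    resource_indexer = IdIndexer(input_col="resource", partition_key="tenant",
                                 output_col="resource_id", reset_per_partition=True)

    multi = MultiIndexer(indexers=[user_indexer, resource_indexer])
    multi_model = multi.fit(df)          # fits every contained IdIndexer
    indexed = multi_model.transform(df)  # adds user_id and resource_id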
mmlspark.cyber.feature.scalers module
class mmlspark.cyber.feature.scalers.LinearScalarScaler(input_col: str, partition_key: Optional[str], output_col: str, min_required_value: float = 0.0, max_required_value: float = 1.0, use_pandas: bool = True)

    Bases: mmlspark.cyber.feature.scalers.PerPartitionScalarScalerEstimator

    maxRequiredValue = Param(parent='undefined', name='maxRequiredValue', doc='Scale the outputCol to have a value between [minRequiredValue, maxRequiredValue].')

    minRequiredValue = Param(parent='undefined', name='minRequiredValue', doc='Scale the outputCol to have a value between [minRequiredValue, maxRequiredValue].')
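LinearScalarScaler linearly rescales the input column into the [min_required_value, max_required_value] range, per partition when a partition key is given. A minimal sketch, reusing the illustrative df from above and assuming it carries a numeric score column:

    from mmlspark.cyber.feature.scalers import LinearScalarScaler

    scaler = LinearScalarScaler(
        input_col="score",
        partition_key="tenant",    # pass None to scale over the whole DataFrame
        output_col="scaled_score",
        min_required_value=0.0,
        max_required_value=1.0,
    )
    scaler_model = scaler.fit(df)        # returns a LinearScalarScalerModel
    scaled = scaler_model.transform(df)  # scaled_score lies in [0.0, 1.0]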
class mmlspark.cyber.feature.scalers.LinearScalarScalerConfig

    Bases: object

    max_actual_value_token = '__max_actual_value__'

    min_actual_value_token = '__min_actual_value__'
class mmlspark.cyber.feature.scalers.LinearScalarScalerModel(input_col: str, partition_key: Optional[str], output_col: str, per_group_stats: Union[pyspark.sql.dataframe.DataFrame, Dict[str, float]], min_required_value: float, max_required_value: float, use_pandas: bool = True)

    Bases: mmlspark.cyber.feature.scalers.PerPartitionScalarScalerModel
class mmlspark.cyber.feature.scalers.PerPartitionScalarScalerEstimator(input_col: str, partition_key: Optional[str], output_col: str, use_pandas: bool = True)

    Bases: abc.ABC, pyspark.ml.base.Estimator, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol

    partitionKey = Param(parent='undefined', name='partitionKey', doc='The name of the column to partition by, i.e., scale the values of inputCol within each partition.')

    property use_pandas
class mmlspark.cyber.feature.scalers.PerPartitionScalarScalerModel(input_col: str, partition_key: Optional[str], output_col: str, per_group_stats: Union[pyspark.sql.dataframe.DataFrame, Dict[str, float]], use_pandas: bool = True)

    Bases: abc.ABC, pyspark.ml.base.Transformer, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol

    partitionKey = Param(parent='undefined', name='partitionKey', doc='The name of the column to partition by, i.e., scale the values of inputCol within each partition.')

    property per_group_stats

    property use_pandas
class mmlspark.cyber.feature.scalers.StandardScalarScaler(input_col: str, partition_key: Optional[str], output_col: str, coefficient_factor: float = 1.0, use_pandas: bool = True)

    Bases: mmlspark.cyber.feature.scalers.PerPartitionScalarScalerEstimator

    coefficientFactor = Param(parent='undefined', name='coefficientFactor', doc='After scaling, the values of outputCol are multiplied by this coefficient (defaults to 1.0).')
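StandardScalarScaler standardizes the input column using per-partition statistics; the mean and standard deviation tokens in StandardScalarScalerConfig suggest z-score style scaling, though the exact formula is not spelled out in this reference. A hedged sketch, again reusing the illustrative df with a numeric score column:

    from mmlspark.cyber.feature.scalers import StandardScalarScaler

    scaler = StandardScalarScaler(
        input_col="score",
        partition_key="tenant",   # pass None for global statistics
        output_col="score_std",
        coefficient_factor=1.0,   # extra multiplier applied after scaling
    )
    model = scaler.fit(df)             # returns a StandardScalarScalerModel
    standardized = model.transform(df)

    # The fitted statistics are exposed via the inherited property.
    stats = model.per_group_stats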
class mmlspark.cyber.feature.scalers.StandardScalarScalerConfig

    Bases: object

    The tokens to use for temporary representation of mean and standard deviation.

    mean_token = '__mean__'

    std_token = '__std__'
class mmlspark.cyber.feature.scalers.StandardScalarScalerModel(input_col: str, partition_key: Optional[str], output_col: str, per_group_stats: Union[pyspark.sql.dataframe.DataFrame, Dict[str, float]], coefficient_factor: float = 1.0, use_pandas: bool = True)

    Bases: mmlspark.cyber.feature.scalers.PerPartitionScalarScalerModel

    coefficientFactor = Param(parent='undefined', name='coefficientFactor', doc='After scaling, the values of outputCol are multiplied by this coefficient (defaults to 1.0).')
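Since per_group_stats also accepts a plain Dict[str, float], a model can plausibly be constructed directly from known statistics when no partition key is used. The dict layout below, keyed by the StandardScalarScalerConfig tokens, is an assumption based on those tokens' names, not something this reference confirms:

    from mmlspark.cyber.feature.scalers import (
        StandardScalarScalerConfig,
        StandardScalarScalerModel,
    )

    # Assumption: the config tokens key the dict form of per_group_stats.
    stats = {
        StandardScalarScalerConfig.mean_token: 10.0,  # '__mean__'
        StandardScalarScalerConfig.std_token: 2.0,    # '__std__'
    }

    model = StandardScalarScalerModel(
        input_col="score",
        partition_key=None,       # dict statistics imply no partitioning
        output_col="score_std",
        per_group_stats=stats,
    )
    rescored = model.transform(df)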
Module contents

MicrosoftML is a library of Python classes that interface with Microsoft's Scala APIs and use Apache Spark to create distributed machine learning models. MicrosoftML simplifies training and scoring classifiers and regressors, and facilitates creating models from the CNTK library, images, and text.