mmlspark.cyber.anomaly package¶
Submodules¶
mmlspark.cyber.anomaly.collaborative_filtering module¶
-
class mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomaly(tenantCol: str = 'tenant', userCol: str = 'user', resCol: str = 'res', likelihoodCol: str = 'likelihood', outputCol: str = 'anomaly_score', rankParam: int = 10, maxIter: int = 25, regParam: float = 1.0, numBlocks: Optional[int] = None, separateTenants: bool = False, lowValue: Optional[float] = 5.0, highValue: Optional[float] = 10.0, applyImplicitCf: bool = True, alphaParam: Optional[float] = None, complementsetFactor: Optional[int] = None, negScore: Optional[float] = None, historyAccessDf: Optional[pyspark.sql.dataframe.DataFrame] = None)[source]¶
Bases: pyspark.ml.base.Estimator
AccessAnomaly is a pyspark.ml.Estimator that trains an AccessAnomalyModel, which is a pyspark.ml.Transformer.
-
alphaParam
= Param(parent='undefined', name='alphaParam', doc='alphaParam is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0).')¶
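In the standard implicit-feedback formulation of ALS, each observed preference r is weighted by a confidence c = 1 + alpha * r, so alphaParam controls how strongly observed access counts outweigh unobserved pairs. A minimal sketch of that weighting (pure Python, independent of this library; the function name is illustrative, not part of the API):

```python
def confidence(likelihood: float, alpha: float = 1.0) -> float:
    """Implicit-feedback ALS confidence: c = 1 + alpha * r.

    Unobserved pairs (likelihood 0) keep the baseline confidence 1;
    a larger alpha makes observed accesses count more heavily.
    """
    return 1.0 + alpha * likelihood

# With the default alpha of 1.0, an unseen pair has confidence 1.0
# and a pair observed with likelihood 5.0 has confidence 6.0.
print(confidence(0.0))  # 1.0
print(confidence(5.0))  # 6.0
```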
-
applyImplicitCf
= Param(parent='undefined', name='applyImplicitCf', doc='specifies whether to use the implicit/explicit feedback ALS for the data (defaults to True which means using implicit feedback).')¶
-
complementsetFactor
= Param(parent='undefined', name='complementsetFactor', doc='complementsetFactor is a parameter applicable to the explicit feedback variant of ALS that governs the estimated average size of the complement set to generate (defaults to 2).')¶
-
create_spark_model_vectors_df
(df: pyspark.sql.dataframe.DataFrame) → mmlspark.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping[source]¶
-
highValue
= Param(parent='undefined', name='highValue', doc='highValue is used to scale the values of likelihood_col to be in the range [lowValue, highValue] (defaults to 10.0).')¶
-
historyAccessDf
= Param(parent='undefined', name='historyAccessDf', doc='historyAccessDf is an optional spark dataframe which includes the list of seen user resource pairs for which the anomaly score should be zero.')¶
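The effect described for historyAccessDf can be sketched as a post-processing step that zeroes the score of any previously seen pair (a hypothetical pure-Python illustration; the library applies this over Spark dataframes):

```python
def apply_history(scored, history):
    """Force the anomaly score of previously seen (user, res) pairs to zero.

    scored: list of (user, res, score) tuples; history: iterable of (user, res).
    """
    seen = set(history)
    return [(u, r, 0.0 if (u, r) in seen else s) for u, r, s in scored]

rows = [("alice", "db1", 2.5), ("alice", "db2", 1.8)]
print(apply_history(rows, [("alice", "db1")]))
# [('alice', 'db1', 0.0), ('alice', 'db2', 1.8)]
```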
-
property
indexed_res_col
¶
-
property
indexed_user_col
¶
-
likelihoodCol
= Param(parent='undefined', name='likelihoodCol', doc='The name of the column with the likelihood estimate for user, res access (usually based on access counts per time unit). ')¶
-
lowValue
= Param(parent='undefined', name='lowValue', doc='lowValue is used to scale the values of likelihood_col to be in the range [lowValue, highValue] (defaults to 5.0).')¶
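The lowValue/highValue pair rescales the raw likelihood column into a fixed range before training. A plausible min-max rescaling consistent with that description (a sketch only; the library's exact scaling over Spark dataframes may differ):

```python
def scale_likelihood(values, low=5.0, high=10.0):
    """Min-max scale values into [low, high] (illustrative, not the library API)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal: map everything to the low end
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

print(scale_likelihood([1.0, 2.0, 3.0]))  # [5.0, 7.5, 10.0]
```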
-
maxIter
= Param(parent='undefined', name='maxIter', doc='maxIter is the maximum number of iterations to run (defaults to 25).')¶
-
negScore
= Param(parent='undefined', name='negScore', doc='negScore is a parameter applicable to the explicit feedback variant of ALS that governs the value to assign to the values of the complement set (defaults to 1.0).')¶
-
numBlocks
= Param(parent='undefined', name='numBlocks', doc='numBlocks is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to |tenants| if separate_tenants is False else 10).')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='The name of the output column representing the calculated anomaly score. Values will be between (-inf, +inf) with an estimated mean of 0.0 and standard deviation of 1.0. ')¶
-
rankParam
= Param(parent='undefined', name='rankParam', doc='rankParam is the number of latent factors in the model (defaults to 10).')¶
-
regParam
= Param(parent='undefined', name='regParam', doc='regParam specifies the regularization parameter in ALS (defaults to 1.0).')¶
-
resCol
= Param(parent='undefined', name='resCol', doc='The name of the resource column in the dataframe.')¶
-
property
res_vec_col
¶
-
property
scaled_likelihood_col
¶
-
separateTenants
= Param(parent='undefined', name='separateTenants', doc='separateTenants applies the algorithm per tenant in isolation. Setting to True may reduce runtime significantly if the number of tenants is large, but may come at the cost of accuracy (defaults to False).')¶
-
tenantCol
= Param(parent='undefined', name='tenantCol', doc='The name of the tenant column. This is a unique identifier used to partition the dataframe into independent groups where the values in each such group are completely isolated from one another. Note: if this column is irrelevant for your data, then just create a tenant column and give it a single value for all rows.')¶
-
userCol
= Param(parent='undefined', name='userCol', doc='The name of the user column in the dataframe.')¶
-
property
user_vec_col
¶
-
-
class mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomalyConfig[source]¶
Bases: object
Define default values for AccessAnomaly Params
-
default_alpha
= 1.0¶
-
default_apply_implicit_cf
= True¶
-
default_complementset_factor
= 2¶
-
default_high_value
= 10.0¶
-
default_likelihood_col
= 'likelihood'¶
-
default_low_value
= 5.0¶
-
default_max_iter
= 25¶
-
default_neg_score
= 1.0¶
-
default_num_blocks
= None¶
-
default_output_col
= 'anomaly_score'¶
-
default_rank
= 10¶
-
default_reg_param
= 1.0¶
-
default_res_col
= 'res'¶
-
default_separate_tenants
= False¶
-
default_tenant_col
= 'tenant'¶
-
default_user_col
= 'user'¶
-
-
class mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomalyModel(userResourceFeatureVectorMapping: mmlspark.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping, outputCol: str)[source]¶
Bases: pyspark.ml.base.Transformer
A pyspark.ml.Transformer model that can predict anomaly scores for user, resource access pairs.
-
static
load
(spark: pyspark.sql.context.SQLContext, path: str, output_format: str = 'parquet') → mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomalyModel[source]¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='The name of the output column representing the calculated anomaly score. Values will be between (-inf, +inf) with an estimated mean of 0.0 and standard deviation of 1.0.')¶
-
property
res_col
¶
-
property
res_mapping_df
¶
-
property
res_vec_col
¶
-
property
tenant_col
¶
-
property
user_col
¶
-
property
user_mapping_df
¶
-
property
user_vec_col
¶
-
-
class mmlspark.cyber.anomaly.collaborative_filtering.ConnectedComponents(tenantCol: str, userCol: str, res_col: str, componentColName: str = 'component')[source]¶
Bases: object
-
class mmlspark.cyber.anomaly.collaborative_filtering.ModelNormalizeTransformer(access_df: pyspark.sql.dataframe.DataFrame, rank: int)[source]¶
Bases: object
Given a UserResourceCfDataframeModel, this class creates and returns a new, normalized UserResourceCfDataframeModel whose anomaly scores have a mean of 0.0 and a standard deviation of 1.0 when applied to the given dataframe.
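The normalization described above amounts to a z-score transform of the raw scores. A self-contained sketch of that transform (pure Python; the library performs the equivalent computation over Spark dataframes):

```python
import statistics

def normalize_scores(scores):
    """Shift and scale scores to mean 0.0 and (population) std 1.0."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    return [(s - mu) / sigma for s in scores]

normed = normalize_scores([2.0, 4.0, 6.0])
print([round(x, 4) for x in normed])  # [-1.2247, 0.0, 1.2247]
```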
mmlspark.cyber.anomaly.complement_access module¶
-
class mmlspark.cyber.anomaly.complement_access.ComplementAccessTransformer(partition_key: Optional[str], indexed_col_names_arr: List[str], complementset_factor: int)[source]¶
Bases: pyspark.ml.base.Transformer
Given a dataframe, this transformer returns a new dataframe with access patterns sampled from the set of possible access patterns that did not occur in the given dataframe (i.e., a sample from the complement set).
-
complementsetFactor
= Param(parent='undefined', name='complementsetFactor', doc='The estimated average size of the complement set to generate')¶
-
indexedColNamesArr
= Param(parent='undefined', name='indexedColNamesArr', doc='The name of the fields to use to generate the complement set from')¶
-
partitionKey
= Param(parent='undefined', name='partitionKey', doc='The name of the partition_key field name')¶
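The complement-set sampling that ComplementAccessTransformer performs can be illustrated with a hypothetical pure-Python sketch (the actual transformer works on indexed Spark columns and samples roughly complementset_factor rows per observed row; the function below is not part of the API):

```python
import random

def sample_complement(observed, users, resources, factor, seed=0):
    """Sample up to factor * len(observed) (user, res) pairs NOT in observed."""
    rng = random.Random(seed)
    observed = set(observed)
    complement = [(u, r) for u in users for r in resources
                  if (u, r) not in observed]
    k = min(factor * len(observed), len(complement))
    return rng.sample(complement, k)

seen = [("alice", "db1"), ("bob", "db2")]
neg = sample_complement(seen, ["alice", "bob"], ["db1", "db2"], factor=1)
print(sorted(neg))  # [('alice', 'db2'), ('bob', 'db1')]
```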
-
Module contents¶
MicrosoftML is a library of Python classes that interface with Microsoft's Scala APIs, using Apache Spark to create distributed machine learning models.
MicrosoftML simplifies training and scoring classifiers and regressors, and facilitates the creation of models using the CNTK library, images, and text.