mmlspark.cyber.anomaly package¶
Submodules¶
mmlspark.cyber.anomaly.collaborative_filtering module¶
-
class mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomaly(tenantCol: str = 'tenant', userCol: str = 'user', resCol: str = 'res', likelihoodCol: str = 'likelihood', outputCol: str = 'anomaly_score', rankParam: int = 10, maxIter: int = 25, regParam: float = 1.0, numBlocks: Optional[int] = None, separateTenants: bool = False, lowValue: Optional[float] = 5.0, highValue: Optional[float] = 10.0, applyImplicitCf: bool = True, alphaParam: Optional[float] = None, complementsetFactor: Optional[int] = None, negScore: Optional[float] = None, historyAccessDf: Optional[pyspark.sql.dataframe.DataFrame] = None)[source]¶
Bases: pyspark.ml.base.Estimator
AccessAnomaly is a pyspark.ml.Estimator that trains an AccessAnomalyModel, which is a pyspark.ml.Transformer.
-
alphaParam
= Param(parent='undefined', name='alphaParam', doc='alphaParam is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0).')¶
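In the standard implicit-feedback formulation of ALS, each observed preference r is weighted by a confidence c = 1 + alpha * r, so alphaParam controls how strongly observed access counts outweigh unobserved pairs. A minimal sketch of that weighting (pure Python, independent of this library; the function name is illustrative, not part of the API):

```python
def confidence(likelihood: float, alpha: float = 1.0) -> float:
    """Implicit-feedback ALS confidence: c = 1 + alpha * r.

    Unobserved pairs (likelihood 0) keep the baseline confidence 1;
    a larger alpha makes observed accesses count more heavily.
    """
    return 1.0 + alpha * likelihood

# With the default alpha of 1.0, an unseen pair has confidence 1.0
# and a pair observed with likelihood 5.0 has confidence 6.0.
print(confidence(0.0))  # 1.0
print(confidence(5.0))  # 6.0
```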
-
applyImplicitCf
= Param(parent='undefined', name='applyImplicitCf', doc='specifies whether to use the implicit/explicit feedback ALS for the data (defaults to True which means using implicit feedback).')¶
-
complementsetFactor
= Param(parent='undefined', name='complementsetFactor', doc='complementsetFactor is a parameter applicable to the explicit feedback variant of ALS that governs the estimated average size of the complement set to generate (defaults to 2).')¶
-
create_spark_model_vectors_df
(df: pyspark.sql.dataframe.DataFrame) → mmlspark.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping[source]¶
-
highValue
= Param(parent='undefined', name='highValue', doc='highValue is used to scale the values of likelihood_col to be in the range [lowValue, highValue] (defaults to 10.0).')¶
-
historyAccessDf
= Param(parent='undefined', name='historyAccessDf', doc='historyAccessDf is an optional spark dataframe which includes the list of seen user resource pairs for which the anomaly score should be zero.')¶
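The effect described for historyAccessDf can be sketched as a post-processing step that zeroes the score of any previously seen pair (a hypothetical pure-Python illustration; the library applies this over Spark dataframes):

```python
def apply_history(scored, history):
    """Force the anomaly score of previously seen (user, res) pairs to zero.

    scored: list of (user, res, score) tuples; history: iterable of (user, res).
    """
    seen = set(history)
    return [(u, r, 0.0 if (u, r) in seen else s) for u, r, s in scored]

rows = [("alice", "db1", 2.5), ("alice", "db2", 1.8)]
print(apply_history(rows, [("alice", "db1")]))
# [('alice', 'db1', 0.0), ('alice', 'db2', 1.8)]
```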
-
property
indexed_res_col
¶
-
property
indexed_user_col
¶
-
likelihoodCol
= Param(parent='undefined', name='likelihoodCol', doc='The name of the column with the likelihood estimate for user, res access (usually based on access counts per time unit). ')¶
-
lowValue
= Param(parent='undefined', name='lowValue', doc='lowValue is used to scale the values of likelihood_col to be in the range [lowValue, highValue] (defaults to 5.0).')¶
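The lowValue/highValue pair rescales the raw likelihood column into a fixed range before training. A plausible min-max rescaling consistent with that description (a sketch only; the library's exact scaling over Spark dataframes may differ):

```python
def scale_likelihood(values, low=5.0, high=10.0):
    """Min-max scale values into [low, high] (illustrative, not the library API)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal: map everything to the low end
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

print(scale_likelihood([1.0, 2.0, 3.0]))  # [5.0, 7.5, 10.0]
```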
-
maxIter
= Param(parent='undefined', name='maxIter', doc='maxIter is the maximum number of iterations to run (defaults to 25).')¶
-
negScore
= Param(parent='undefined', name='negScore', doc='negScore is a parameter applicable to the explicit feedback variant of ALS that governs the value to assign to the values of the complement set (defaults to 1.0).')¶
-
numBlocks
= Param(parent='undefined', name='numBlocks', doc='numBlocks is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to |tenants| if separate_tenants is False else 10).')¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='The name of the output column representing the calculated anomaly score. Values will be between (-inf, +inf) with an estimated mean of 0.0 and standard deviation of 1.0. ')¶
-
rankParam
= Param(parent='undefined', name='rankParam', doc='rankParam is the number of latent factors in the model (defaults to 10).')¶
-
regParam
= Param(parent='undefined', name='regParam', doc='regParam specifies the regularization parameter in ALS (defaults to 1.0).')¶
-
resCol
= Param(parent='undefined', name='resCol', doc='The name of the resource column in the dataframe.')¶
-
property
res_vec_col
¶
-
property
scaled_likelihood_col
¶
-
separateTenants
= Param(parent='undefined', name='separateTenants', doc='separateTenants applies the algorithm per tenant in isolation. Setting to True may reduce runtime significantly if the number of tenants is large, but may come at the cost of accuracy (defaults to False).')¶
-
tenantCol
= Param(parent='undefined', name='tenantCol', doc='The name of the tenant column. This is a unique identifier used to partition the dataframe into independent groups where the values in each such group are completely isolated from one another. Note: if this column is irrelevant for your data, then just create a tenant column and give it a single value for all rows.')¶
-
userCol
= Param(parent='undefined', name='userCol', doc='The name of the user column in the dataframe.')¶
-
property
user_vec_col
¶
-
-
class mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomalyConfig[source]¶
Bases: object
Define default values for AccessAnomaly Params
-
default_alpha
= 1.0¶
-
default_apply_implicit_cf
= True¶
-
default_complementset_factor
= 2¶
-
default_high_value
= 10.0¶
-
default_likelihood_col
= 'likelihood'¶
-
default_low_value
= 5.0¶
-
default_max_iter
= 25¶
-
default_neg_score
= 1.0¶
-
default_num_blocks
= None¶
-
default_output_col
= 'anomaly_score'¶
-
default_rank
= 10¶
-
default_reg_param
= 1.0¶
-
default_res_col
= 'res'¶
-
default_separate_tenants
= False¶
-
default_tenant_col
= 'tenant'¶
-
default_user_col
= 'user'¶
-
-
class mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomalyModel(userResourceFeatureVectorMapping: mmlspark.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping, outputCol: str)[source]¶
Bases: pyspark.ml.base.Transformer
A pyspark.ml.Transformer model that can predict anomaly scores for user, resource access pairs.
-
static
load
(spark: pyspark.sql.context.SQLContext, path: str, output_format: str = 'parquet') → mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomalyModel[source]¶
-
outputCol
= Param(parent='undefined', name='outputCol', doc='The name of the output column representing the calculated anomaly score. Values will be between (-inf, +inf) with an estimated mean of 0.0 and standard deviation of 1.0.')¶
-
property
res_col
¶
-
property
res_mapping_df
¶
-
property
res_vec_col
¶
-
property
tenant_col
¶
-
property
user_col
¶
-
property
user_mapping_df
¶
-
property
user_vec_col
¶
-
-
class mmlspark.cyber.anomaly.collaborative_filtering.ConnectedComponents(tenantCol: str, userCol: str, res_col: str, componentColName: str = 'component')[source]¶
Bases: object
-
class mmlspark.cyber.anomaly.collaborative_filtering.ModelNormalizeTransformer(access_df: pyspark.sql.dataframe.DataFrame, rank: int)[source]¶
Bases: object
Given a UserResourceCfDataframeModel, this class creates and returns a new, normalized UserResourceCfDataframeModel whose anomaly scores have a mean of 0.0 and a standard deviation of 1.0 when applied to the given dataframe.
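The normalization described above amounts to a z-score transform of the raw scores. A self-contained sketch of that transform (pure Python; the library performs the equivalent computation over Spark dataframes):

```python
import statistics

def normalize_scores(scores):
    """Shift and scale scores to mean 0.0 and (population) std 1.0."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    return [(s - mu) / sigma for s in scores]

normed = normalize_scores([2.0, 4.0, 6.0])
print([round(x, 4) for x in normed])  # [-1.2247, 0.0, 1.2247]
```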
mmlspark.cyber.anomaly.complement_access module¶
-
class mmlspark.cyber.anomaly.complement_access.ComplementAccessTransformer(partition_key: Optional[str], indexed_col_names_arr: List[str], complementset_factor: int)[source]¶
Bases: pyspark.ml.base.Transformer
Given a dataframe, this transformer returns a new dataframe with access patterns sampled from the set of possible access patterns that did not occur in the given dataframe (i.e., a sample from the complement set).
-
complementsetFactor
= Param(parent='undefined', name='complementsetFactor', doc='The estimated average size of the complement set to generate')¶
-
indexedColNamesArr
= Param(parent='undefined', name='indexedColNamesArr', doc='The name of the fields to use to generate the complement set from')¶
-
partitionKey
= Param(parent='undefined', name='partitionKey', doc='The name of the partition_key field name')¶
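The complement-set sampling that ComplementAccessTransformer performs can be illustrated with a hypothetical pure-Python sketch (the actual transformer works on indexed Spark columns and samples roughly complementset_factor rows per observed row; the function below is not part of the API):

```python
import random

def sample_complement(observed, users, resources, factor, seed=0):
    """Sample up to factor * len(observed) (user, res) pairs NOT in observed."""
    rng = random.Random(seed)
    observed = set(observed)
    complement = [(u, r) for u in users for r in resources
                  if (u, r) not in observed]
    k = min(factor * len(observed), len(complement))
    return rng.sample(complement, k)

seen = [("alice", "db1"), ("bob", "db2")]
neg = sample_complement(seen, ["alice", "bob"], ["db1", "db2"], factor=1)
print(sorted(neg))  # [('alice', 'db2'), ('bob', 'db1')]
```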
-
Module contents¶
MicrosoftML is a library of Python classes that interface with Microsoft's Scala APIs, using Apache Spark to create distributed machine learning models.
MicrosoftML simplifies training and scoring classifiers and regressors, and facilitates the creation of models using the CNTK library, images, and text.