mmlspark.cyber.anomaly package


mmlspark.cyber.anomaly.collaborative_filtering module

class mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomaly(tenantCol: str = 'tenant', userCol: str = 'user', resCol: str = 'res', likelihoodCol: str = 'likelihood', outputCol: str = 'anomaly_score', rankParam: int = 10, maxIter: int = 25, regParam: float = 1.0, numBlocks: Optional[int] = None, separateTenants: bool = False, lowValue: Optional[float] = 5.0, highValue: Optional[float] = 10.0, applyImplicitCf: bool = True, alphaParam: Optional[float] = None, complementsetFactor: Optional[int] = None, negScore: Optional[float] = None, historyAccessDf: Optional[pyspark.sql.dataframe.DataFrame] = None)[source]


This is the AccessAnomaly, a pyspark.ml.Estimator which creates the AccessAnomalyModel, which is a pyspark.ml.Transformer.

alphaParam = Param(parent='undefined', name='alphaParam', doc='alphaParam is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0).')
applyImplicitCf = Param(parent='undefined', name='applyImplicitCf', doc='specifies whether to use the implicit/explicit feedback ALS for the data (defaults to True which means using implicit feedback).')
complementsetFactor = Param(parent='undefined', name='complementsetFactor', doc='complementsetFactor is a parameter applicable to the explicit feedback variant of ALS that governs the estimated average size of the complement set to generate (defaults to 2).')
create_spark_model_vectors_df(df: pyspark.sql.dataframe.DataFrame) mmlspark.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping[source]
highValue = Param(parent='undefined', name='highValue', doc='highValue is used to scale the values of likelihood_col to be in the range [lowValue, highValue] (defaults to 10.0).')
historyAccessDf = Param(parent='undefined', name='historyAccessDf', doc='historyAccessDf is an optional spark dataframe which includes the list of seen user resource pairs for which the anomaly score should be zero.')
property indexed_res_col
property indexed_user_col
likelihoodCol = Param(parent='undefined', name='likelihoodCol', doc='The name of the column with the likelihood estimate for user, res access (usually based on access counts per time unit). ')
lowValue = Param(parent='undefined', name='lowValue', doc='lowValue is used to scale the values of likelihood_col to be in the range [lowValue, highValue] (defaults to 5.0).')
maxIter = Param(parent='undefined', name='maxIter', doc='maxIter is the maximum number of iterations to run (defaults to 25).')
negScore = Param(parent='undefined', name='negScore', doc='negScore is a parameter applicable to the explicit feedback variant of ALS that governs the value to assign to the entries of the complement set (defaults to 1.0).')
numBlocks = Param(parent='undefined', name='numBlocks', doc='numBlocks is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to |tenants| if separate_tenants is False else 10).')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column representing the calculated anomaly score. Values will be between (-inf, +inf) with an estimated mean of 0.0 and standard deviation of 1.0. ')
rankParam = Param(parent='undefined', name='rankParam', doc='rankParam is the number of latent factors in the model (defaults to 10).')
regParam = Param(parent='undefined', name='regParam', doc='regParam specifies the regularization parameter in ALS (defaults to 1.0).')
resCol = Param(parent='undefined', name='resCol', doc='The name of the resource column in the dataframe.')
property res_vec_col
property scaled_likelihood_col
separateTenants = Param(parent='undefined', name='separateTenants', doc='separateTenants applies the algorithm per tenant in isolation. Setting it to True may reduce runtime significantly if the number of tenants is large, but will increase accuracy (defaults to False).')
tenantCol = Param(parent='undefined', name='tenantCol', doc='The name of the tenant column. This is a unique identifier used to partition the dataframe into independent groups where the values in each such group are completely isolated from one another. Note: if this column is irrelevant for your data, then just create a tenant column and give it a single value for all rows.')
userCol = Param(parent='undefined', name='userCol', doc='The name of the user column in the dataframe.')
property user_vec_col
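
The lowValue and highValue params describe scaling the likelihood column into the range [lowValue, highValue]. The following is a minimal pure-Python sketch of one way such scaling could work, assuming a simple linear min-max mapping. The helper name scale_likelihood is hypothetical and is not part of mmlspark, which may scale differently (for example, per tenant).

```python
from typing import List


def scale_likelihood(
    values: List[float], low_value: float = 5.0, high_value: float = 10.0
) -> List[float]:
    """Linearly map values onto [low_value, high_value].

    Illustrative helper only; this is an assumption, not library code.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All inputs are equal, so place them at the midpoint of the target range.
        return [(low_value + high_value) / 2.0 for _ in values]
    span = high_value - low_value
    return [low_value + (v - lo) / (hi - lo) * span for v in values]


# Hypothetical access counts for three (user, res) pairs.
access_counts = [1.0, 3.0, 5.0]
scaled = scale_likelihood(access_counts)
print(scaled)  # [5.0, 7.5, 10.0]
```

With the documented defaults (lowValue 5.0, highValue 10.0), the smallest likelihood maps to 5.0 and the largest to 10.0.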
class mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomalyConfig[source]

Bases: object

Define default values for AccessAnomaly Params

default_alpha = 1.0
default_apply_implicit_cf = True
default_complementset_factor = 2
default_high_value = 10.0
default_likelihood_col = 'likelihood'
default_low_value = 5.0
default_max_iter = 25
default_neg_score = 1.0
default_num_blocks = None
default_output_col = 'anomaly_score'
default_rank = 10
default_reg_param = 1.0
default_res_col = 'res'
default_separate_tenants = False
default_tenant_col = 'tenant'
default_user_col = 'user'
class mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomalyModel(userResourceFeatureVectorMapping: mmlspark.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping, outputCol: str)[source]


A model that can predict anomaly scores for user, resource access pairs.

static load(spark: pyspark.sql.context.SQLContext, path: str, output_format: str = 'parquet') mmlspark.cyber.anomaly.collaborative_filtering.AccessAnomalyModel[source]
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column representing the calculated anomaly score. Values will be between (-inf, +inf) with an estimated mean of 0.0 and standard deviation of 1.0.')

property res_col
property res_mapping_df
property res_vec_col
save(path: str, path_suffix: str = '', output_format: str = 'parquet')[source]
property tenant_col
property user_col
property user_mapping_df
property user_vec_col
class mmlspark.cyber.anomaly.collaborative_filtering.ConnectedComponents(tenantCol: str, userCol: str, res_col: str, componentColName: str = 'component')[source]

Bases: object

transform(df: pyspark.sql.dataframe.DataFrame) Tuple[pyspark.sql.dataframe.DataFrame, pyspark.sql.dataframe.DataFrame][source]
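
The reference gives no description for ConnectedComponents beyond its name and column arguments. As a rough illustration of the general idea, the sketch below groups users and resources into connected components of the bipartite access graph using union-find, in plain Python. It handles a single tenant only, the function name bipartite_components is hypothetical, and returning one mapping for users and one for resources is a guess based on the two-DataFrame return type, not a statement about the library's behavior.

```python
from typing import Dict, Iterable, Tuple

Node = Tuple[str, str]  # (kind, name), where kind is "user" or "res"


def bipartite_components(
    accesses: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Label each user and resource with the id of its connected component."""
    parent: Dict[Node, Node] = {}

    def find(node: Node) -> Node:
        # Follow parent links to the root, halving the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a: Node, b: Node) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for user, res in accesses:
        u, r = ("user", user), ("res", res)
        parent.setdefault(u, u)
        parent.setdefault(r, r)
        union(u, r)  # an access links the user and the resource

    labels: Dict[Node, int] = {}
    user_component: Dict[str, int] = {}
    res_component: Dict[str, int] = {}
    for kind, name in list(parent):
        root = find((kind, name))
        label = labels.setdefault(root, len(labels))
        target = user_component if kind == "user" else res_component
        target[name] = label
    return user_component, res_component


# Hypothetical access log: alice and bob share db1, carol only touches db2.
users, resources = bipartite_components(
    [("alice", "db1"), ("bob", "db1"), ("carol", "db2")]
)
print(users)      # {'alice': 0, 'bob': 0, 'carol': 1}
print(resources)  # {'db1': 0, 'db2': 1}
```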
class mmlspark.cyber.anomaly.collaborative_filtering.ModelNormalizeTransformer(access_df: pyspark.sql.dataframe.DataFrame, rank: int)[source]

Bases: object

Given a UserResourceCfDataframeModel, this class creates and returns a new normalized UserResourceCfDataframeModel whose anomaly score has a mean of 0.0 and a standard deviation of 1.0 when applied to the given dataframe.

transform(user_res_cf_df_model: mmlspark.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping) mmlspark.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping[source]
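
The normalization goal described above (scores with a mean of 0.0 and a standard deviation of 1.0 on the given dataframe) amounts to standardizing the raw scores. The snippet below is a minimal pure-Python sketch of that idea, assuming the population standard deviation. The function name standardize and the sample values are hypothetical, and the library operates on the model's feature vectors rather than on a list of scores.

```python
from statistics import mean, pstdev
from typing import List


def standardize(raw_scores: List[float]) -> List[float]:
    """Shift and rescale scores to mean 0.0 and standard deviation 1.0."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)  # population standard deviation (an assumption)
    if sigma == 0.0:
        # Constant scores carry no signal, so map them all to 0.0.
        return [0.0 for _ in raw_scores]
    return [(s - mu) / sigma for s in raw_scores]


# Hypothetical raw scores with mean 5.0 and population standard deviation 2.0.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = standardize(raw)
print(normalized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```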

mmlspark.cyber.anomaly.complement_access module

class mmlspark.cyber.anomaly.complement_access.ComplementAccessTransformer(partition_key: Optional[str], indexed_col_names_arr: List[str], complementset_factor: int)[source]


Given a dataframe, it returns a new dataframe with access patterns sampled from the set of possible access patterns which did not occur in the given dataframe (i.e., it returns a sample from the complement set).

complementsetFactor = Param(parent='undefined', name='complementsetFactor', doc='The estimated average size of the complement set to generate')
indexedColNamesArr = Param(parent='undefined', name='indexedColNamesArr', doc='The names of the fields from which to generate the complement set')
partitionKey = Param(parent='undefined', name='partitionKey', doc='The name of the partition key field')
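
To make the idea of sampling from the complement set concrete, here is a minimal pure-Python sketch for a single partition. It assumes users and resources are already indexed as integers, draws complementset_factor random candidate pairs per observed pair, and keeps only those that were never observed. The function name sample_complement, its arguments, and the sampling strategy are all illustrative assumptions, not the library's implementation.

```python
import random
from typing import List, Set, Tuple


def sample_complement(
    observed: Set[Tuple[int, int]],
    num_users: int,
    num_resources: int,
    complementset_factor: int,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Sample (user, res) index pairs that do not appear in `observed`."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sampled: Set[Tuple[int, int]] = set()
    # Draw complementset_factor candidates per observed pair.
    for _ in range(complementset_factor * len(observed)):
        pair = (rng.randrange(num_users), rng.randrange(num_resources))
        if pair not in observed:
            sampled.add(pair)  # keep only accesses that never occurred
    return sorted(sampled)


# Hypothetical indexed accesses: 4 users, 5 resources, 3 observed pairs.
observed = {(0, 0), (1, 2), (3, 4)}
complement = sample_complement(
    observed, num_users=4, num_resources=5, complementset_factor=2
)
print(complement)
```

Because candidates that collide with observed pairs or with each other are dropped, the result contains at most complementset_factor times as many pairs as the input, which is consistent with the parameter being described as an estimated average size rather than an exact one.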

Module contents

