synapse.ml.cyber.anomaly package

Submodules

synapse.ml.cyber.anomaly.collaborative_filtering module

class synapse.ml.cyber.anomaly.collaborative_filtering.AccessAnomaly(tenantCol: str = 'tenant', userCol: str = 'user', resCol: str = 'res', likelihoodCol: str = 'likelihood', outputCol: str = 'anomaly_score', rankParam: int = 10, maxIter: int = 25, regParam: float = 1.0, numBlocks: Optional[int] = None, separateTenants: bool = False, lowValue: Optional[float] = 5.0, highValue: Optional[float] = 10.0, applyImplicitCf: bool = True, alphaParam: Optional[float] = None, complementsetFactor: Optional[int] = None, negScore: Optional[float] = None, historyAccessDf: Optional[pyspark.sql.dataframe.DataFrame] = None)[source]

Bases: pyspark.ml.base.Estimator

AccessAnomaly is a pyspark.ml.Estimator that, when fit, produces an AccessAnomalyModel, which is a pyspark.ml.Transformer.
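
A minimal usage sketch (the dataframe contents, column values, and variable names below are illustrative assumptions, not part of the API):

    from pyspark.sql import SparkSession
    from synapse.ml.cyber.anomaly.collaborative_filtering import AccessAnomaly

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical access log: one row per (tenant, user, resource) access,
    # with a likelihood value typically derived from access counts per time unit.
    access_df = spark.createDataFrame(
        [
            ("t1", "alice", "res1", 5.0),
            ("t1", "alice", "res2", 3.0),
            ("t1", "bob", "res2", 2.0),
        ],
        ["tenant", "user", "res", "likelihood"],
    )

    estimator = AccessAnomaly(
        tenantCol="tenant", userCol="user", resCol="res", likelihoodCol="likelihood"
    )
    model = estimator.fit(access_df)        # returns an AccessAnomalyModel
    scored_df = model.transform(access_df)  # adds the 'anomaly_score' output column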

alphaParam = Param(parent='undefined', name='alphaParam', doc='alphaParam is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0).')
applyImplicitCf = Param(parent='undefined', name='applyImplicitCf', doc='specifies whether to use the implicit or explicit feedback variant of ALS (defaults to True, i.e., implicit feedback).')
complementsetFactor = Param(parent='undefined', name='complementsetFactor', doc='complementsetFactor is a parameter applicable to the explicit feedback variant of ALS that governs the estimated average size of the generated complement set (defaults to 2).')
create_spark_model_vectors_df(df: pyspark.sql.dataframe.DataFrame) synapse.ml.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping[source]
highValue = Param(parent='undefined', name='highValue', doc='highValue is used to scale the values of likelihood_col to be in the range [lowValue, highValue] (defaults to 10.0).')
historyAccessDf = Param(parent='undefined', name='historyAccessDf', doc='historyAccessDf is an optional spark dataframe which includes the list of seen user resource pairs for which the anomaly score should be zero.')
property indexed_res_col
property indexed_user_col
likelihoodCol = Param(parent='undefined', name='likelihoodCol', doc='The name of the column with the likelihood estimate for (user, res) access pairs (usually based on access counts per time unit).')
lowValue = Param(parent='undefined', name='lowValue', doc='lowValue is used to scale the values of likelihood_col to be in the range [lowValue, highValue] (defaults to 5.0).')
maxIter = Param(parent='undefined', name='maxIter', doc='maxIter is the maximum number of iterations to run (defaults to 25).')
negScore = Param(parent='undefined', name='negScore', doc='negScore is a parameter applicable to the explicit feedback variant of ALS that governs the score assigned to entries of the complement set (defaults to 1.0).')
numBlocks = Param(parent='undefined', name='numBlocks', doc='numBlocks is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to |tenants| if separate_tenants is False else 10).')
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column representing the calculated anomaly score. Values will be between (-inf, +inf) with an estimated mean of 0.0 and standard deviation of 1.0. ')
rankParam = Param(parent='undefined', name='rankParam', doc='rankParam is the number of latent factors in the model (defaults to 10).')
regParam = Param(parent='undefined', name='regParam', doc='regParam specifies the regularization parameter in ALS (defaults to 1.0).')
resCol = Param(parent='undefined', name='resCol', doc='The name of the resource column in the dataframe.')
property res_vec_col
property scaled_likelihood_col
separateTenants = Param(parent='undefined', name='separateTenants', doc='separateTenants applies the algorithm per tenant in isolation. Setting it to True may reduce runtime significantly if the number of tenants is large, and may increase accuracy (defaults to False).')
tenantCol = Param(parent='undefined', name='tenantCol', doc='The name of the tenant column. This is a unique identifier used to partition the dataframe into independent groups where the values in each such group are completely isolated from one another. Note: if this column is irrelevant for your data, then just create a tenant column and give it a single value for all rows.')
userCol = Param(parent='undefined', name='userCol', doc='The name of the user column in the dataframe.')
property user_vec_col
class synapse.ml.cyber.anomaly.collaborative_filtering.AccessAnomalyConfig[source]

Bases: object

Define default values for AccessAnomaly Params

default_alpha = 1.0
default_apply_implicit_cf = True
default_complementset_factor = 2
default_high_value = 10.0
default_likelihood_col = 'likelihood'
default_low_value = 5.0
default_max_iter = 25
default_neg_score = 1.0
default_num_blocks = None
default_output_col = 'anomaly_score'
default_rank = 10
default_reg_param = 1.0
default_res_col = 'res'
default_separate_tenants = False
default_tenant_col = 'tenant'
default_user_col = 'user'
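
These class attributes mirror the AccessAnomaly constructor defaults; a brief sketch of using them to spell the defaults out explicitly (equivalent, under that assumption, to AccessAnomaly()):

    from synapse.ml.cyber.anomaly.collaborative_filtering import (
        AccessAnomaly,
        AccessAnomalyConfig,
    )

    estimator = AccessAnomaly(
        tenantCol=AccessAnomalyConfig.default_tenant_col,
        userCol=AccessAnomalyConfig.default_user_col,
        resCol=AccessAnomalyConfig.default_res_col,
        likelihoodCol=AccessAnomalyConfig.default_likelihood_col,
        rankParam=AccessAnomalyConfig.default_rank,
        maxIter=AccessAnomalyConfig.default_max_iter,
        regParam=AccessAnomalyConfig.default_reg_param,
    )
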
class synapse.ml.cyber.anomaly.collaborative_filtering.AccessAnomalyModel(userResourceFeatureVectorMapping: synapse.ml.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping, outputCol: str)[source]

Bases: pyspark.ml.base.Transformer

A pyspark.ml.Transformer model that can predict anomaly scores for (user, resource) access pairs.

static load(spark: pyspark.sql.context.SQLContext, path: str, output_format: str = 'parquet') synapse.ml.cyber.anomaly.collaborative_filtering.AccessAnomalyModel[source]
outputCol = Param(parent='undefined', name='outputCol', doc='The name of the output column representing the calculated anomaly score. Values will be between (-inf, +inf) with an estimated mean of 0.0 and standard deviation of 1.0.')
property res_col
property res_mapping_df
property res_vec_col
save(path: str, path_suffix: str = '', output_format: str = 'parquet')[source]
property tenant_col
property user_col
property user_mapping_df
property user_vec_col
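
A sketch of round-tripping a trained model with save and load; the path and the model, spark, and access_df variables are placeholder assumptions:

    from pyspark.sql import SQLContext
    from synapse.ml.cyber.anomaly.collaborative_filtering import AccessAnomalyModel

    # `model` is assumed to be an AccessAnomalyModel produced by AccessAnomaly.fit(...).
    model.save("/tmp/access_anomaly_model", output_format="parquet")

    # load is a static method typed against SQLContext rather than SparkSession.
    sql_ctx = SQLContext(spark.sparkContext)
    restored = AccessAnomalyModel.load(sql_ctx, "/tmp/access_anomaly_model")
    scored_df = restored.transform(access_df)
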
class synapse.ml.cyber.anomaly.collaborative_filtering.ConnectedComponents(tenantCol: str, userCol: str, res_col: str, componentColName: str = 'component')[source]

Bases: object

transform(df: pyspark.sql.dataframe.DataFrame) Tuple[pyspark.sql.dataframe.DataFrame, pyspark.sql.dataframe.DataFrame][source]
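
A hedged sketch of calling the transformer; the column names and the interpretation of the two returned dataframes (the signature above only specifies a Tuple of dataframes) are assumptions:

    from synapse.ml.cyber.anomaly.collaborative_filtering import ConnectedComponents

    cc = ConnectedComponents(tenantCol="tenant", userCol="user", res_col="res")
    # transform returns two dataframes; presumably the input rows annotated with
    # the 'component' column plus a summary of the components themselves.
    df_with_components, components_df = cc.transform(access_df)
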
class synapse.ml.cyber.anomaly.collaborative_filtering.ModelNormalizeTransformer(access_df: pyspark.sql.dataframe.DataFrame, rank: int)[source]

Bases: object

Given a UserResourceCfDataframeModel, this class creates and returns a new, normalized UserResourceCfDataframeModel whose anomaly scores have a mean of 0.0 and a standard deviation of 1.0 when applied to the given dataframe.

transform(user_res_cf_df_model: synapse.ml.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping) synapse.ml.cyber.anomaly.collaborative_filtering._UserResourceFeatureVectorMapping[source]
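
A sketch, assuming the input mapping comes from AccessAnomaly.create_spark_model_vectors_df and that `estimator` and `access_df` are placeholders for a configured estimator and its training dataframe:

    from synapse.ml.cyber.anomaly.collaborative_filtering import ModelNormalizeTransformer

    mapping = estimator.create_spark_model_vectors_df(access_df)

    # rank is assumed to match the estimator's rankParam (10 by default).
    normalizer = ModelNormalizeTransformer(access_df, rank=10)
    normalized_mapping = normalizer.transform(mapping)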

synapse.ml.cyber.anomaly.complement_access module

class synapse.ml.cyber.anomaly.complement_access.ComplementAccessTransformer(partition_key: Optional[str], indexed_col_names_arr: List[str], complementset_factor: int)[source]

Bases: pyspark.ml.base.Transformer

Given a dataframe, returns a new dataframe with access patterns sampled from the set of possible access patterns that did not occur in the given dataframe (i.e., a sample from the complement set).

complementsetFactor = Param(parent='undefined', name='complementsetFactor', doc='The estimated average size of the complement set to generate')
indexedColNamesArr = Param(parent='undefined', name='indexedColNamesArr', doc='The names of the fields used to generate the complement set')
partitionKey = Param(parent='undefined', name='partitionKey', doc='The name of the partition key field')
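
A sketch with assumed column names; 'user_index' and 'res_index' are hypothetical columns holding already-indexed user and resource ids:

    from synapse.ml.cyber.anomaly.complement_access import ComplementAccessTransformer

    transformer = ComplementAccessTransformer(
        partition_key="tenant",
        indexed_col_names_arr=["user_index", "res_index"],
        complementset_factor=2,  # estimated average size of the generated complement set
    )
    complement_df = transformer.transform(indexed_df)  # indexed_df is a placeholder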

Module contents

SynapseML is an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. SynapseML adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. In this vein, SynapseML provides easy-to-use SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.