mmlspark.featurize.text package

Submodules
mmlspark.featurize.text.MultiNGram module

class mmlspark.featurize.text.MultiNGram.MultiNGram(inputCol=None, lengths=None, outputCol=None)

    Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

    Parameters
        inputCol (str) – The name of the input column
        lengths (object) – The collection of lengths to use for n-gram extraction
        outputCol (str) – The name of the output column (default: [self.uid]_output)

    getOutputCol()
        Returns: The name of the output column (default: [self.uid]_output)
        Return type: str

    setLengths(value)
        Parameters: lengths (object) – The collection of lengths to use for n-gram extraction

    setOutputCol(value)
        Parameters: outputCol (str) – The name of the output column (default: [self.uid]_output)
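A minimal usage sketch (the Spark session setup, toy sentence, and column names are illustrative assumptions, not part of the API reference). MultiNGram consumes an already-tokenized string-array column and emits the n-grams of every requested length in a single transform:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer
from mmlspark.featurize.text.MultiNGram import MultiNGram

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("the quick brown fox jumps over the lazy dog",)], ["text"]
)

# MultiNGram expects a tokenized input column (an array of strings).
tokenized = Tokenizer(inputCol="text", outputCol="tokens").transform(df)

# Extract unigrams, bigrams, and trigrams in one pass.
ngrams = MultiNGram(inputCol="tokens", lengths=[1, 2, 3], outputCol="ngrams")
ngrams.transform(tokenized).select("ngrams").show(truncate=False)
```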
mmlspark.featurize.text.PageSplitter module

class mmlspark.featurize.text.PageSplitter.PageSplitter(boundaryRegex='\s', inputCol=None, maximumPageLength=5000, minimumPageLength=4500, outputCol=None)

    Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

    Parameters
        boundaryRegex (str) – How to split into words (default: \s)
        inputCol (str) – The name of the input column
        maximumPageLength (int) – The maximum number of characters to be in a page (default: 5000)
        minimumPageLength (int) – The minimum number of characters to have on a page, in order to preserve word boundaries (default: 4500)
        outputCol (str) – The name of the output column (default: [self.uid]_output)

    getMaximumPageLength()
        Returns: The maximum number of characters to be in a page (default: 5000)
        Return type: int

    getMinimumPageLength()
        Returns: The minimum number of characters to have on a page, in order to preserve word boundaries (default: 4500)
        Return type: int

    getOutputCol()
        Returns: The name of the output column (default: [self.uid]_output)
        Return type: str

    setBoundaryRegex(value)
        Parameters: boundaryRegex (str) – How to split into words (default: \s)

    setMaximumPageLength(value)
        Parameters: maximumPageLength (int) – The maximum number of characters to be in a page (default: 5000)

    setMinimumPageLength(value)
        Parameters: minimumPageLength (int) – The minimum number of characters to have on a page, in order to preserve word boundaries (default: 4500)

    setOutputCol(value)
        Parameters: outputCol (str) – The name of the output column (default: [self.uid]_output)

    setParams(boundaryRegex='\s', inputCol=None, maximumPageLength=5000, minimumPageLength=4500, outputCol=None)
        Set the (keyword-only) parameters; accepts the same keyword arguments and defaults as the constructor.
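A minimal usage sketch, assuming an active Spark session with mmlspark available; the repeated toy document is an illustrative assumption. With the default settings, PageSplitter cuts the input string into pages of 4500 to 5000 characters, splitting on boundaryRegex so that pages end at word boundaries:

```python
from pyspark.sql import SparkSession
from mmlspark.featurize.text.PageSplitter import PageSplitter

spark = SparkSession.builder.getOrCreate()

# Roughly 10,800 characters of filler text, so the defaults yield ~3 pages.
df = spark.createDataFrame([("lorem ipsum dolor sit amet " * 400,)], ["text"])

# Defaults: minimumPageLength=4500, maximumPageLength=5000, boundaryRegex='\s'.
splitter = PageSplitter(inputCol="text", outputCol="pages")
splitter.transform(df).select("pages").show(truncate=False)
```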
mmlspark.featurize.text.TextFeaturizer module

class mmlspark.featurize.text.TextFeaturizer.TextFeaturizer(binary=False, caseSensitiveStopWords=False, defaultStopWordLanguage='english', inputCol=None, minDocFreq=1, minTokenLength=0, nGramLength=2, numFeatures=262144, outputCol=None, stopWords=None, toLowercase=True, tokenizerGaps=True, tokenizerPattern='\s+', useIDF=True, useNGram=False, useStopWordsRemover=False, useTokenizer=True)

    Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

    Parameters
        binary (bool) – If true, all nonnegative word counts are set to 1 (default: false)
        caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)
        defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to custom to use the stopWords input (default: english)
        inputCol (str) – The name of the input column
        minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)
        minTokenLength (int) – Minimum token length, >= 0 (default: 0)
        nGramLength (int) – The size of the n-grams (default: 2)
        numFeatures (int) – The number of features to hash each document to (default: 262144)
        outputCol (str) – The name of the output column (default: [self.uid]_output)
        stopWords (str) – The words to be filtered out
        toLowercase (bool) – Whether to convert all characters to lowercase before tokenizing (default: true)
        tokenizerGaps (bool) – Whether the regex splits on gaps (true) or matches tokens (false) (default: true)
        tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
        useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)
        useNGram (bool) – Whether to enumerate n-grams (default: false)
        useStopWordsRemover (bool) – Whether to remove stop words from tokenized data (default: false)
        useTokenizer (bool) – Whether to tokenize the input (default: true)

    getBinary()
        Returns: If true, all nonnegative word counts are set to 1 (default: false)
        Return type: bool

    getCaseSensitiveStopWords()
        Returns: Whether to do a case-sensitive comparison over the stop words (default: false)
        Return type: bool

    getDefaultStopWordLanguage()
        Returns: Which language to use for the stop-word remover; set this to custom to use the stopWords input (default: english)
        Return type: str

    getMinDocFreq()
        Returns: The minimum number of documents in which a term should appear (default: 1)
        Return type: int

    getNumFeatures()
        Returns: The number of features to hash each document to (default: 262144)
        Return type: int

    getOutputCol()
        Returns: The name of the output column (default: [self.uid]_output)
        Return type: str

    getToLowercase()
        Returns: Whether to convert all characters to lowercase before tokenizing (default: true)
        Return type: bool

    getTokenizerGaps()
        Returns: Whether the regex splits on gaps (true) or matches tokens (false) (default: true)
        Return type: bool

    getTokenizerPattern()
        Returns: Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
        Return type: str

    getUseIDF()
        Returns: Whether to scale the term frequencies by IDF (default: true)
        Return type: bool

    getUseStopWordsRemover()
        Returns: Whether to remove stop words from tokenized data (default: false)
        Return type: bool

    setBinary(value)
        Parameters: binary (bool) – If true, all nonnegative word counts are set to 1 (default: false)

    setCaseSensitiveStopWords(value)
        Parameters: caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)

    setDefaultStopWordLanguage(value)
        Parameters: defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to custom to use the stopWords input (default: english)

    setMinDocFreq(value)
        Parameters: minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)

    setMinTokenLength(value)
        Parameters: minTokenLength (int) – Minimum token length, >= 0 (default: 0)

    setNumFeatures(value)
        Parameters: numFeatures (int) – The number of features to hash each document to (default: 262144)

    setOutputCol(value)
        Parameters: outputCol (str) – The name of the output column (default: [self.uid]_output)

    setParams(binary=False, caseSensitiveStopWords=False, defaultStopWordLanguage='english', inputCol=None, minDocFreq=1, minTokenLength=0, nGramLength=2, numFeatures=262144, outputCol=None, stopWords=None, toLowercase=True, tokenizerGaps=True, tokenizerPattern='\s+', useIDF=True, useNGram=False, useStopWordsRemover=False, useTokenizer=True)
        Set the (keyword-only) parameters; accepts the same keyword arguments and defaults as the constructor.

    setToLowercase(value)
        Parameters: toLowercase (bool) – Whether to convert all characters to lowercase before tokenizing (default: true)

    setTokenizerGaps(value)
        Parameters: tokenizerGaps (bool) – Whether the regex splits on gaps (true) or matches tokens (false) (default: true), as illustrated below

    setTokenizerPattern(value)
        Parameters: tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
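The gaps semantics are those of ordinary regex splitting versus matching; a plain-Python sketch of the distinction on a toy string (using the standard re module rather than the transformer itself):

```python
import re

text = "one,  two,three"
# tokenizerGaps=True: the pattern matches the delimiters *between* tokens.
print(re.split(r"[,\s]+", text))    # ['one', 'two', 'three']
# tokenizerGaps=False: the pattern matches the tokens themselves.
print(re.findall(r"[a-z]+", text))  # ['one', 'two', 'three']
```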
    setUseIDF(value)
        Parameters: useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)

    setUseNGram(value)
        Parameters: useNGram (bool) – Whether to enumerate n-grams (default: false)
class mmlspark.featurize.text.TextFeaturizer.TextFeaturizerModel(java_model=None)

    Bases: mmlspark.core.schema.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

    Model fitted by TextFeaturizer. This class is intentionally left empty; all necessary methods are exposed through inheritance.
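A minimal end-to-end sketch, assuming an active Spark session with mmlspark available; the toy corpus, column names, and option choices are illustrative. fit returns a TextFeaturizerModel, whose transform appends the hashed (and, by default, IDF-scaled) feature vector:

```python
from pyspark.sql import SparkSession
from mmlspark.featurize.text.TextFeaturizer import TextFeaturizer

spark = SparkSession.builder.getOrCreate()
corpus = spark.createDataFrame(
    [("hello world",), ("hello spark",), ("goodbye world",)], ["text"]
)

featurizer = TextFeaturizer(
    inputCol="text",
    outputCol="features",
    useStopWordsRemover=True,  # off by default; enabled here for illustration
    numFeatures=1 << 18,       # 262144, the documented default
)

model = featurizer.fit(corpus)            # a TextFeaturizerModel
model.transform(corpus).select("features").show(truncate=False)
```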
Module contents

MicrosoftML is a library of Python classes that interfaces with the Microsoft Scala APIs, using Apache Spark to create distributed machine learning models. It simplifies training and scoring classifiers and regressors, and facilitates the creation of models using the CNTK library, images, and text.