used as feature name prefix.
Featurize a single row.
input row.
output indices.
output values.
This interface isn't very Scala-esque, but it avoids a lot of allocation. Also, due to SparseVector limitations, 64-bit indices are not supported (i.e. indices are signed 32-bit ints).
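A minimal sketch of the builder-passing style described above, assuming a hypothetical `RowFeaturizer` trait and a toy `DenseArrayFeaturizer` (both names are illustrative, not taken from the source). The caller owns the two builders and reuses them across fields, so no intermediate collection is allocated per feature.

```scala
import scala.collection.mutable

// Hypothetical trait, named for illustration only.
trait RowFeaturizer[T] {
  // Appends (index, value) pairs for one row to the caller-supplied builders.
  def featurize(row: T,
                indices: mutable.ArrayBuilder[Int],
                values: mutable.ArrayBuilder[Double]): Unit
}

// Toy featurizer (assumption): emits one feature per non-zero element,
// using the element position as the signed 32-bit index.
object DenseArrayFeaturizer extends RowFeaturizer[Array[Double]] {
  override def featurize(row: Array[Double],
                         indices: mutable.ArrayBuilder[Int],
                         values: mutable.ArrayBuilder[Double]): Unit = {
    var i = 0
    while (i < row.length) {
      if (row(i) != 0.0) {
        indices += i
        values += row(i)
      }
      i += 1
    }
  }
}

object FeaturizeDemo {
  // Featurizes one row and returns the parallel index/value arrays.
  def run(): (Array[Int], Array[Double]) = {
    val indices = new mutable.ArrayBuilder.ofInt
    val values = new mutable.ArrayBuilder.ofDouble
    DenseArrayFeaturizer.featurize(Array(0.0, 2.5, 0.0, 1.0), indices, values)
    (indices.result(), values.result())
  }
}
```

The two resulting arrays are parallel, which is the shape a sparse vector constructor typically expects.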
input field index.
Initialize a hasher with the column prefix pre-hashed.
bit mask applied to final hash.
pre-hashed namespace.
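A rough sketch of how a pre-hashed prefix and a bit mask might fit together. Everything here is an assumption for illustration: the `PrefixHasher` name, the `numBits` parameter, and the use of `scala.util.hashing.MurmurHash3` as a stand-in hash. It chains seeds instead of resuming an incremental hash state, so it will not reproduce VW's actual hash values.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper; names and hash choice are assumptions, not the source's API.
final class PrefixHasher(columnName: String, namespace: String, numBits: Int) {
  require(numBits >= 1 && numBits <= 30, "numBits must be in [1, 30]")

  // Bit mask applied to the final hash: keeps only the low numBits bits,
  // so every index is a non-negative signed 32-bit int.
  val mask: Int = (1 << numBits) - 1

  // Hash the namespace once.
  val namespaceHash: Int = MurmurHash3.stringHash(namespace)

  // Hash the column prefix once, seeded by the namespace hash.
  private val prefixHash: Int = MurmurHash3.stringHash(columnName, namespaceHash)

  // Per-feature work is a single hash seeded by the precomputed prefix hash.
  def index(featureValue: String): Int =
    MurmurHash3.stringHash(featureValue, prefixHash) & mask
}
```

For example, `new PrefixHasher("color", "a", 18).index("red")` yields an index in `[0, 2^18)`, and the prefix is hashed only once per column instead of once per row.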
(?U) makes \w Unicode-aware; see https://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions. We could follow the token pattern used by https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html, but that strips single-character words.
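To illustrate the effect of the `(?U)` flag, here is a small sketch comparing `\w+` with and without it. The `TokenizeDemo` object and its member names are hypothetical; only `java.util.regex.Pattern` and its embedded `(?U)` flag (UNICODE_CHARACTER_CLASS) come from the JDK.

```scala
import java.util.regex.Pattern

// Hypothetical demo object, for illustration only.
object TokenizeDemo {
  // Without (?U), \w matches only [a-zA-Z_0-9].
  val asciiWord: Pattern = Pattern.compile("\\w+")

  // With (?U), \w also matches non-ASCII letters and digits.
  val unicodeWord: Pattern = Pattern.compile("(?U)\\w+")

  // Collects every match of the pattern in the input, in order.
  def tokens(pattern: Pattern, text: String): List[String] = {
    val m = pattern.matcher(text)
    val out = List.newBuilder[String]
    while (m.find()) out += m.group()
    out.result()
  }
}
```

With the input `"naïve café a"`, the Unicode-aware pattern yields `naïve`, `café`, `a`, while the ASCII-only pattern breaks words at the accented letters and yields `na`, `ve`, `caf`, `a`. Single-character words such as `a` are kept in both cases.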
TODO: expose as user configurable parameter
Featurize strings by splitting them into the native VW structure: (hash(s(0)):value, hash(s(1)):value, ...)
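The splitting step could look roughly like the sketch below. It is not the actual implementation: `StringSplitSketch`, the `seed` and `mask` parameters, the fixed value of `1.0` per token, and the use of `scala.util.hashing.MurmurHash3` are all assumptions made for illustration.

```scala
import java.util.regex.Pattern
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Hypothetical sketch; names, hash function, and the 1.0 value are assumptions.
object StringSplitSketch {
  // Unicode-aware word pattern that keeps single-character words.
  private val wordPattern: Pattern = Pattern.compile("(?U)\\w+")

  // Appends one (hash(token) & mask, 1.0) pair per token to the builders.
  def featurize(s: String,
                seed: Int,
                mask: Int,
                indices: mutable.ArrayBuilder[Int],
                values: mutable.ArrayBuilder[Double]): Unit = {
    val m = wordPattern.matcher(s)
    while (m.find()) {
      indices += (MurmurHash3.stringHash(m.group(), seed) & mask)
      values += 1.0
    }
  }
}
```

Repeated tokens produce repeated indices in this sketch; whether duplicates are summed or kept as separate entries is left to whatever consumes the arrays.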