public interface BagOfWordsTransform
A bag of words transform takes a list of words
and converts it to a vector whose length is
the number of vocab words.
The vocab words are generally determined by what is passed to the transform via its constructor.
To build a vocab in NLP, you crawl a corpus with a tokenizer, tracking word frequencies as you go.
Any words above a specified frequency are added to an ordered list.
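The vocab-building step described above can be sketched roughly as follows. This is only an illustration, not the library's implementation: the class name VocabBuilderSketch, the method buildVocab, and the minFrequency parameter are hypothetical, and lower-casing plus whitespace splitting stands in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of this interface.
final class VocabBuilderSketch {

    // Counts tokens across the corpus and keeps those whose count is strictly
    // above minFrequency, in the order they were first seen.
    // Assumption: lower-casing and whitespace splitting replace a real tokenizer.
    public static List<String> buildVocab(List<String> corpus, int minFrequency) {
        // LinkedHashMap preserves first-seen order, so the resulting list is ordered.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String document : corpus) {
            for (String token : document.toLowerCase().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }

        List<String> vocab = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > minFrequency) {
                vocab.add(entry.getKey());
            }
        }
        return vocab;
    }
}
```

For the corpus "the cat sat", "the cat ran", "a dog ran" with minFrequency set to 1, this sketch yields the ordered list [the, cat, ran], because only those three words occur more than once.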
When using this ordered list in NLP pipelines (at least for bag of words),
you perform a lookup for each word in a string (as determined by a tokenizer)
and fill in the appropriate weight (generally a word count or TF-IDF weight)
to represent the word at a particular column.
The column is determined by the word's position in the ordered list of words.
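As a rough sketch of that lookup, the following example fills a count vector from a list of tokens and an ordered vocab list. The names BagOfWordsSketch and toCountVector are hypothetical and are not methods of this interface; raw word counts are used as the weight, and tokens missing from the vocab are assumed to be skipped.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of this interface.
final class BagOfWordsSketch {

    // Returns a vector with one column per vocab word, where each column holds
    // the number of times that word appears in tokens.
    public static double[] toCountVector(List<String> tokens, List<String> vocab) {
        // Map each vocab word to its column, i.e. its position in the ordered list.
        Map<String, Integer> columnOf = new HashMap<>();
        for (int i = 0; i < vocab.size(); i++) {
            columnOf.put(vocab.get(i), i);
        }

        double[] vector = new double[vocab.size()];
        for (String token : tokens) {
            Integer column = columnOf.get(token);
            // Assumption: out-of-vocab tokens are ignored rather than reported.
            if (column != null) {
                vector[column] += 1.0;
            }
        }
        return vector;
    }
}
```

With the vocab [the, cat, ran] and the tokens [the, cat, saw, the, dog], this sketch produces the vector [2.0, 1.0, 0.0]. A TF-IDF variant would replace the raw count with a weighted value at the same column.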
The vocab words in the transform.
These are the words that were accumulated
when building a vocabulary.
(This is generally associated with some form of
minimum word frequency scanning to build a vocab,
which you then map onto a list of vocab words.)