"Tokens" (Net Encoder)
NetEncoder["Tokens"]
represents an encoder that converts the words in a string to a sequence of integer codes using a standard English vocabulary.
NetEncoder[{"Tokens","language"}]
represents an encoder that uses a standard vocabulary for the given language.
NetEncoder[{"Tokens",{token1,token2,…}}]
represents an encoder that uses a specified list of tokens as the vocabulary.
NetEncoder[{"Tokens",…,"param"->value}]
represents an encoder in which additional parameters have been specified.
Details
- NetEncoder[…][input] applies the encoder to an input to produce an output.
- NetEncoder[…][{input1,input2,…}] applies the encoder to a list of strings to produce a list of outputs.
- The input to the encoder must be a string or a TextElement containing a sequence of strings that represent tokens. If it is a string, it is segmented into tokens using a regular expression based on the value of "SplitPattern".
- The output of the encoder is a sequence of integers between 1 and d+1, where d is the number of tokens in the vocabulary. The integer d+1 signifies tokens in the input that do not occur in the vocabulary.
- The type of the output NumericArray is the smallest unsigned integer type that can represent all possible output integer values.
- An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
- The following parameters can be specified:
| Parameter | Default | Description |
| --- | --- | --- |
| "IgnoreCase" | True | whether to ignore case when matching tokens from the string |
| "SplitPattern" | WordBoundary | the string pattern to use in order to split the input string into tokens |
| "TargetLength" | All | the length of the final sequence to crop or pad to |

- With the parameter "IgnoreCase"->True, tokens are effectively converted to lowercase before encoding.
- With the parameter "TargetLength"->All, all tokens found in the input string are encoded.
- With the parameter "TargetLength"->n, the first n tokens found in the input string are encoded, with padding applied if fewer than n tokens are found. The padding value is d+1, where d is the number of tokens in the vocabulary.
- With the parameter "SplitPattern"->None, the input to the encoder is assumed to be a pre-tokenized list of strings of the form {"token1","token2",…}.
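The rules above can be tried out with a small, hand-picked vocabulary. The following is a minimal sketch, not an official example: the three-token vocabulary, the word "lizard", and the layer sizes are assumptions chosen only for illustration. With d = 3, the Details suggest that in-vocabulary tokens map to codes 1 through 3, while an unknown token and any padding both use code 4.

```wolfram
(* Hypothetical three-token vocabulary, so d = 3 *)
vocab = {"rock", "paper", "scissors"};

(* "lizard" is not in the vocabulary, so it should receive the out-of-vocabulary code d+1 *)
enc = NetEncoder[{"Tokens", vocab}];
enc["rock lizard scissors"]

(* Crop or pad every encoded sequence to exactly 4 codes *)
encFixed = NetEncoder[{"Tokens", vocab, "TargetLength" -> 4}];
encFixed["paper"]                            (* fewer than 4 tokens: padded *)
encFixed["rock rock paper rock rock rock"]   (* more than 4 tokens: cropped *)

(* Attach the encoder to the input port of a small net; layer sizes are arbitrary *)
net = NetChain[
  {EmbeddingLayer[8], SequenceLastLayer[], LinearLayer[3], SoftmaxLayer[]},
  "Input" -> enc
]
```

Here the EmbeddingLayer is left to infer its number of input classes from the attached encoder, which is assumed to work as it does for other index-producing encoders. Fixing "TargetLength" is mainly useful when the downstream layers need sequences of a known length.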
Parameters
Examples
Basic Examples (1)
Create a token encoder for English text:
enc = NetEncoder["Tokens"]
enc["hello world"]
Out-of-vocabulary words are encoded as the maximum code:
enc["orbus tertius"]
By default, words are detected using a simple regular expression:
enc["A matter-of-fact"]
The list of words can be explicitly passed using TextElement:
enc[TextElement[{"A", "matter-of-fact"}]]
Scope (6)
Use the default token encoder to encode a sentence:
enc = NetEncoder["Tokens"]
enc["hello world"]
Give a specific list of tokens:
enc = NetEncoder[{"Tokens", {"rock", "paper", "scissors"}}]
enc["rock rock paper rock scissors"]
Give a specific list of tokens, including a split pattern:
enc = NetEncoder[{"Tokens", {"A", "B", "C"}, ","}]
enc["A,B,C,B,A"]
Specify that the sequence should be padded or trimmed to be 4 elements long:
enc = NetEncoder[{"Tokens", {"rock", "paper", "scissors"}, "TargetLength" -> 4}]
enc["rock rock paper rock rock rock"]
enc["paper"]
Use a built-in dictionary for a specific language:
enc = NetEncoder[{"Tokens", "French"}]
enc["Bonjour le monde"]
Use a custom tokenization with TextElement:
tokens = TextElement@TextCases["As a matter-of-fact, my mother-in-law is in N.Y.C", "Word" | "Punctuation"]
NetEncoder["Tokens"][tokens]
Use the output of TextStructure to compute a list of token indices:
partOfSpeech = TextStructure["As a matter-of-fact, my mother-in-law is in N.Y.C", "PartsOfSpeech"]
NetEncoder["Tokens"][partOfSpeech]
A tree structure gets flattened:
constituentTree = TextStructure["As a matter-of-fact, my mother-in-law is in N.Y.C"]
NetEncoder["Tokens"][constituentTree]
Parameters (3)
"IgnoreCase" (1)
An encoder with "IgnoreCase"->True treats tokens that differ only by the case of their constituent characters as equivalent:
NetEncoder[{"Tokens", "English", "IgnoreCase" -> True}]["Hello hello"]
An encoder with "IgnoreCase"->False does not do this:
NetEncoder[{"Tokens", "English", "IgnoreCase" -> False}]["Hello hello"]
"SplitPattern" (2)
Create an encoder that isolates digit characters, using "SplitPattern":
enc = NetEncoder[{"Tokens", "English", "SplitPattern" -> {WordBoundary, x : DigitCharacter :> x}}]
The encoder outputs one token for each digit character:
enc["hello world 0123"]
It is different from the default behavior, which gathers all consecutive digit characters together:
NetEncoder["Tokens"]["hello world 0123"]
Create an encoder with "SplitPattern"->None and two tokens:
enc = NetEncoder[{"Tokens", {"hello", "world"}, "SplitPattern" -> None}]
The encoder now expects a list of tokens as input:
enc["hello"]
enc[{"hello", "world"}]
The encoder still maps across a batch of examples:
enc[{{"hello", "world"}, {"hello", "hello", "world"}}]
See Also
NetEncoder NetDecoder NetChain NetGraph TextElement
Net Encoders: Class Characters SubwordTokens
Net Decoders: Tokens Characters Class SubwordTokens