Wolfram Language & System Documentation Center

"Tokens" (Net Encoder)

See Also
- NetEncoder
- NetDecoder
- NetChain
- NetGraph
- TextElement
- Net Encoders
  - Class
  - Characters
  - SubwordTokens
- Net Decoders
  - Tokens
  - Characters
  - Class
  - SubwordTokens
Related Guides
- Neural Networks
Tech Notes
- Neural Networks in the Wolfram Language

NetEncoder["Tokens"]

represents an encoder that converts the words in a string to a sequence of integer codes using a standard English vocabulary.

NetEncoder[{"Tokens","language"}]

represents an encoder that uses a standard vocabulary for the given language.

NetEncoder[{"Tokens",{token₁,token₂,…}}]

represents an encoder that uses a specified list of tokens as the vocabulary.

NetEncoder[{"Tokens",…,"param"->value}]

represents an encoder in which additional parameters have been specified.

Details

NetEncoder[…][input] applies the encoder to an input to produce an output.
NetEncoder[…][{input₁,input₂,…}] applies the encoder to a list of strings to produce a list of outputs.
The input to the encoder must be a string or a TextElement containing a sequence of strings that represent tokens. If it is a string, it is segmented into tokens using a regular expression determined by the value of "SplitPattern".
The output of the encoder is a sequence of integers between 1 and d+1, where d is the number of tokens in the vocabulary. The integer d+1 is used to signify tokens in the input that do not occur in the vocabulary.
The type of the output NumericArray is the smallest unsigned integer type that can represent all possible output integer values.
An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
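
As a minimal sketch of attaching the encoder to a port, the following builds a small NetChain whose "Input" port accepts raw strings. The three-word vocabulary and the layer sizes are arbitrary choices made for illustration, not values taken from this page, and the sketch assumes that EmbeddingLayer can infer its number of classes from the attached encoder.

```wolfram
(* Hypothetical vocabulary, chosen only for illustration *)
enc = NetEncoder[{"Tokens", {"rock", "paper", "scissors"}}];

(* Attach the encoder to the "Input" port when constructing the net *)
net = NetChain[
   {EmbeddingLayer[8], LongShortTermMemoryLayer[4], SequenceLastLayer[]},
   "Input" -> enc
];

(* Initialize random weights, then apply the net directly to a string *)
net = NetInitialize[net];
net["rock paper rock"]
```

With the encoder attached, tokenization happens inside the net, so the net can be applied to strings without a separate preprocessing step.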

Parameters

The following parameters can be specified:

"IgnoreCase"	True	whether to ignore case when matching tokens from the string
"SplitPattern"		the string pattern to use in order to split the input string into tokens
"TargetLength"	All	the length of the final sequence to crop or pad to

With the parameter "IgnoreCase"->True, tokens are effectively converted to lowercase before encoding.
With the parameter "TargetLength"->All, all tokens found in the input string are encoded.
With the parameter "TargetLength"->n, the first n tokens found in the input string are encoded, with padding applied if fewer than n tokens are found. The padding value is d+1, where d is the number of tokens in the vocabulary.
With the parameter "SplitPattern"->None, the input to the encoder is assumed to be a pre-tokenized list of strings of the form {"token₁","token₂",…}.
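
The interaction between the out-of-vocabulary code and the padding value can be sketched with a small custom vocabulary. The vocabulary and the input strings here are hypothetical, chosen only so that d is easy to read off.

```wolfram
(* Hypothetical vocabulary with d = 3 tokens, so d + 1 = 4 *)
vocab = {"rock", "paper", "scissors"};
enc = NetEncoder[{"Tokens", vocab, "TargetLength" -> 5}];

(* "lizard" is not in the vocabulary, and only three tokens are present,
   so two padding slots are appended *)
enc["rock lizard paper"]

(* Seven tokens are present, so only the first five are encoded *)
enc["rock paper scissors rock paper scissors rock"]
```

By the rules stated above, the unknown word and both padding positions in the first result should carry the same code, d+1 = 4, so the codes alone do not distinguish padding from unknown tokens.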

Examples


Basic Examples (1)

Create a token encoder for English text:

Wolfram Language code: enc = NetEncoder["Tokens"]

Encode an English sentence:

Wolfram Language code: enc["hello world"]

Out-of-vocabulary words are encoded as the maximum code:

Wolfram Language code: enc["orbus tertius"]

By default, words are detected using a simple regular expression:

Wolfram Language code: enc["A matter-of-fact"]

The list of words can be explicitly passed using TextElement:

Wolfram Language code: enc[TextElement[{"A", "matter-of-fact"}]]

Scope (6)

Use the default token encoder to encode a sentence:

Wolfram Language code: enc = NetEncoder["Tokens"]

Wolfram Language code: enc["hello world"]

Give a specific list of tokens:

Wolfram Language code: enc = NetEncoder[{"Tokens", {"rock", "paper", "scissors"}}]

Wolfram Language code: enc["rock rock paper rock scissors"]

Give a specific list of tokens, including a split pattern:

Wolfram Language code: enc = NetEncoder[{"Tokens", {"A", "B", "C"}, "SplitPattern" -> ","}]

Wolfram Language code: enc["A,B,C,B,A"]

Specify that the sequence should be padded or trimmed to be 4 elements long:

Wolfram Language code: enc = NetEncoder[{"Tokens", {"rock", "paper", "scissors"}, "TargetLength" -> 4}]

Wolfram Language code: enc["rock rock paper rock rock rock"]

Wolfram Language code: enc["paper"]

Use a built-in dictionary for a specific language:

Wolfram Language code: enc = NetEncoder[{"Tokens", "French"}]

Wolfram Language code: enc["Bonjour le monde"]

Use a custom tokenization with TextElement:

Wolfram Language code: tokens = TextElement@TextCases["As a matter-of-fact, my mother-in-law is in N.Y.C", "Word" | "Punctuation"]

Wolfram Language code: NetEncoder["Tokens"][tokens]

Use the output of TextStructure to compute a list of token indices:

Wolfram Language code: partOfSpeech = TextStructure["As a matter-of-fact, my mother-in-law is in N.Y.C", "PartsOfSpeech"]

Wolfram Language code: NetEncoder["Tokens"][partOfSpeech]

A tree structure gets flattened:

Wolfram Language code: constituentTree = TextStructure["As a matter-of-fact, my mother-in-law is in N.Y.C"]

Wolfram Language code: NetEncoder["Tokens"][constituentTree]

Parameters (3)

"IgnoreCase" (1)

An encoder with "IgnoreCase"->True treats tokens that differ only by the case of their constituent characters as equivalent:

Wolfram Language code: NetEncoder[{"Tokens", "English", "IgnoreCase" -> True}]["Hello hello"]

An encoder with "IgnoreCase"->False does not do this:

Wolfram Language code: NetEncoder[{"Tokens", "English", "IgnoreCase" -> False}]["Hello hello"]

"SplitPattern" (2)

Create an encoder that isolates digit characters, using "SplitPattern":

Wolfram Language code: enc = NetEncoder[{"Tokens", "English", "SplitPattern" -> {WordBoundary, x : DigitCharacter :> x}}]

The encoder outputs one token for each digit character:

Wolfram Language code: enc["hello world 0123"]

It is different from the default behavior, which gathers all consecutive digit characters together:

Wolfram Language code: NetEncoder["Tokens"]["hello world 0123"]

Create an encoder with "SplitPattern"->None and two tokens:

Wolfram Language code: enc = NetEncoder[{"Tokens", {"hello", "world"}, "SplitPattern" -> None}]

The encoder now expects a list of tokens as input:

Wolfram Language code: enc["hello"]

Wolfram Language code: enc[{"hello", "world"}]

The encoder still maps across a batch of examples:

Wolfram Language code: enc[{{"hello", "world"}, {"hello", "hello", "world"}}]
