Wolfram Language & System Documentation Center

ArrowDataset

See Also
- Import
- Export
- CloudExport
- CloudImport
- Formats
- Parquet
- ORC
- ArrowIPC
- CSV

Import and Export support ArrowDataset for the "Parquet", "ArrowIPC", "ORC", "CSV", and "TSV" formats.

Background & Context

- Efficient multi-file, column-oriented data format.
- Developed by the Apache Software Foundation.

Import & Export

Import["dir","ArrowDataset"] imports an ArrowDataset directory as a Tabular object.
Import["dir",{"ArrowDataset",elem,…}] imports the specified elements.
Import["dir",{"ArrowDataset",elem,subelem₁,…}] imports subelements subelem_i, useful for partial data import.
Export["dir",expr,"ArrowDataset"] creates an ArrowDataset directory from expr.
Supported expressions expr include:

	{v₁,v₂,…}	a single column of data
	{{v₁₁,v₁₂,…},{v₂₁,v₂₂,…},…}	lists of rows of data
	array	an array such as SparseArray, QuantityArray, etc.
	tseries	a TimeSeries, EventSeries or a TemporalData object
	dataset	a Dataset or a Tabular object

See the following reference pages for full general information:

	Import, Export	import from or export to a file
	CloudImport, CloudExport	import from or export to a cloud object
	ImportString, ExportString	import from or export to a string
	ImportByteArray, ExportByteArray	import from or export to a byte array

Import Elements

General Import elements:

	"Elements"	list of elements and options available in this file
	"Summary"	summary of the file
	"Rules"	list of rules for all available elements

Data representation elements:

	"Data"	two-dimensional array
	"Dataset"	table data as a Dataset
	"EventSeries"	table data as an EventSeries
	"Tabular"	a Tabular object
	"TimeSeries"	table data as a TimeSeries

Additional elements can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed element descriptions.
Import by default uses the "Tabular" element.
Subelements for partial data import for the "Tabular" element can take row and column specifications in the form {"Tabular",rows,cols}, where rows and cols can be any of the following:

	n	n^th row or column
	-n	counts from the end
	n;;m	from n through m
	n;;m;;s	from n through m with steps of s
	{n₁,n₂,…}	specific rows or columns n_i

Column specifications can also be any of the following:

	"col"	single column "col"
	{col₁,col₂,…}	list of column names col_i

Data descriptor elements:

	"ColumnLabels"	names of columns
	"ColumnTypes"	association with data type for each column
	"Schema"	TabularSchema object

Options

General Import options:

"Format"	Automatic	underlying format to use
"Partitioning"	None	partitioning scheme
"TimeColumn"	Automatic	column to use for times in "EventSeries" and "TimeSeries" elements

General Export options:

"Format"	"Parquet"	underlying format to use
"MaxPartitions"	4096	maximal number of partitions
"MaxRowsPerFile"	Infinity	maximal number of rows per file
"NameTemplate"	"part{i}"	file name template
"Partitioning"	"Hive"	partitioning scheme
"SplitColumns"	Automatic	columns used for partitioning

Import supports the following settings for "Partitioning":

	None	no partitioning
	"Hive"	Hive partitioning
	{col₁,col₂,…}	directory partitioning with partition keys
	{"Directory", {col₁,col₂,…}}	directory partitioning with partition keys

Export supports the following settings for "Partitioning":

	"Directory"	directory partitioning
	"Hive"	Hive partitioning

Additional options can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed option descriptions.

Examples

Basic Examples (2)

Export an Arrow dataset:

Wolfram Language code: tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];

Wolfram Language code: tabular["ColumnKeys"]

Wolfram Language code: Export["tabular_dataset", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", OverwriteTarget -> True]

Import an Arrow dataset:

Wolfram Language code: Import["tabular_dataset", "ArrowDataset", "Partitioning" -> "Hive"]

Scope (3)

Import (3)

Show all elements available in the file:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Elements"}]

By default, a Tabular object is returned:

Wolfram Language code: Import["tabular_dataset", "ArrowDataset", "Partitioning" -> "Hive"]//TabularQ

Import column types:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "ColumnTypes"}, "Partitioning" -> "Hive"]

Import Elements (19)

"ColumnCount" (1)

Get the number of columns:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "ColumnCount"}, "Partitioning" -> "Hive"]

"ColumnLabels" (1)

Read column names:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "ColumnLabels"}, "Partitioning" -> "Hive"]

"ColumnTypes" (1)

Import column types:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "ColumnTypes"}, "Partitioning" -> "Hive"]

"Data" (3)

Get the data from a file:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Data"}, "Partitioning" -> "Hive"]//Shallow

Import only selected rows:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Data", 1 ;; 3}, "Partitioning" -> "Hive"]

Import only selected columns:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Data", All, {1, 3}}, "Partitioning" -> "Hive"]//Shallow

Import only selected columns using column names:

Wolfram Language code:

Import["tabular_dataset", {"ArrowDataset", "Data", All, {"Tree Species", "Diameter at Breast Height (in Feet)"}}, "Partitioning" -> "Hive"]//Shallow

"Dataset" (3)

Get the data as a Dataset:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Dataset"}, "Partitioning" -> "Hive"]

Import only selected rows:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Dataset", 1 ;; 3}, "Partitioning" -> "Hive"]

Import only selected columns:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Dataset", All, {1, 3}}, "Partitioning" -> "Hive"]

Import only selected columns using column names:

Wolfram Language code:

Import["tabular_dataset", {"ArrowDataset", "Dataset", All, {"Tree Species", "Diameter at Breast Height (in Feet)"}}, "Partitioning" -> "Hive"]

"Dimensions" (1)

Import data dimensions:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Dimensions"}, "Partitioning" -> "Hive"]

"EventSeries" (1)

Export a Tabular object as an ArrowDataset:

Wolfram Language code:

Export["sales_dataset", ResourceData["Sample Tabular Data: Sales Data"], "ArrowDataset", "SplitColumns" -> "Product", OverwriteTarget -> True]

Import an ArrowDataset as an EventSeries:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "EventSeries"}, "Partitioning" -> "Hive"]

Import a single row from an ArrowDataset:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "EventSeries", 5}, "Partitioning" -> "Hive"]

Import some specific rows from an ArrowDataset:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "EventSeries", {1, 5, 7}}, "Partitioning" -> "Hive"]

Import the first 10 rows of an ArrowDataset:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "EventSeries", 1 ;; 10}, "Partitioning" -> "Hive"]

Import only selected columns using column names:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "EventSeries", All, {"Product", "Date", "Quantity"}}, "Partitioning" -> "Hive"]

"MetaInformation" (1)

Import metadata:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "MetaInformation"}, "Partitioning" -> "Hive"]

"RowCount" (1)

Get the number of rows:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "RowCount"}, "Partitioning" -> "Hive"]

"Schema" (1)

Get the TabularSchema object:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Schema"}, "Partitioning" -> "Hive"]

"Summary" (1)

Get the file summary:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Summary"}, "Partitioning" -> "Hive"]

"Tabular" (3)

Get the data from a file as a Tabular object:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Tabular"}, "Partitioning" -> "Hive"]

Import only selected rows:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Tabular", 1 ;; 5}, "Partitioning" -> "Hive"]

Import only selected columns:

Wolfram Language code: Import["tabular_dataset", {"ArrowDataset", "Tabular", All, {1, 3}}, "Partitioning" -> "Hive"]

Import only selected columns using column names:

Wolfram Language code:

Import["tabular_dataset", {"ArrowDataset", "Tabular", All, {"Tree Species", "Diameter at Breast Height (in Feet)"}}, "Partitioning" -> "Hive"]
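
The row and column specifications listed under Import Elements also allow stepped spans, negative indices and a single column name. The following is a minimal sketch of those forms, assuming the "tabular_dataset" directory created in the Basic Examples still exists and that the column names shown there are unchanged:

```wolfram
(* rows 1 through 20 in steps of 5, restricted to two named columns *)
Import["tabular_dataset", {"ArrowDataset", "Tabular", 1 ;; 20 ;; 5, {"Tree Species", "Number of Trunks"}}, "Partitioning" -> "Hive"]

(* the last row only, with a single column selected by name *)
Import["tabular_dataset", {"ArrowDataset", "Tabular", -1, "Tree Species"}, "Partitioning" -> "Hive"]
```

"Number of Trunks" is the split column here, so it is requested together with "Partitioning" -> "Hive".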

"TimeSeries" (1)

Export a Tabular object as an ArrowDataset:

Wolfram Language code:

Export["sales_dataset", ResourceData["Sample Tabular Data: Sales Data"], "ArrowDataset", "SplitColumns" -> "Product", OverwriteTarget -> True]

Import an ArrowDataset as a TimeSeries:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "TimeSeries"}, "Partitioning" -> "Hive"]

Import a single row from an ArrowDataset:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "TimeSeries", 5}, "Partitioning" -> "Hive"]

Import some specific rows from an ArrowDataset:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "TimeSeries", {1, 5, 7}}, "Partitioning" -> "Hive"]

Import the first 10 rows of an ArrowDataset:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "TimeSeries", 1 ;; 10}, "Partitioning" -> "Hive"]

Import only selected columns using column names:

Wolfram Language code: Import["sales_dataset", {"ArrowDataset", "TimeSeries", All, {"Product", "Date", "Quantity"}}, "Partitioning" -> "Hive"]

Import Options (3)

"Format" (1)

By default, the format of an ArrowDataset is inferred from the files stored in the input directory:

Wolfram Language code: FileSystemMap[FileSize, "tabular_dataset", Infinity]

Use the "Format" option to specify the underlying format:

Wolfram Language code: Import["tabular_dataset", "ArrowDataset", "Format" -> "Parquet", "Partitioning" -> "Hive"]

"Partitioning" (1)

By default, "Partitioning"None is used. Notice that the column used for partitioning is not imported:

Wolfram Language code: Import["tabular_dataset", "ArrowDataset"]//Dimensions

Use the "Partitioning" option with the correct setting to get all columns:

Wolfram Language code: Import["tabular_dataset", "ArrowDataset", "Partitioning" -> "Hive"]//Dimensions
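
One way to check that the partition column has been recovered is to compare the imported result with the original table. This is only a sketch: the names roundTrip and tabular are illustrative, the export repeats the one from the Basic Examples, and the column keys are sorted on the assumption that the partition column may come back in a different position:

```wolfram
(* re-create the dataset so the check does not depend on earlier state *)
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
Export["tabular_dataset", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", OverwriteTarget -> True];

(* import with the partitioning scheme that Export used by default *)
roundTrip = Import["tabular_dataset", "ArrowDataset", "Partitioning" -> "Hive"];

(* compare the shape and the set of column keys with the original *)
Dimensions[roundTrip] === Dimensions[tabular]
Sort[roundTrip["ColumnKeys"]] === Sort[tabular["ColumnKeys"]]
```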

"TimeColumn" (1)

Export a Tabular object to an ArrowDataset:

Wolfram Language code:

Export["date_dataset", Tabular[Association["RawSchema" -> Association["ColumnProperties" -> 
     Association["Date" -> Association["ElementType" -> TypeSpecifier["Date"]["Integer32", "Day", 
          "Gregorian", None]], "Value" -> Association["ElementType" -> "Real32"]], 
    "KeyColumns" -> None, "Backend" -> "WolframKernel"], "Options" -> {}, 
  "BackendData" -> Association["ColumnData" -> DataStructure["ColumnTable", 
      {{TabularColumn[Association["Data" -> {5, {{NumericArray[{13150, 13151, 13152, 13153, 13156}, 
               "Integer32"], {}, None}}, None}, "ElementType" -> "Date"["Integer32", "Day", 
            "Gregorian", None]]], TabularColumn[Association[
          "Data" -> {NumericArray[{11.819999694824219, 12.039999961853027, 12.09000015258789, 
             11.880000114440918, 12.430000305175781}, "Real32"], {}, None}, 
          "ElementType" -> "Real32"]]}}]]]], "ArrowDataset", "SplitColumns" -> "Date", OverwriteTarget -> True]

By default, the time column is selected automatically for "TimeSeries" and "EventSeries" elements:

Wolfram Language code: Import["date_dataset", {"ArrowDataset", "TimeSeries"}, "Partitioning" -> "Hive"]

Use the "TimeColumn" option to specify the time column:

Wolfram Language code: Import["date_dataset", {"ArrowDataset", "TimeSeries"}, "Partitioning" -> "Hive", "TimeColumn" -> "Value"]

Export Options (6)

"Format" (1)

By default, Export uses the "Parquet" format:

Wolfram Language code: FileSystemMap[FileSize, "tabular_dataset", Infinity]

Use the "ArrowIPC" format:

Wolfram Language code: tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];

Wolfram Language code:

Export["tabular_dataset_arrowipc", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "Format" -> "ArrowIPC"]

Wolfram Language code: FileSystemMap[FileSize, "tabular_dataset_arrowipc", Infinity]

"MaxPartitions" (1)

Export fails when the number of unique values in the split column exceeds the "MaxPartitions" setting (4096 by default):

Wolfram Language code: tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];

Wolfram Language code: Export["tabular_dataset_location", tabular, "ArrowDataset", "SplitColumns" -> "Location"]

Increase the allowed number of partitions:

Wolfram Language code:

Export["tabular_dataset_location", tabular, "ArrowDataset", "SplitColumns" -> "Location", "MaxPartitions" -> 27000, OverwriteTarget -> True]

"MaxRowsPerFile" (1)

By default, the number of rows per file is unlimited:

Wolfram Language code: tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];

Wolfram Language code: Export["tabular_dataset", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", OverwriteTarget -> True]

Wolfram Language code: FileNames["*.*", "tabular_dataset", Infinity]//Length

Limit the number of rows per file:

Wolfram Language code:

Export["tabular_dataset_limit", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "MaxRowsPerFile" -> 800, OverwriteTarget -> True]

Wolfram Language code: FileNames["*.*", "tabular_dataset_limit", Infinity]//Length

"NameTemplate" (1)

By default, "part{i}" is used as the name template for ArrowDataset files:

Wolfram Language code: tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];

Wolfram Language code:

Export["tabular_dataset_limit", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "MaxRowsPerFile" -> 5000, OverwriteTarget -> True]

Wolfram Language code: FileSystemMap[FileSize, "tabular_dataset_limit", Infinity]

Use a different name template:

Wolfram Language code:

Export["tabular_dataset_myname", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "MaxRowsPerFile" -> 5000, "NameTemplate" -> "myname{i}", OverwriteTarget -> True]

Wolfram Language code: FileSystemMap[FileSize, "tabular_dataset_myname", Infinity]

"Partitioning" (1)

By default, Export uses "Hive" partitioning:

Wolfram Language code: FileSystemMap[FileSize, "tabular_dataset", Infinity]

Use "Directory" partitioning:

Wolfram Language code: tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];

Wolfram Language code:

Export["tabular_dataset_directory", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "Partitioning" -> "Directory"]

Wolfram Language code: FileSystemMap[FileSize, "tabular_dataset_directory", Infinity]
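
To read a directory-partitioned dataset back with its partition column, the "Partitioning" settings listed for Import suggest passing the partition keys explicitly. The following is a sketch under that reading, assuming the export above succeeded and that the partition key carries the same name as the split column:

```wolfram
(* directory partitioning given as a list of partition keys *)
Import["tabular_dataset_directory", "ArrowDataset", "Partitioning" -> {"Number of Trunks"}]

(* the same request written in the explicit {"Directory", keys} form *)
Import["tabular_dataset_directory", "ArrowDataset", "Partitioning" -> {"Directory", {"Number of Trunks"}}]
```

Directory partitioning does not appear to record key names in the folder names, which is presumably why Import takes them as part of the setting.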

"SplitColumns" (1)

Export requires the "SplitColumns" option:

Wolfram Language code: tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];

Wolfram Language code: Export["tabular_dataset", tabular, "ArrowDataset", OverwriteTarget -> True]

Only column keys of the Tabular object are valid values for the "SplitColumns" option:

Wolfram Language code: tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];

Wolfram Language code: tabular["ColumnKeys"]

Wolfram Language code: Export["tabular_dataset_split", tabular, "ArrowDataset", "SplitColumns" -> {"a", "b"}, OverwriteTarget -> True]

Wolfram Language code:

Export["tabular_dataset_split", tabular, "ArrowDataset", "SplitColumns" -> {"Number of Trunks", "Location Type"}, OverwriteTarget -> True]

Possible Issues (1)

Export requires the "SplitColumns" option:

Wolfram Language code: tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];

Wolfram Language code: Export["tabular_dataset", tabular, "ArrowDataset", OverwriteTarget -> True]

Specify a split column to complete the export:

Wolfram Language code: Export["tabular_dataset", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", OverwriteTarget -> True]
