ArrowDataset
Background & Context
- Efficient multi-file, column-oriented data format.
- Developed by the Apache Software Foundation.
Import & Export
- Import["dir","ArrowDataset"] imports an ArrowDataset directory as a Tabular object.
- Import["dir",{"ArrowDataset",elem,…}] imports the specified elements.
- Import["dir",{"ArrowDataset",elem,subelem1,…}] imports subelements subelemi, useful for partial data import.
- Export["dir",expr,"ArrowDataset"] creates an ArrowDataset directory from expr.
- Supported expressions expr include:
    {v1,v2,…}                      a single column of data
    {{v11,v12,…},{v21,v22,…},…}    lists of rows of data
    array                          an array such as SparseArray, QuantityArray, etc.
    tseries                        a TimeSeries, EventSeries or a TemporalData object
    dataset                        a Dataset or a Tabular object
- See the following reference pages for full general information:
    Import, Export                      import from or export to a file
    CloudImport, CloudExport            import from or export to a cloud object
    ImportString, ExportString          import from or export to a string
    ImportByteArray, ExportByteArray    import from or export to a byte array
Import Elements
- General Import elements:
    "Elements"    list of elements and options available in this file
    "Summary"     summary of the file
    "Rules"       list of rules for all available elements
- Data representation elements:
    "Data"           two-dimensional array
    "Dataset"        table data as a Dataset
    "EventSeries"    table data as an EventSeries
    "Tabular"        a Tabular object
    "TimeSeries"     table data as a TimeSeries
- Additional elements can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed element descriptions.
- Import by default uses the "Tabular" element.
- Subelements for partial data import for the "Tabular" element can take row and column specifications in the form {"Tabular",rows,cols}, where rows and cols can be any of the following:
    n            nth row or column
    -n           counts from the end
    n;;m         from n through m
    n;;m;;s      from n through m with steps of s
    {n1,n2,…}    specific rows or columns ni
- Column specifications can also be any of the following:
    "col"            single column "col"
    {col1,col2,…}    list of column names coli
- Data descriptor elements:
    "ColumnLabels"    names of columns
    "ColumnTypes"     association with data type for each column
    "Schema"          TabularSchema object
Options
- General Import options:
    "Format"          Automatic    underlying format to use
    "Partitioning"    None         partitioning scheme
    "TimeColumn"      Automatic    column to use for times in "EventSeries" and "TimeSeries" elements
- General Export options:
    "Format"            "Parquet"    underlying format to use
    "MaxPartitions"     4096         maximal number of partitions
    "MaxRowsPerFile"    Infinity     maximal number of rows per file
    "NameTemplate"      "part{i}"    file name template
    "Partitioning"      "Hive"       partitioning scheme
    "SplitColumns"      Automatic    columns used for partitioning
- Import supports the following settings for "Partitioning":
    None                            no partitioning
    "Hive"                          Hive partitioning
    {col1,col2,…}                   directory partitioning with partition keys
    {"Directory", {col1,col2,…}}    directory partitioning with partition keys
- Export supports the following settings for "Partitioning":
    "Directory"    directory partitioning
    "Hive"         Hive partitioning
- Additional options can be specified depending on the "Format" option. See "Parquet", "ArrowIPC", "ORC", "CSV", or "TSV" for detailed options descriptions.
Examples
Basic Examples (2)
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
tabular["ColumnKeys"]
Export["tabular_dataset", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", OverwriteTarget -> True]
Import["tabular_dataset", "ArrowDataset", "Partitioning" -> "Hive"]
Scope (3)
Import (3)
Show all elements available in the file:
Import["tabular_dataset", {"ArrowDataset", "Elements"}]
By default, a Tabular object is returned:
Import["tabular_dataset", "ArrowDataset", "Partitioning" -> "Hive"] // TabularQ
Import["tabular_dataset", {"ArrowDataset", "ColumnTypes"}, "Partitioning" -> "Hive"]
Import Elements (19)
"ColumnCount" (1)
"ColumnLabels" (1)
"ColumnTypes" (1)
"Data" (3)
Import["tabular_dataset", {"ArrowDataset", "Data"}, "Partitioning" -> "Hive"] // Shallow
Import["tabular_dataset", {"ArrowDataset", "Data", 1 ;; 3}, "Partitioning" -> "Hive"]
Import["tabular_dataset", {"ArrowDataset", "Data", All, {1, 3}}, "Partitioning" -> "Hive"] // Shallow
Import only selected columns using column names:
Import["tabular_dataset", {"ArrowDataset", "Data", All, {"Tree Species", "Diameter at Breast Height (in Feet)"}}, "Partitioning" -> "Hive"] // Shallow
"Dataset" (3)
Get the data as a Dataset:
Import["tabular_dataset", {"ArrowDataset", "Dataset"}, "Partitioning" -> "Hive"]
Import["tabular_dataset", {"ArrowDataset", "Dataset", 1 ;; 3}, "Partitioning" -> "Hive"]
Import["tabular_dataset", {"ArrowDataset", "Dataset", All, {1, 3}}, "Partitioning" -> "Hive"]
Import only selected columns using column names:
Import["tabular_dataset", {"ArrowDataset", "Dataset", All, {"Tree Species", "Diameter at Breast Height (in Feet)"}}, "Partitioning" -> "Hive"]
"Dimensions" (1)
"EventSeries" (1)
Export a Tabular object as an ArrowDataset:
Export["sales_dataset", ResourceData["Sample Tabular Data: Sales Data"], "ArrowDataset", "SplitColumns" -> "Product", OverwriteTarget -> True]
Import an ArrowDataset as an EventSeries:
Import["sales_dataset", {"ArrowDataset", "EventSeries"}, "Partitioning" -> "Hive"]
Import a single row from an ArrowDataset:
Import["sales_dataset", {"ArrowDataset", "EventSeries", 5}, "Partitioning" -> "Hive"]
Import some specific rows from an ArrowDataset:
Import["sales_dataset", {"ArrowDataset", "EventSeries", {1, 5, 7}}, "Partitioning" -> "Hive"]
Import the first 10 rows of an ArrowDataset:
Import["sales_dataset", {"ArrowDataset", "EventSeries", 1 ;; 10}, "Partitioning" -> "Hive"]
Import only selected columns using column names:
Import["sales_dataset", {"ArrowDataset", "EventSeries", All, {"Product", "Date", "Quantity"}}, "Partitioning" -> "Hive"]
"MetaInformation" (1)
"RowCount" (1)
"Schema" (1)
Get the TabularSchema object:
Import["tabular_dataset", {"ArrowDataset", "Schema"}, "Partitioning" -> "Hive"]
"Summary" (1)
"Tabular" (3)
Get the data from a file as a Tabular object:
Import["tabular_dataset", {"ArrowDataset", "Tabular"}, "Partitioning" -> "Hive"]
Import["tabular_dataset", {"ArrowDataset", "Tabular", 1 ;; 5}, "Partitioning" -> "Hive"]
Import["tabular_dataset", {"ArrowDataset", "Tabular", All, {1, 3}}, "Partitioning" -> "Hive"]
Import only selected columns using column names:
Import["tabular_dataset", {"ArrowDataset", "Tabular", All, {"Tree Species", "Diameter at Breast Height (in Feet)"}}, "Partitioning" -> "Hive"]
"TimeSeries" (1)
Export a Tabular object as an ArrowDataset:
Export["sales_dataset", ResourceData["Sample Tabular Data: Sales Data"], "ArrowDataset", "SplitColumns" -> "Product", OverwriteTarget -> True]
Import an ArrowDataset as a TimeSeries:
Import["sales_dataset", {"ArrowDataset", "TimeSeries"}, "Partitioning" -> "Hive"]
Import a single row from an ArrowDataset:
Import["sales_dataset", {"ArrowDataset", "TimeSeries", 5}, "Partitioning" -> "Hive"]
Import some specific rows from an ArrowDataset:
Import["sales_dataset", {"ArrowDataset", "TimeSeries", {1, 5, 7}}, "Partitioning" -> "Hive"]
Import the first 10 rows of an ArrowDataset:
Import["sales_dataset", {"ArrowDataset", "TimeSeries", 1 ;; 10}, "Partitioning" -> "Hive"]
Import only selected columns using column names:
Import["sales_dataset", {"ArrowDataset", "TimeSeries", All, {"Product", "Date", "Quantity"}}, "Partitioning" -> "Hive"]
Import Options (3)
"Format" (1)
By default, the format of ArrowDataset is inferred from files stored in the input directory:
FileSystemMap[FileSize, "tabular_dataset", Infinity]
Use the "Format" option to specify the underlying format:
Import["tabular_dataset", "ArrowDataset", "Format" -> "Parquet", "Partitioning" -> "Hive"]
"Partitioning" (1)
By default, "Partitioning" -> None is used. Notice that the column used for partitioning is not imported:
Import["tabular_dataset", "ArrowDataset"] // Dimensions
Use the "Partitioning" option with the correct setting to get all columns:
Import["tabular_dataset", "ArrowDataset", "Partitioning" -> "Hive"] // Dimensions
"TimeColumn" (1)
Export a Tabular object to an ArrowDataset:
Export["date_dataset", Tabular[Association["RawSchema" -> Association["ColumnProperties" ->
Association["Date" -> Association["ElementType" -> TypeSpecifier["Date"]["Integer32", "Day",
"Gregorian", None]], "Value" -> Association["ElementType" -> "Real32"]],
"KeyColumns" -> None, "Backend" -> "WolframKernel"], "Options" -> {},
"BackendData" -> Association["ColumnData" -> DataStructure["ColumnTable",
{{TabularColumn[Association["Data" -> {5, {{NumericArray[{13150, 13151, 13152, 13153, 13156},
"Integer32"], {}, None}}, None}, "ElementType" -> "Date"["Integer32", "Day",
"Gregorian", None]]], TabularColumn[Association[
"Data" -> {NumericArray[{11.819999694824219, 12.039999961853027, 12.09000015258789,
11.880000114440918, 12.430000305175781}, "Real32"], {}, None},
"ElementType" -> "Real32"]]}}]]]], "ArrowDataset", "SplitColumns" -> "Date", OverwriteTarget -> True]
By default, the time column is selected automatically for "TimeSeries" and "EventSeries" elements:
Import["date_dataset", {"ArrowDataset", "TimeSeries"}, "Partitioning" -> "Hive"]
Use the "TimeColumn" option to specify the time column:
Import["date_dataset", {"ArrowDataset", "TimeSeries"}, "Partitioning" -> "Hive", "TimeColumn" -> "Value"]
Export Options (6)
"Format" (1)
By default, Export uses "Parquet" format:
FileSystemMap[FileSize, "tabular_dataset", Infinity]
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
Export["tabular_dataset_arrowipc", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "Format" -> "ArrowIPC"]
FileSystemMap[FileSize, "tabular_dataset_arrowipc", Infinity]
"MaxPartitions" (1)
When the number of unique elements in the split column exceeds the "MaxPartitions" setting (4096 by default), Export fails:
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
Export["tabular_dataset_location", tabular, "ArrowDataset", "SplitColumns" -> "Location"]
Increase the allowed number of partitions:
Export["tabular_dataset_location", tabular, "ArrowDataset", "SplitColumns" -> "Location", "MaxPartitions" -> 27000, OverwriteTarget -> True]
"MaxRowsPerFile" (1)
By default, the number of rows per file is unlimited:
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
Export["tabular_dataset", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", OverwriteTarget -> True]
FileNames["*.*", "tabular_dataset", Infinity] // Length
Limit the number of rows per file:
Export["tabular_dataset_limit", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "MaxRowsPerFile" -> 800, OverwriteTarget -> True]
FileNames["*.*", "tabular_dataset_limit", Infinity] // Length
"NameTemplate" (1)
By default, "part{i}" is used as the name template for ArrowDataset files:
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
Export["tabular_dataset_limit", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "MaxRowsPerFile" -> 5000, OverwriteTarget -> True]
FileSystemMap[FileSize, "tabular_dataset_limit", Infinity]
Export["tabular_dataset_myname", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "MaxRowsPerFile" -> 5000, "NameTemplate" -> "myname{i}", OverwriteTarget -> True]
FileSystemMap[FileSize, "tabular_dataset_myname", Infinity]
"Partitioning" (1)
By default, Export uses "Hive" partitioning:
FileSystemMap[FileSize, "tabular_dataset", Infinity]
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
Export["tabular_dataset_directory", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", "Partitioning" -> "Directory"]
FileSystemMap[FileSize, "tabular_dataset_directory", Infinity]
"SplitColumns" (1)
Export requires the "SplitColumns" option:
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
Export["tabular_dataset", tabular, "ArrowDataset", OverwriteTarget -> True]
Only column keys of the Tabular object can be used as values of the "SplitColumns" option:
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
tabular["ColumnKeys"]
Export["tabular_dataset_split", tabular, "ArrowDataset", "SplitColumns" -> {"a", "b"}, OverwriteTarget -> True]
Export["tabular_dataset_split", tabular, "ArrowDataset", "SplitColumns" -> {"Number of Trunks", "Location Type"}, OverwriteTarget -> True]
Possible Issues (1)
Export requires the "SplitColumns" option:
tabular = Import["ExampleData/TreesOwnedByTheCityOfChampaign.csv", "Tabular"];
Export["tabular_dataset", tabular, "ArrowDataset", OverwriteTarget -> True]
Export["tabular_dataset", tabular, "ArrowDataset", "SplitColumns" -> "Number of Trunks", OverwriteTarget -> True]
History
Introduced in 2025 (14.2) | Updated in 2026 (15.0)