Two-dimensional Datasets

Record Batches

class arrow::RecordBatch

Collection of equal-length arrays matching a particular Schema.

A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.

Public Functions

Result<std::shared_ptr<StructArray>> ToStructArray() const

Convert record batch to struct array.

Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array.

bool Equals(const RecordBatch &other, bool check_metadata = false) const

Determine if two record batches are exactly equal.

Return

true if batches are equal

Parameters
  • [in] other: the RecordBatch to compare with

  • [in] check_metadata: if true, check that Schema metadata is the same

bool ApproxEquals(const RecordBatch &other) const

Determine if two record batches are approximately equal.

std::shared_ptr<Schema> schema() const

Return

the record batch's schema

std::vector<std::shared_ptr<Array>> columns() const

Retrieve all columns at once.

std::shared_ptr<Array> column(int i) const = 0

Retrieve an array from the record batch.

Return

an Array object

Parameters
  • [in] i: field index, does not boundscheck

std::shared_ptr<Array> GetColumnByName(const std::string &name) const

Retrieve an array from the record batch.

Return

an Array or null if no field was found

Parameters
  • [in] name: field name

std::shared_ptr<ArrayData> column_data(int i) const = 0

Retrieve an array’s internal data from the record batch.

Return

an internal ArrayData object

Parameters
  • [in] i: field index, does not boundscheck

ArrayDataVector column_data() const = 0

Retrieve all arrays’ internal data from the record batch.

Result<std::shared_ptr<RecordBatch>> AddColumn(int i, const std::shared_ptr<Field> &field, const std::shared_ptr<Array> &column) const = 0

Add column to the record batch, producing a new RecordBatch.

Parameters
  • [in] i: field index, which will be boundschecked

  • [in] field: field to be added

  • [in] column: column to be added

Result<std::shared_ptr<RecordBatch>> AddColumn(int i, std::string field_name, const std::shared_ptr<Array> &column) const

Add new nullable column to the record batch, producing a new RecordBatch.

For non-nullable columns, use the Field-based version of this method.

Parameters
  • [in] i: field index, which will be boundschecked

  • [in] field_name: name of field to be added

  • [in] column: column to be added

Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0

Remove column from the record batch, producing a new RecordBatch.

Parameters
  • [in] i: field index, does boundscheck

const std::string &column_name(int i) const

Name of the i-th column.

int num_columns() const

Return

the number of columns in the record batch

int64_t num_rows() const

Return

the number of rows (the corresponding length of each column)

std::shared_ptr<RecordBatch> Slice(int64_t offset) const

Slice each of the arrays in the record batch.

Return

new record batch

Parameters
  • [in] offset: the starting offset to slice, through end of batch

std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0

Slice each of the arrays in the record batch.

Return

new record batch

Parameters
  • [in] offset: the starting offset to slice

  • [in] length: the number of elements to slice from offset

std::string ToString() const

Return

PrettyPrint representation suitable for debugging

Status Validate() const

Perform cheap validation checks to determine obvious inconsistencies within the record batch’s schema and internal data.

This is O(k) where k is the total number of fields and array descendants.

Return

Status

Status ValidateFull() const

Perform extensive validation checks to determine inconsistencies within the record batch’s schema and internal data.

This is potentially O(k*n) where n is the number of rows.

Return

Status

Public Static Functions

std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<Array>> columns)

Parameters
  • [in] schema: The record batch schema

  • [in] num_rows: length of fields in the record batch. Each array should have the same length as num_rows

  • [in] columns: the record batch fields as vector of arrays

std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<ArrayData>> columns)

Construct record batch from vector of internal data structures.

This class is intended for internal use, or advanced users.

Since

0.5.0

Parameters
  • schema: the record batch schema

  • num_rows: the number of semantic rows in the record batch. This should be equal to the length of each field

  • columns: the data for the batch’s columns

Result<std::shared_ptr<RecordBatch>> FromStructArray(const std::shared_ptr<Array> &array)

Construct record batch from struct array.

This constructs a record batch using the child arrays of the given array, which must be a struct array. Note that the struct array’s own null bitmap is not reflected in the resulting record batch.

class arrow::RecordBatchReader

Abstract interface for reading a stream of record batches.

Subclassed by arrow::csv::StreamingReader, arrow::ipc::RecordBatchStreamReader, arrow::py::PyRecordBatchReader, arrow::TableBatchReader

Public Functions

std::shared_ptr<Schema> schema() const = 0

Return

the shared schema of the record batches in the stream

Status ReadNext(std::shared_ptr<RecordBatch> *batch) = 0

Read the next record batch in the stream.

Return null for batch when reaching end of stream

Return

Status

Parameters
  • [out] batch: the next loaded batch, null at end of stream

Result<std::shared_ptr<RecordBatch>> Next()

Iterator interface.

Status ReadAll(RecordBatchVector *batches)

Consume entire stream as a vector of record batches.

Status ReadAll(std::shared_ptr<Table> *table)

Read all batches and concatenate as arrow::Table.

Public Static Functions

Result<std::shared_ptr<RecordBatchReader>> Make(RecordBatchVector batches, std::shared_ptr<Schema> schema = NULLPTR)

Create a RecordBatchReader from a vector of RecordBatch.

Parameters
  • [in] batches: the vector of RecordBatch to read from

  • [in] schema: schema to conform to. Will be inferred from the first element if not provided.

class arrow::TableBatchReader : public arrow::RecordBatchReader

Compute a stream of record batches from a (possibly chunked) Table.

The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.

Public Functions

TableBatchReader(const Table &table)

Construct a TableBatchReader for the given table.

std::shared_ptr<Schema> schema() const override

Return

the shared schema of the record batches in the stream

Status ReadNext(std::shared_ptr<RecordBatch> *out) override

Read the next record batch in the stream.

Return null for batch when reaching end of stream

Return

Status

Parameters
  • [out] out: the next loaded batch, null at end of stream

void set_chunksize(int64_t chunksize)

Set the desired maximum chunk size of record batches.

The actual chunk size of each record batch may be smaller, depending on actual chunking characteristics of each table column.

Tables

class arrow::Table

Logical table as sequence of chunked arrays.

Public Functions

std::shared_ptr<Schema> schema() const

Return the table schema.

std::shared_ptr<ChunkedArray> column(int i) const = 0

Return a column by index.

std::vector<std::shared_ptr<ChunkedArray>> columns() const

Return vector of all columns for table.

std::shared_ptr<Field> field(int i) const

Return a column’s field by index.

std::vector<std::shared_ptr<Field>> fields() const

Return vector of all fields for table.

std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0

Construct a zero-copy slice of the table with the indicated offset and length.

Return

a new object wrapped in std::shared_ptr<Table>

Parameters
  • [in] offset: the index of the first row in the constructed slice

  • [in] length: the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly

std::shared_ptr<Table> Slice(int64_t offset) const

Slice from first row at offset until end of the table.

std::shared_ptr<ChunkedArray> GetColumnByName(const std::string &name) const

Return a column by name.

Return

a ChunkedArray or null if no field was found

Parameters
  • [in] name: field name

Result<std::shared_ptr<Table>> RemoveColumn(int i) const = 0

Remove column from the table, producing a new Table.

Result<std::shared_ptr<Table>> AddColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0

Add column to the table, producing a new Table.

Result<std::shared_ptr<Table>> SetColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0

Replace a column in the table, producing a new Table.

std::vector<std::string> ColumnNames() const

Return names of all columns.

Result<std::shared_ptr<Table>> RenameColumns(const std::vector<std::string> &names) const

Rename columns with provided names.

Result<std::shared_ptr<Table>> SelectColumns(const std::vector<int> &indices) const

Return new table with specified columns.

std::shared_ptr<Table> ReplaceSchemaMetadata(const std::shared_ptr<const KeyValueMetadata> &metadata) const = 0

Replace schema key-value metadata with new metadata (EXPERIMENTAL)

Since

0.5.0

Return

new Table

Parameters
  • [in] metadata: new KeyValueMetadata

Result<std::shared_ptr<Table>> Flatten(MemoryPool *pool = default_memory_pool()) const = 0

Flatten the table, producing a new Table.

Any column with a struct type will be flattened into multiple columns.

Parameters
  • [in] pool: The pool for buffer allocations, if any

std::string ToString() const

Return

PrettyPrint representation suitable for debugging

Status Validate() const = 0

Perform cheap validation checks to determine obvious inconsistencies within the table’s schema and internal data.

This is O(k*m) where k is the total number of field descendants, and m is the number of chunks.

Return

Status

Status ValidateFull() const = 0

Perform extensive validation checks to determine inconsistencies within the table’s schema and internal data.

This is O(k*n) where k is the total number of field descendants, and n is the number of rows.

Return

Status

int num_columns() const

Return the number of columns in the table.

int64_t num_rows() const

Return the number of rows (equal to each column’s logical length)

bool Equals(const Table &other, bool check_metadata = false) const

Determine if tables are equal.

Two tables can be equal only if they have equal schemas. However, they may be equal even if they have different chunkings.

Result<std::shared_ptr<Table>> CombineChunks(MemoryPool *pool = default_memory_pool()) const

Make a new table by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk.

Parameters
  • [in] pool: The pool for buffer allocations

Public Static Functions

std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, std::vector<std::shared_ptr<ChunkedArray>> columns, int64_t num_rows = -1)

Construct a Table from schema and columns.

If columns is zero-length, the table’s number of rows is zero

Parameters
  • [in] schema: The table schema (column types)

  • [in] columns: The table’s columns as chunked arrays

  • [in] num_rows: number of rows in table, -1 (default) to infer from columns

std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<Array>> &arrays, int64_t num_rows = -1)

Construct a Table from schema and arrays.

Parameters
  • [in] schema: The table schema (column types)

  • [in] arrays: The table’s columns as arrays

  • [in] num_rows: number of rows in table, -1 (default) to infer from columns

Result<std::shared_ptr<Table>> FromRecordBatchReader(RecordBatchReader *reader)

Construct a Table from a RecordBatchReader.

Parameters
  • [in] reader: the RecordBatchReader from which to read the record batches

Result<std::shared_ptr<Table>> FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>> &batches)

Construct a Table from RecordBatches, using schema supplied by the first RecordBatch.

Parameters
  • [in] batches: a std::vector of record batches

Result<std::shared_ptr<Table>> FromRecordBatches(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<RecordBatch>> &batches)

Construct a Table from RecordBatches, using supplied schema.

There may be zero record batches

Parameters
  • [in] schema: the arrow::Schema for each batch

  • [in] batches: a std::vector of record batches

Result<std::shared_ptr<Table>> FromChunkedStructArray(const std::shared_ptr<ChunkedArray> &array)

Construct a Table from a chunked StructArray.

One column will be produced for each field of the StructArray.

Parameters
  • [in] array: the chunked StructArray to convert

Result<std::shared_ptr<Table>> arrow::ConcatenateTables(const std::vector<std::shared_ptr<Table>> &tables, ConcatenateTablesOptions options = ConcatenateTablesOptions::Defaults(), MemoryPool *memory_pool = default_memory_pool())

Construct table from multiple input tables.