Input / output¶
Interfaces¶
-
class
arrow::io
::
FileInterface
¶ Subclassed by arrow::io::InputStream, arrow::io::OutputStream
Public Functions
-
Status
Close
() = 0¶ Close the stream cleanly.
For writable streams, this will attempt to flush any pending data before releasing the underlying resource.
After Close() is called, closed() returns true and the stream is not available for further operations.
-
Status
Abort
()¶ Close the stream abruptly.
This method does not guarantee that any pending data is flushed. It merely releases any underlying resource used by the stream for its operation.
After Abort() is called, closed() returns true and the stream is not available for further operations.
-
bool
closed
() const = 0¶ Return whether the stream is closed.
-
class
arrow::io
::
Readable
¶ Subclassed by arrow::io::InputStream
Public Functions
-
Result<int64_t>
Read
(int64_t nbytes, void *out) = 0¶ Read data from current file position.
Read at most nbytes from the current file position into out. The number of bytes read is returned.
-
Result<std::shared_ptr<Buffer>>
Read
(int64_t nbytes) = 0¶ Read data from current file position.
Read at most nbytes from the current file position. Fewer bytes may be read if EOF is reached. This method updates the current file position.
In some cases (e.g. a memory-mapped file), this method may avoid a memory copy.
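Because Read() may return fewer bytes than requested, callers that need an exact byte count typically loop. The following is a minimal sketch of such a loop, not Arrow code: `ToyReadable` and `ReadFully` are hypothetical stand-ins that return plain integers instead of `Result<int64_t>`, and the sketch assumes that a 0-byte read signals EOF, which the text above does not state explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical stand-in for a readable stream (not an Arrow class).
// Each Read() returns at most `max_chunk` bytes, and 0 once the data is exhausted.
class ToyReadable {
 public:
  ToyReadable(std::string data, int64_t max_chunk)
      : data_(std::move(data)), max_chunk_(max_chunk) {}

  int64_t Read(int64_t nbytes, void* out) {
    assert(nbytes >= 0);
    const int64_t remaining = static_cast<int64_t>(data_.size()) - pos_;
    const int64_t n = std::min({nbytes, max_chunk_, remaining});
    if (n > 0) std::memcpy(out, data_.data() + pos_, static_cast<size_t>(n));
    pos_ += n;  // reading advances the current position
    return n;
  }

 private:
  std::string data_;
  int64_t max_chunk_;
  int64_t pos_ = 0;
};

// Keep calling Read() until `nbytes` bytes are collected or a 0-byte read occurs.
template <typename Stream>
int64_t ReadFully(Stream& stream, int64_t nbytes, void* out) {
  auto* dst = static_cast<uint8_t*>(out);
  int64_t total = 0;
  while (total < nbytes) {
    const int64_t n = stream.Read(nbytes - total, dst + total);
    if (n == 0) break;  // assumed EOF signal
    total += n;
  }
  return total;
}
```

With real Arrow streams, each call would additionally need its `Result` checked for errors before the byte count is used.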
-
class
Seekable
¶ Subclassed by arrow::io::RandomAccessFile, arrow::io::WritableFile
-
class
arrow::io
::
Writable
¶ Subclassed by arrow::io::OutputStream
Public Functions
-
Status
Write
(const void *data, int64_t nbytes) = 0¶ Write the given data to the stream.
This method always processes the bytes in full. Depending on the semantics of the stream, the data may be written out immediately, held in a buffer, or written asynchronously. In the case where the stream buffers the data, it will be copied. To avoid potentially large copies, use the Write variant that takes an owned Buffer.
Write the given data to the stream.
Since the Buffer owns its memory, this method can avoid a copy if buffering is required. See Write(const void*, int64_t) for details.
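The difference between the two Write variants can be illustrated with a toy buffering writer. This is a sketch under stated assumptions rather than Arrow's implementation: `ToyBuffer` and `ToyBufferingWriter` are hypothetical names, and a `std::shared_ptr<const std::string>` stands in for an owned Buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an owned, immutable buffer.
using ToyBuffer = std::shared_ptr<const std::string>;

class ToyBufferingWriter {
 public:
  // Raw-pointer variant: the caller may reuse `data` right after the call,
  // so a stream that holds the bytes back has to copy them.
  void Write(const void* data, int64_t nbytes) {
    assert(nbytes >= 0);
    pending_.push_back(std::make_shared<std::string>(
        static_cast<const char*>(data), static_cast<size_t>(nbytes)));
  }

  // Owned-buffer variant: sharing ownership keeps the bytes alive, no copy.
  void Write(const ToyBuffer& buffer) { pending_.push_back(buffer); }

  // Concatenate and drop everything still pending (stands in for a flush).
  std::string Flush() {
    std::string out;
    for (const auto& chunk : pending_) out += *chunk;
    pending_.clear();
    return out;
  }

 private:
  std::vector<ToyBuffer> pending_;
};
```

In this sketch the pointer variant pays for one copy per call, while the buffer variant only increments a reference count until the data is flushed.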
-
class
arrow::io
::
InputStream
: public virtual arrow::io::FileInterface, public virtual arrow::io::Readable¶ Subclassed by arrow::io::internal::InputStreamConcurrencyWrapper< Derived >, arrow::io::RandomAccessFile, arrow::io::StdinStream, arrow::io::TransformInputStream, arrow::io::internal::InputStreamConcurrencyWrapper< BufferedInputStream >, arrow::io::internal::InputStreamConcurrencyWrapper< CompressedInputStream >, arrow::io::SlowInputStreamBase< InputStream >
Public Functions
-
Status
Advance
(int64_t nbytes)¶ Advance (skip) the stream by the indicated number of bytes.
- Return
- Parameters
[in] nbytes
: the number of bytes to move forward
-
Result<util::string_view>
Peek
(int64_t nbytes)¶ Return zero-copy string_view to upcoming bytes.
Do not modify the stream position. The view becomes invalid after any operation on the stream. May trigger buffering if the requested size is larger than the number of buffered bytes.
May return NotImplemented on streams that don’t support it.
- Parameters
[in] nbytes
: the maximum number of bytes to see
-
bool
supports_zero_copy
() const¶ Return true if InputStream is capable of zero copy Buffer reads.
Zero copy reads imply the use of Buffer-returning Read() overloads.
-
class
arrow::io
::
RandomAccessFile
: public std::enable_shared_from_this<RandomAccessFile>, public arrow::io::InputStream, public arrow::io::Seekable¶ Subclassed by arrow::io::HdfsReadableFile, arrow::io::internal::RandomAccessFileConcurrencyWrapper< Derived >, arrow::io::ReadWriteFileInterface, arrow::py::PyReadableFile, parquet::ParquetInputWrapper, arrow::io::internal::RandomAccessFileConcurrencyWrapper< BufferReader >, arrow::io::internal::RandomAccessFileConcurrencyWrapper< CudaBufferReader >, arrow::io::internal::RandomAccessFileConcurrencyWrapper< ReadableFile >, arrow::io::SlowInputStreamBase< RandomAccessFile >
Public Functions
-
~RandomAccessFile
() override¶ Necessary because we hold a std::unique_ptr.
-
Result<int64_t>
GetSize
() = 0¶ Return the total file size in bytes.
This method does not read or move the current file position, so it is safe to call concurrently with e.g. ReadAt().
-
Result<int64_t>
ReadAt
(int64_t position, int64_t nbytes, void *out)¶ Read data from given file position.
At most nbytes bytes are read. The number of bytes read is returned (it can be less than nbytes if EOF is reached).
This method can be safely called from multiple threads concurrently. It is unspecified whether this method updates the file position or not.
The default RandomAccessFile-provided implementation uses Seek() and Read(), but subclasses may override it with a more efficient implementation that doesn’t depend on implicit file positioning.
- Return
The number of bytes read, or an error
- Parameters
[in] position
: Where to read bytes from
[in] nbytes
: The number of bytes to read
[out] out
: The buffer to read bytes into
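A Seek()-then-Read() implementation of ReadAt() can be sketched as follows. This is an illustration only: `ToyRandomAccessFile` is a hypothetical in-memory stand-in, it returns plain integers instead of `Result<int64_t>`, and the mutex is an assumption about one way to keep the Seek/Read pair atomic across threads, not a description of Arrow's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical in-memory file (not an Arrow class).
class ToyRandomAccessFile {
 public:
  explicit ToyRandomAccessFile(std::string data) : data_(std::move(data)) {}

  int64_t GetSize() const { return static_cast<int64_t>(data_.size()); }

  // Positional read built from Seek() + Read(). The lock keeps another
  // thread from moving the shared position between the two steps.
  int64_t ReadAt(int64_t position, int64_t nbytes, void* out) {
    std::lock_guard<std::mutex> guard(mutex_);
    Seek(position);
    return Read(nbytes, out);
  }

 private:
  void Seek(int64_t position) {
    assert(position >= 0 && position <= GetSize());
    pos_ = position;
  }

  // Reads at most `nbytes`; fewer when EOF is reached.
  int64_t Read(int64_t nbytes, void* out) {
    assert(nbytes >= 0);
    const int64_t n = std::min(nbytes, GetSize() - pos_);
    if (n > 0) std::memcpy(out, data_.data() + pos_, static_cast<size_t>(n));
    pos_ += n;
    return n;
  }

  std::string data_;
  int64_t pos_ = 0;
  std::mutex mutex_;
};
```

This version does move the file position as a side effect, which is why callers should not rely on the position after a ReadAt() call. An override based on a truly positional primitive would need neither the lock nor the shared position.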
-
Result<std::shared_ptr<Buffer>>
ReadAt
(int64_t position, int64_t nbytes)¶ Read data from given file position.
At most nbytes bytes are read, but it can be less if EOF is reached.
- Return
A buffer containing the bytes read, or an error
- Parameters
[in] position
: Where to read bytes from
[in] nbytes
: The number of bytes to read
-
Future<std::shared_ptr<Buffer>>
ReadAsync
(const AsyncContext&, int64_t position, int64_t nbytes)¶ EXPERIMENTAL: Read data asynchronously.
-
Status
WillNeed
(const std::vector<ReadRange> &ranges)¶ EXPERIMENTAL: Inform that the given ranges may be read soon.
Some implementations might arrange to prefetch some of the data. However, no guarantee is made and the default implementation does nothing. For robust prefetching, use ReadAt() or ReadAsync().
Public Static Functions
Create an isolated InputStream that reads a segment of a RandomAccessFile.
Multiple such streams can be created and used independently without interference.
- Parameters
[in] file
: a file instance
[in] file_offset
: the starting position in the file
[in] nbytes
: the extent of bytes to read. The file should have sufficient bytes available
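One way such isolated segment streams can avoid interfering with each other is for each stream to keep its own position and read only through positional reads on the shared file. The sketch below shows that idea with hypothetical stand-in types (`ToyFile`, `ToySegmentStream`). It is an assumption about a plausible design, not Arrow's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Hypothetical shared file offering only positional reads (no shared cursor).
struct ToyFile {
  std::string data;

  int64_t ReadAt(int64_t position, int64_t nbytes, void* out) const {
    assert(position >= 0 && nbytes >= 0);
    const int64_t size = static_cast<int64_t>(data.size());
    const int64_t n = std::min(nbytes, size - position);
    if (n <= 0) return 0;  // at or past EOF
    std::memcpy(out, data.data() + position, static_cast<size_t>(n));
    return n;
  }
};

// Hypothetical stream over bytes [file_offset, file_offset + nbytes) of a file.
class ToySegmentStream {
 public:
  ToySegmentStream(std::shared_ptr<const ToyFile> file, int64_t file_offset,
                   int64_t nbytes)
      : file_(std::move(file)), offset_(file_offset), nbytes_(nbytes) {}

  int64_t Read(int64_t nbytes, void* out) {
    assert(nbytes >= 0);
    // Never read past the end of this stream's segment.
    const int64_t n = std::min(nbytes, nbytes_ - pos_);
    const int64_t got = file_->ReadAt(offset_ + pos_, n, out);
    pos_ += got;  // only this stream's private position moves
    return got;
  }

 private:
  std::shared_ptr<const ToyFile> file_;
  int64_t offset_;
  int64_t nbytes_;
  int64_t pos_ = 0;
};
```

Two `ToySegmentStream` objects over the same `ToyFile` can be read in any interleaving, since neither touches state owned by the other.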
-
class
OutputStream
: public virtual arrow::io::FileInterface, public arrow::io::Writable¶ Subclassed by arrow::io::BufferedOutputStream, arrow::io::BufferOutputStream, arrow::io::CompressedOutputStream, arrow::io::FileOutputStream, arrow::io::HdfsOutputStream, arrow::io::MockOutputStream, arrow::io::StderrStream, arrow::io::StdoutStream, arrow::io::WritableFile, arrow::py::PyOutputStream, parquet::ParquetOutputWrapper
-
class
ReadWriteFileInterface
: public arrow::io::RandomAccessFile, public arrow::io::WritableFile¶ Subclassed by arrow::io::MemoryMappedFile
Concrete implementations¶
In-memory streams¶
-
class
arrow::io
::
BufferReader
: public arrow::io::internal::RandomAccessFileConcurrencyWrapper<BufferReader>¶ Random access zero-copy reads on an arrow::Buffer.
Public Functions
-
BufferReader
(const util::string_view &data)¶ Instantiate from std::string or arrow::util::string_view.
Does not own data
-
bool
closed
() const override¶ Return whether the stream is closed.
-
bool
supports_zero_copy
() const override¶ Return true if InputStream is capable of zero copy Buffer reads.
Zero copy reads imply the use of Buffer-returning Read() overloads.
-
Future<std::shared_ptr<Buffer>>
ReadAsync
(const AsyncContext&, int64_t position, int64_t nbytes) override¶ EXPERIMENTAL: Read data asynchronously.
-
Status
WillNeed
(const std::vector<ReadRange> &ranges) override¶ EXPERIMENTAL: Inform that the given ranges may be read soon.
Some implementations might arrange to prefetch some of the data. However, no guarantee is made and the default implementation does nothing. For robust prefetching, use ReadAt() or ReadAsync().
-
class
arrow::io
::
MockOutputStream
: public arrow::io::OutputStream¶ A helper class to track the size of allocations.
Writes to this stream do not copy or retain any data; they only increment a size counter, which can later be used to determine exactly how much space must be allocated for the actual write.
Public Functions
-
Status
Close
() override¶ Close the stream cleanly.
For writable streams, this will attempt to flush any pending data before releasing the underlying resource.
After Close() is called, closed() returns true and the stream is not available for further operations.
-
bool
closed
() const override¶ Return whether the stream is closed.
-
Status
Write
(const void *data, int64_t nbytes) override¶ Write the given data to the stream.
This method always processes the bytes in full. Depending on the semantics of the stream, the data may be written out immediately, held in a buffer, or written asynchronously. In the case where the stream buffers the data, it will be copied. To avoid potentially large copies, use the Write variant that takes an owned Buffer.
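The size-counting idea behind MockOutputStream lends itself to a two-pass pattern: run the serialization once against a counting sink, then allocate exactly that much and run it again. The sketch below uses hypothetical stand-ins (`ToyCountingSink`, `ToyExactSink`, `WriteRows`) and does not use Arrow's classes or accessor names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sink that discards the data and only tracks the total size.
class ToyCountingSink {
 public:
  void Write(const void* /*data*/, int64_t nbytes) {
    assert(nbytes >= 0);
    size_ += nbytes;
  }
  int64_t size() const { return size_; }

 private:
  int64_t size_ = 0;
};

// Hypothetical sink that reserves its storage once, up front.
class ToyExactSink {
 public:
  explicit ToyExactSink(int64_t size) { bytes_.reserve(static_cast<size_t>(size)); }
  void Write(const void* data, int64_t nbytes) {
    assert(nbytes >= 0);
    bytes_.append(static_cast<const char*>(data), static_cast<size_t>(nbytes));
  }
  const std::string& bytes() const { return bytes_; }

 private:
  std::string bytes_;
};

// The same serialization routine runs against either kind of sink.
template <typename Sink>
void WriteRows(const std::vector<std::string>& rows, Sink& sink) {
  for (const auto& row : rows) {
    sink.Write(row.data(), static_cast<int64_t>(row.size()));
    sink.Write("\n", 1);
  }
}
```

The first pass costs no memory beyond the counter, and the second pass never has to grow its storage.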
-
class
arrow::io
::
BufferOutputStream
: public arrow::io::OutputStream¶ An output stream that writes to a resizable buffer.
Public Functions
-
bool
closed
() const override¶ Return whether the stream is closed.
-
Status
Write
(const void *data, int64_t nbytes) override¶ Write the given data to the stream.
This method always processes the bytes in full. Depending on the semantics of the stream, the data may be written out immediately, held in a buffer, or written asynchronously. In the case where the stream buffers the data, it will be copied. To avoid potentially large copies, use the Write variant that takes an owned Buffer.
-
Status
Reset
(int64_t initial_capacity = 1024, MemoryPool *pool = default_memory_pool())¶ Initialize state of OutputStream with newly allocated memory and set position to 0.
- Return
- Parameters
[in] initial_capacity
: the starting allocated capacity
[inout] pool
: the memory pool to use for allocations
Public Static Functions
-
Result<std::shared_ptr<BufferOutputStream>>
Create
(int64_t initial_capacity = 4096, MemoryPool *pool = default_memory_pool())¶ Create in-memory output stream with indicated capacity using a memory pool.
- Return
the created stream
- Parameters
[in] initial_capacity
: the initial allocated internal capacity of the OutputStream
[inout] pool
: a MemoryPool to use for allocations
-
class
arrow::io
::
FixedSizeBufferWriter
: public arrow::io::WritableFile¶ An output stream that writes into a fixed-size mutable buffer.
Public Functions
The input buffer must be mutable; otherwise the program will abort.
-
Status
Close
() override¶ Close the stream cleanly.
For writable streams, this will attempt to flush any pending data before releasing the underlying resource.
After Close() is called, closed() returns true and the stream is not available for further operations.
-
bool
closed
() const override¶ Return whether the stream is closed.
-
Status
Write
(const void *data, int64_t nbytes) override¶ Write the given data to the stream.
This method always processes the bytes in full. Depending on the semantics of the stream, the data may be written out immediately, held in a buffer, or written asynchronously. In the case where the stream buffers the data, it will be copied. To avoid potentially large copies, use the Write variant that takes an owned Buffer.
Local files¶
-
class
arrow::io
::
ReadableFile
: public arrow::io::internal::RandomAccessFileConcurrencyWrapper<ReadableFile>¶ An operating system file open in read-only mode.
Reads through this implementation are unbuffered. If many small reads need to be issued, it is recommended to use a buffering layer for good performance.
Public Functions
-
bool
closed
() const override¶ Return whether the stream is closed.
-
Status
WillNeed
(const std::vector<ReadRange> &ranges) override¶ EXPERIMENTAL: Inform that the given ranges may be read soon.
Some implementations might arrange to prefetch some of the data. However, no guarantee is made and the default implementation does nothing. For robust prefetching, use ReadAt() or ReadAsync().
Public Static Functions
-
Result<std::shared_ptr<ReadableFile>>
Open
(const std::string &path, MemoryPool *pool = default_memory_pool())¶ Open a local file for reading.
- Return
ReadableFile instance
- Parameters
[in] path
: with UTF8 encoding
[in] pool
: a MemoryPool for memory allocations
-
Result<std::shared_ptr<ReadableFile>>
Open
(int fd, MemoryPool *pool = default_memory_pool())¶ Open a local file for reading.
The file descriptor becomes owned by the ReadableFile, and will be closed on Close() or destruction.
- Return
ReadableFile instance
- Parameters
[in] fd
: file descriptor
[in] pool
: a MemoryPool for memory allocations
-
class
arrow::io
::
FileOutputStream
: public arrow::io::OutputStream¶ An operating system file open in write-only mode.
Public Functions
-
Status
Close
() override¶ Close the stream cleanly.
For writable streams, this will attempt to flush any pending data before releasing the underlying resource.
After Close() is called, closed() returns true and the stream is not available for further operations.
-
bool
closed
() const override¶ Return whether the stream is closed.
-
Status
Write
(const void *data, int64_t nbytes) override¶ Write the given data to the stream.
This method always processes the bytes in full. Depending on the semantics of the stream, the data may be written out immediately, held in a buffer, or written asynchronously. In the case where the stream buffers the data, it will be copied. To avoid potentially large copies, use the Write variant that takes an owned Buffer.
Public Static Functions
-
Result<std::shared_ptr<FileOutputStream>>
Open
(const std::string &path, bool append = false)¶ Open a local file for writing, truncating any existing file.
Unless append is true, any existing file at the indicated path is truncated to 0 bytes, deleting its existing data.
- Return
an open FileOutputStream
- Parameters
[in] path
: with UTF8 encoding
[in] append
: append to existing file, otherwise truncate to 0 bytes
-
Result<std::shared_ptr<FileOutputStream>>
Open
(int fd)¶ Open a file descriptor for writing.
The underlying file isn’t truncated.
The file descriptor becomes owned by the OutputStream, and will be closed on Close() or destruction.
- Return
an open FileOutputStream
- Parameters
[in] fd
: file descriptor
-
class
arrow::io
::
MemoryMappedFile
: public arrow::io::ReadWriteFileInterface¶ A file interface that uses memory-mapped files for memory interactions.
This implementation supports zero-copy reads. The same class is used for both reading and writing.
When a file is opened in a writable mode, it is not truncated first, unlike with FileOutputStream.
Public Functions
-
Status
Close
() override¶ Close the stream cleanly.
For writable streams, this will attempt to flush any pending data before releasing the underlying resource.
After Close() is called, closed() returns true and the stream is not available for further operations.
-
bool
closed
() const override¶ Return whether the stream is closed.
-
Result<int64_t>
Read
(int64_t nbytes, void *out) override¶ Read data from current file position.
Read at most nbytes from the current file position into out. The number of bytes read is returned.
-
Result<std::shared_ptr<Buffer>>
Read
(int64_t nbytes) override¶ Read data from current file position.
Read at most nbytes from the current file position. Fewer bytes may be read if EOF is reached. This method updates the current file position.
In some cases (e.g. a memory-mapped file), this method may avoid a memory copy.
-
Result<std::shared_ptr<Buffer>>
ReadAt
(int64_t position, int64_t nbytes) override¶ Read data from given file position.
At most nbytes bytes are read, but it can be less if EOF is reached.
- Return
A buffer containing the bytes read, or an error
- Parameters
[in] position
: Where to read bytes from
[in] nbytes
: The number of bytes to read
-
Result<int64_t>
ReadAt
(int64_t position, int64_t nbytes, void *out) override¶ Read data from given file position.
At most nbytes bytes are read. The number of bytes read is returned (it can be less than nbytes if EOF is reached).
This method can be safely called from multiple threads concurrently. It is unspecified whether this method updates the file position or not.
The default RandomAccessFile-provided implementation uses Seek() and Read(), but subclasses may override it with a more efficient implementation that doesn’t depend on implicit file positioning.
- Return
The number of bytes read, or an error
- Parameters
[in] position
: Where to read bytes from
[in] nbytes
: The number of bytes to read
[out] out
: The buffer to read bytes into
-
Future<std::shared_ptr<Buffer>>
ReadAsync
(const AsyncContext&, int64_t position, int64_t nbytes) override¶ EXPERIMENTAL: Read data asynchronously.
-
Status
WillNeed
(const std::vector<ReadRange> &ranges) override¶ EXPERIMENTAL: Inform that the given ranges may be read soon.
Some implementations might arrange to prefetch some of the data. However, no guarantee is made and the default implementation does nothing. For robust prefetching, use ReadAt() or ReadAsync().
-
bool
supports_zero_copy
() const override¶ Return true if InputStream is capable of zero copy Buffer reads.
Zero copy reads imply the use of Buffer-returning Read() overloads.
-
Status
Write
(const void *data, int64_t nbytes) override¶ Write data at the current position in the file. Thread-safe.
Public Static Functions
-
Result<std::shared_ptr<MemoryMappedFile>>
Create
(const std::string &path, int64_t size)¶ Create new file with indicated size, return in read/write mode.
Buffering input / output wrappers¶
-
class
arrow::io
::
BufferedInputStream
: public arrow::io::internal::InputStreamConcurrencyWrapper<BufferedInputStream>¶ An InputStream that performs buffered reads from an unbuffered InputStream, which can mitigate the overhead of many small reads in some cases.
Public Functions
-
Status
SetBufferSize
(int64_t new_buffer_size)¶ Resize internal read buffer; calls to Read(…) will read at least.
- Return
- Parameters
[in] new_buffer_size
: the new read buffer size
-
int64_t
bytes_buffered
() const¶ Return the number of remaining bytes in the read buffer.
-
int64_t
buffer_size
() const¶ Return the current size of the internal buffer.
-
std::shared_ptr<InputStream>
Detach
()¶ Release the raw InputStream.
Any data buffered will be discarded. Further operations on this object are invalid.
- Return
the underlying raw InputStream
-
std::shared_ptr<InputStream>
raw
() const¶ Return the unbuffered InputStream.
-
bool
closed
() const override¶ Return whether the stream is closed.
Public Static Functions
Create a BufferedInputStream from a raw InputStream.
- Return
the created BufferedInputStream
- Parameters
[in] buffer_size
: the size of the temporary read buffer
[in] pool
: a MemoryPool to use for allocations
[in] raw
: a raw InputStream
[in] raw_read_bound
: a bound on the maximum number of bytes to read from the raw input stream. The default -1 indicates that it is unbounded
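To show why a buffering layer helps with many small reads, the sketch below wraps a raw source that counts its Read() calls. `ToyRawSource` and `ToyBufferedReader` are hypothetical stand-ins and follow only the general idea of a read buffer. They make no claim about how BufferedInputStream handles large reads, memory pools, or raw_read_bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical unbuffered source that counts how often it is read.
class ToyRawSource {
 public:
  explicit ToyRawSource(std::string data) : data_(std::move(data)) {}

  int64_t Read(int64_t nbytes, void* out) {
    ++read_calls_;
    const int64_t remaining = static_cast<int64_t>(data_.size()) - pos_;
    const int64_t n = std::min(nbytes, remaining);
    if (n > 0) std::memcpy(out, data_.data() + pos_, static_cast<size_t>(n));
    pos_ += n;
    return n;
  }
  int read_calls() const { return read_calls_; }

 private:
  std::string data_;
  int64_t pos_ = 0;
  int read_calls_ = 0;
};

// Hypothetical buffering wrapper: refills its buffer with one raw read
// whenever the buffer runs empty, then serves small reads from memory.
class ToyBufferedReader {
 public:
  ToyBufferedReader(ToyRawSource* raw, int64_t buffer_size)
      : raw_(raw), buffer_(static_cast<size_t>(buffer_size)) {
    assert(buffer_size > 0);
  }

  int64_t Read(int64_t nbytes, void* out) {
    auto* dst = static_cast<char*>(out);
    int64_t total = 0;
    while (total < nbytes) {
      if (pos_ == filled_) {  // buffer exhausted: refill
        filled_ = raw_->Read(static_cast<int64_t>(buffer_.size()), buffer_.data());
        pos_ = 0;
        if (filled_ == 0) break;  // raw source is at EOF
      }
      const int64_t n = std::min(nbytes - total, filled_ - pos_);
      std::memcpy(dst + total, buffer_.data() + pos_, static_cast<size_t>(n));
      pos_ += n;
      total += n;
    }
    return total;
  }

  // Bytes already fetched from the raw source but not yet consumed.
  int64_t bytes_buffered() const { return filled_ - pos_; }

 private:
  ToyRawSource* raw_;
  std::vector<char> buffer_;
  int64_t pos_ = 0;
  int64_t filled_ = 0;
};
```

In this sketch, ten one-byte reads through a four-byte buffer reach the raw source three times instead of ten.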
-
class
arrow::io
::
BufferedOutputStream
: public arrow::io::OutputStream¶
Public Functions
-
Status
SetBufferSize
(int64_t new_buffer_size)¶ Resize internal buffer.
- Return
- Parameters
[in] new_buffer_size
: the new buffer size
-
int64_t
buffer_size
() const¶ Return the current size of the internal buffer.
-
int64_t
bytes_buffered
() const¶ Return the number of remaining bytes that have not been flushed to the raw OutputStream.
-
Result<std::shared_ptr<OutputStream>>
Detach
()¶ Flush any buffered writes and release the raw OutputStream.
Further operations on this object are invalid.
- Return
the underlying OutputStream
-
Status
Close
() override¶ Close the buffered output stream.
This implicitly closes the underlying raw output stream.
-
Status
Abort
() override¶ Close the stream abruptly.
This method does not guarantee that any pending data is flushed. It merely releases any underlying resource used by the stream for its operation.
After Abort() is called, closed() returns true and the stream is not available for further operations.
-
bool
closed
() const override¶ Return whether the stream is closed.
-
Status
Write
(const void *data, int64_t nbytes) override¶ Write the given data to the stream.
This method always processes the bytes in full. Depending on the semantics of the stream, the data may be written out immediately, held in a buffer, or written asynchronously. In the case where the stream buffers the data, it will be copied. To avoid potentially large copies, use the Write variant that takes an owned Buffer.
Write the given data to the stream.
Since the Buffer owns its memory, this method can avoid a copy if buffering is required. See Write(const void*, int64_t) for details.
-
std::shared_ptr<OutputStream>
raw
() const¶ Return the underlying raw output stream.
Public Static Functions
Create a buffered output stream wrapping the given output stream.
- Return
the created BufferedOutputStream
- Parameters
[in] buffer_size
: the size of the temporary write buffer
[in] pool
: a MemoryPool to use for allocations
[in] raw
: another OutputStream
Compressed input / output wrappers¶
-
class
arrow::io
::
CompressedInputStream
: public arrow::io::internal::InputStreamConcurrencyWrapper<CompressedInputStream>¶
Public Functions
-
bool
closed
() const override¶ Return whether the stream is closed.
-
std::shared_ptr<InputStream>
raw
() const¶ Return the underlying raw input stream.
Public Static Functions
Create a compressed input stream wrapping the given input stream.
-
class
arrow::io
::
CompressedOutputStream
: public arrow::io::OutputStream¶
Public Functions
-
Status
Close
() override¶ Close the compressed output stream.
This implicitly closes the underlying raw output stream.
-
Status
Abort
() override¶ Close the stream abruptly.
This method does not guarantee that any pending data is flushed. It merely releases any underlying resource used by the stream for its operation.
After Abort() is called, closed() returns true and the stream is not available for further operations.
-
bool
closed
() const override¶ Return whether the stream is closed.
-
Status
Write
(const void *data, int64_t nbytes) override¶ Write the given data to the stream.
This method always processes the bytes in full. Depending on the semantics of the stream, the data may be written out immediately, held in a buffer, or written asynchronously. In the case where the stream buffers the data, it will be copied. To avoid potentially large copies, use the Write variant that takes an owned Buffer.
-
std::shared_ptr<OutputStream>
raw
() const¶ Return the underlying raw output stream.
Public Static Functions
Create a compressed output stream wrapping the given output stream.