This section provides a look into some of the pandas internals. It is primarily intended for developers of pandas itself.
pandas implements a few objects that can serve as valid containers for axis labels:
Index: the generic “ordered set” object, an ndarray of object dtype assuming nothing about its contents. The labels must be hashable (and likely immutable) and unique. Populates a dict of label to location in Cython to do O(1) lookups.
Int64Index: a version of Index highly optimized for 64-bit integer data, such as time stamps
Float64Index: a version of Index highly optimized for 64-bit float data
MultiIndex: the standard hierarchical index object
DatetimeIndex: an Index object with Timestamp boxed elements (the underlying implementation stores int64 values)
TimedeltaIndex: an Index object with Timedelta boxed elements (the underlying implementation stores int64 values)
PeriodIndex: An Index object with Period elements
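As a rough illustration of the containers listed above, the sketch below builds a few of them and inspects what their elements look like. It assumes a reasonably recent pandas release; the `asi8` attribute used to peek at the integer storage is an implementation detail and is treated here as an assumption rather than a stable API.

```python
import numpy as np
import pandas as pd

# Generic Index: hashable labels with label -> location lookups.
labels = pd.Index(["a", "b", "c"])
loc_b = labels.get_loc("b")

# DatetimeIndex: elements are boxed as Timestamp objects.
dt_idx = pd.DatetimeIndex(["2020-01-01", "2020-01-02"])
first_ts = dt_idx[0]
# Assumption: asi8 exposes the underlying int64 storage.
dt_storage = dt_idx.asi8

# TimedeltaIndex: elements are boxed as Timedelta objects.
td_idx = pd.TimedeltaIndex(["1 days", "2 days"])
first_td = td_idx[0]
td_storage = td_idx.asi8

# PeriodIndex: elements are Period objects.
p_idx = pd.PeriodIndex(["2020-01", "2020-02"], freq="M")
first_period = p_idx[0]

print(type(first_ts).__name__, dt_storage.dtype)
print(type(first_td).__name__, td_storage.dtype)
print(type(first_period).__name__)
```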
There are functions that make the creation of a regular index easy:
date_range: fixed frequency date range generated from a time rule or DateOffset. Returns a DatetimeIndex whose elements are Timestamp objects
period_range: fixed frequency period range generated from a time rule or DateOffset. Returns a PeriodIndex of Period objects, representing timespans
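A minimal sketch of the two helper functions, assuming a recent pandas release in which both return index objects. The dates and frequencies are arbitrary illustration values.

```python
import pandas as pd

# Three consecutive days starting on 2020-01-01.
days = pd.date_range("2020-01-01", periods=3, freq="D")

# Three consecutive monthly periods starting with January 2020.
months = pd.period_range("2020-01", periods=3, freq="M")

print(type(days).__name__, list(days))
print(type(months).__name__, list(months))
```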
The motivation for having an Index class in the first place was to enable different implementations of indexing. This means that it’s possible for you, the user, to implement a custom Index subclass that may be better suited to a particular application than the ones provided in pandas.
From an internal implementation point of view, the relevant methods that an Index must define are one or more of the following (depending on how incompatible the new object internals are with the Index functions):
get_loc: returns an “indexer” (an integer, or in some cases a slice object) for a label
slice_locs: returns the “range” to slice between two labels
get_indexer: Computes the indexing vector for reindexing / data alignment purposes. See the source / docstrings for more on this
get_indexer_non_unique: Computes the indexing vector for reindexing / data alignment purposes when the index is non-unique. See the source / docstrings for more on this
reindex: Does any pre-conversion of the input index then calls get_indexer
union, intersection: computes the union or intersection of two Index objects
insert: Inserts a new label into an Index, yielding a new object
delete: Delete a label, yielding a new object
drop: Deletes a set of labels
take: Analogous to ndarray.take
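To make the methods listed above more concrete, here is a small sketch that calls each of them on a plain Index. It only shows what the built-in Index returns, which is what a custom subclass would be expected to mimic. The labels are made up for illustration.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c", "d"])

# get_loc: integer position for a unique label.
loc = idx.get_loc("b")

# For a sorted index with repeated labels, get_loc returns a slice.
dup_sorted = pd.Index([10, 10, 20])
loc_slice = dup_sorted.get_loc(10)

# slice_locs: (start, stop) positions covering the labels "b" through "c".
start, stop = idx.slice_locs("b", "c")

# get_indexer: position of each target label, -1 where the label is missing.
indexer = idx.get_indexer(["c", "a", "z"])

# get_indexer_non_unique: same idea when the index has duplicates.
# It also returns the positions in the target that were not found.
dup = pd.Index(["a", "b", "a"])
nu_indexer, nu_missing = dup.get_indexer_non_unique(["a", "z"])

# reindex: returns the new index together with the indexer into the old one.
new_index, reindexer = idx.reindex(["d", "a"])

# Set operations and the "mutating" methods all return new objects.
unioned = idx.union(pd.Index(["d", "e"]))
common = idx.intersection(pd.Index(["b", "d", "x"]))
inserted = idx.insert(1, "x")      # insert label "x" at position 1
deleted = idx.delete(0)            # delete by position
dropped = idx.drop(["a", "d"])     # drop by label
taken = idx.take([3, 0])           # positional selection, like ndarray.take

print(loc, loc_slice, (start, stop), indexer)
```

Note that `delete` works on positions while `drop` works on labels, and that none of these calls modify `idx` in place.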
Internally, the MultiIndex consists of a few things: the levels, the integer codes (named labels until version 0.24), and the level names:
In [1]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']],
   ...:                                    names=['first', 'second'])
   ...:

In [2]: index
Out[2]:
MultiIndex([(0, 'one'),
            (0, 'two'),
            (1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['first', 'second'])

In [3]: index.levels
Out[3]: FrozenList([[0, 1, 2], ['one', 'two']])

In [4]: index.codes
Out[4]: FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

In [5]: index.names
Out[5]: FrozenList(['first', 'second'])
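One way to see how the levels and codes fit together is to rebuild the index from them. The sketch below is only an illustration of the relationship, not how pandas materializes tuples internally: each level is indexed positionally by its codes, and the resulting arrays are combined again.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [range(3), ["one", "two"]], names=["first", "second"]
)

# For each level, look up the label that every integer code points to.
arrays = [level.take(codes) for level, codes in zip(index.levels, index.codes)]

# Recombining the per-level label arrays gives back the same MultiIndex.
rebuilt = pd.MultiIndex.from_arrays(arrays, names=index.names)

print(rebuilt.equals(index))
```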
You can probably guess that the codes determine which unique element of each level is identified with each location in the index. It’s important to note that sortedness is determined solely from the integer codes and does not check (or care) whether the levels themselves are sorted. Fortunately, the constructors from_tuples and from_arrays take care of this for you, but if you compute the levels and codes yourself, please be careful.
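The difference between the helper constructors and hand-built levels and codes can be sketched as follows. The data is made up. The point is only that from_arrays derives sorted levels on its own, while the MultiIndex constructor keeps whatever level order it is given.

```python
import pandas as pd

# from_arrays factorizes the input: levels come out sorted,
# and the codes point into those sorted levels.
built = pd.MultiIndex.from_arrays([["b", "a", "b"], [2, 1, 1]])
built_levels = [list(level) for level in built.levels]
built_codes = [list(codes) for codes in built.codes]

# Passing levels and codes directly: the unsorted level ["b", "a"]
# is stored as given, so code 0 means "b" here.
manual = pd.MultiIndex(
    levels=[["b", "a"], [1, 2]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)
manual_levels = [list(level) for level in manual.levels]

print(built_levels, built_codes)
print(manual_levels, list(manual))
```

In the hand-built index the codes are in ascending order even though the labels they refer to are not, which is exactly the situation the warning above is about.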
pandas extends NumPy’s type system with custom types, like Categorical or datetimes with a timezone, so we have multiple notions of “values”. For 1-D containers (Index classes and Series) we have the following convention:
cls._ndarray_values is always a NumPy ndarray. Ideally, _ndarray_values is cheap to compute. For example, for a Categorical, this returns the codes, not the array of objects.
cls._values is the “best possible” array. This could be an ndarray, an ExtensionArray, or an Index subclass (note: we’re in the process of removing the Index subclasses here so that it’s always an ndarray or ExtensionArray).
So, for example, Series[category]._values is a Categorical, while Series[category]._ndarray_values is the underlying codes.
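The categorical case can be sketched as below. Both `_values` and `_ndarray_values` are private attributes, and `_ndarray_values` may not exist in the pandas version you have installed, so this example reads the codes through the public `.cat.codes` accessor instead. Treat it as an illustration of the convention rather than a supported API.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# Private attribute: the "best possible" array backing the Series.
# For a categorical Series this is a Categorical.
best_array = s._values

# The cheap NumPy representation of a Categorical is its integer codes.
# Public route used here in place of the private _ndarray_values.
codes = s.cat.codes.to_numpy()

print(type(best_array).__name__)
print(codes, list(best_array.categories))
```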
This section has been moved to Subclassing pandas data structures.