Lecture 21
Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve both the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in memory. This format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
Core implementations are available in C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.
The basic building blocks of Arrow are array objects; an array is a collection of data of a uniform type. A table is created by combining multiple arrays to form the columns, while also attaching a name to each column.
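A minimal sketch of building arrays and a table (column names chosen to match the example table shown later in this lecture):

import pyarrow as pa

num  = pa.array([1, 2, 3, 2], type=pa.int8())
year = pa.array([2019, 2020, 2021, 2022])
name = pa.array(["Alice", "Bob", "Carol", "Dave"])

table = pa.table([num, year, name], names=["num", "year", "name"])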
Elements of an array can be selected using [] with an integer index or a slice; the former returns a typed scalar, the latter an array.
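For example, continuing with the num array from above:

num[0]    # <pyarrow.Int8Scalar: 1> -- a typed scalar
num[1:3]  # an int8 array containing [2, 3]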
The following types are language agnostic for the purpose of portability; however, some differ slightly from what is available in NumPy and Pandas (or R):
- Fixed-length primitive types: numbers, booleans, dates and times, fixed-size binary, decimals, and other values that fit into a given number of bits, e.g. bool_(), uint64(), timestamp(), date64(), and many more
- Variable-length primitive types: binary, string
- Nested types: list, map, struct, and union
- Dictionary type: an encoded categorical type
Schemas are data structures that contain information on the names and types of the columns of a table (or record batch). Schemas can also store additional metadata (e.g. codebook-like textual descriptions) in the form of a string:string dictionary.
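A sketch of constructing a schema with metadata (the textual descriptions here are hypothetical):

schema = pa.schema(
  [("num", pa.int8()), ("year", pa.int64()), ("name", pa.string())],
  metadata={"num": "id number of the subject", "name": "name of the subject"}
)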
list type:
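A sketch of constructing a list array (values are illustrative):

pa.array([[1, 2], [3], None], type=pa.list_(pa.int64()))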
struct type:
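A sketch of constructing a struct array (fields are illustrative):

pa.array(
  [{"x": 1, "y": True}, {"x": 2, "y": False}],
  type=pa.struct([("x", pa.int8()), ("y", pa.bool_())])
)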
A dictionary array is the equivalent of a factor in R or pd.Categorical in Pandas.
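For example, an existing array can be dictionary encoded (values are illustrative):

pa.array(["a", "b", "a", "c"]).dictionary_encode()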
Between a table and an array, Arrow has the concept of a Record Batch, which represents a chunk of a larger table. Record batches are composed of a named collection of equal-length arrays.
[] can be used with a Record Batch to select columns (by name or index) or rows (by slice); additionally, the slice() method can be used to select rows.
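A sketch of constructing and indexing a record batch (reusing the arrays from the earlier example):

batch = pa.RecordBatch.from_arrays([num, year, name], names=["num", "year", "name"])

batch["name"]      # column by name
batch[0:2]         # rows by slice
batch.slice(0, 2)  # equivalent row selection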
As mentioned previously, table objects are not part of the Arrow specification; rather, they are a convenience tool provided to help with the wrangling of multiple Record Batches.
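A sketch of how the table printed below could have been constructed (assuming three copies of the batch from the example above):

table = pa.Table.from_batches([batch] * 3)
table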
pyarrow.Table
num: int8
year: int64
name: string
----
num: [[1,2,3,2],[1,2,3,2],[1,2,3,2]]
year: [[2019,2020,2021,2022],[2019,2020,2021,2022],[2019,2020,2021,2022]]
name: [["Alice","Bob","Carol","Dave"],["Alice","Bob","Carol","Dave"],["Alice","Bob","Carol","Dave"]]
The columns of the table are therefore composed of the columns of each of the batches; these are stored as ChunkedArrays instead of Arrays to reflect this.
Conversion between NumPy arrays and Arrow arrays is straightforward:
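For example (the values here are illustrative):

import numpy as np

np_arr = np.array([1, 2, 3, 4])
arr = pa.array(np_arr)  # NumPy -> Arrow
arr.to_numpy()          # Arrow -> NumPy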
We’ve already seen some basic conversion of Arrow table objects to Pandas. These conversions are a bit more complex than with NumPy, due in large part to how Pandas handles missing data.
| Source (Pandas) | Destination (Arrow) |
|---|---|
| bool | BOOL |
| (u)int{8,16,32,64} | (U)INT{8,16,32,64} |
| float32 | FLOAT |
| float64 | DOUBLE |
| str / unicode | STRING |
| pd.Categorical | DICTIONARY |
| pd.Timestamp | TIMESTAMP(unit=ns) |
| datetime.date | DATE |
| datetime.time | TIME64 |
| Source (Arrow) | Destination (Pandas) |
|---|---|
| BOOL | bool |
| BOOL with nulls | object (with values True, False, None) |
| (U)INT{8,16,32,64} | (u)int{8,16,32,64} |
| (U)INT{8,16,32,64} with nulls | float64 |
| FLOAT | float32 |
| DOUBLE | float64 |
| STRING | str |
| DICTIONARY | pd.Categorical |
| TIMESTAMP(unit=*) | pd.Timestamp (np.datetime64[ns]) |
| DATE | object (with datetime.date objects) |
| TIME64 | object (with datetime.time objects) |
Due to these discrepancies, it is much more likely that converting from an Arrow array to a Pandas series will require a type to be changed, in which case the data will need to be copied. Like to_numpy(), to_pandas() also accepts the zero_copy_only argument; however, its default is False.
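A hedged reconstruction of the kind of calls that produce the output and error shown below (the exact arrays used originally are assumed, not known):

arr = pa.chunked_array([[1, 2, 3, 4]])
arr.to_pandas()  # copies by default -> the Series shown below

strs = pa.chunked_array([["a", "b", "c", "d"]])
strs.to_pandas(zero_copy_only=True)  # raises ArrowInvalid -- string data always requires a copy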
0 1
1 2
2 3
3 4
dtype: int64
Error: pyarrow.lib.ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
Zero copy conversions from Array or ChunkedArray to NumPy arrays or pandas Series are possible in certain narrow cases:

- The Arrow data is stored in an integer (signed or unsigned int8 through int64) or floating point type (float16 through float64). This includes many numeric types as well as timestamps.
- The Arrow data has no null values (since these are represented using bitmaps which are not supported by pandas).
- For ChunkedArray, the data consists of a single chunk, i.e. arr.num_chunks == 1. Multiple chunks will always require a copy because of pandas’s contiguousness requirement.

In these scenarios, to_pandas or to_numpy will be zero copy. In all other scenarios, a copy will be required.
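A sketch of the narrow case where zero copy succeeds, and of why whole tables generally fail (values illustrative):

chunked = pa.chunked_array([[1.0, 2.0, 3.0]])
chunked.to_pandas(zero_copy_only=True)  # OK: single chunk, float type, no nulls

# A whole Table is different: pandas stores 2-d blocks, so
# table.to_pandas(zero_copy_only=True) raises the error shown below.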
Error: pyarrow.lib.ArrowInvalid: Cannot do zero copy conversion into multi-column DataFrame block
To convert from a Pandas DataFrame to an Arrow Table we can use the from_pandas() method (schemas can also be inferred from DataFrames):
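For example (the DataFrame here is illustrative):

import pandas as pd

pdf = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

pa.Table.from_pandas(pdf)   # full conversion
pa.Schema.from_pandas(pdf)  # infer just the schema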
Text and delimiter based file formats like CSV are the most common and generally considered the most portable; however, they have a number of significant drawbacks:
- no explicit schema or other metadata
- column types must be inferred from the data
- numerical values stored as text (efficiency and precision issues)
- limited compression options
Parquet provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop, with systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark adopting it as a shared standard for high performance data IO.
- The values in each column are physically stored in contiguous memory locations
- Efficient column-wise compression saves storage space
- Compression techniques specific to a type can be applied
- Queries that fetch specific column values do not read the entire row
- Different encoding techniques can be applied to different columns
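A sketch of column-selective reads with pyarrow's Parquet module (the path, table, and column names here are assumed for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

tbl = pa.table({"num": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
pq.write_table(tbl, "scratch/example.parquet")
pq.read_table("scratch/example.parquet", columns=["name"])  # reads only this column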
Feather is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally. Feather was created early in the Arrow project as a proof of concept for fast, language-agnostic data frame storage for Python (pandas) and R.
- Direct columnar serialization of Arrow tables
- Supports all Arrow data types and compression
- Language agnostic
- Metadata makes it possible to read only the necessary columns for an operation
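A sketch of writing and reading Feather files with compression (path, table, and options assumed for illustration):

import pyarrow as pa
import pyarrow.feather as feather

tbl = pa.table({"num": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
feather.write_feather(tbl, "scratch/example.feather", compression="zstd")
feather.read_table("scratch/example.feather", columns=["name"])  # read only the needed column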
import numpy as np
import pandas as pd

np.random.seed(1234)

# Resample the penguins data up to 10 million rows and jitter the numeric columns
df = (
  pd.read_csv("https://sta663-sp22.github.io/slides/data/penguins.csv")
  .sample(10_000_000, replace=True)
  .reset_index(drop=True)
)

num_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
df[num_cols] = df[num_cols] + np.random.normal(size=(df.shape[0], len(num_cols)))
df
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Chinstrap Dream 50.261096 18.764164 200.607259 3800.208171 male 2008
1 Gentoo Biscoe 49.594921 13.085831 223.316267 5551.827810 male 2008
2 Chinstrap Dream 45.919136 18.488115 190.024941 3450.928250 female 2007
3 Adelie Biscoe 41.829861 21.047924 200.945039 4050.412137 male 2008
4 Gentoo Biscoe 44.846957 14.669941 211.926404 4400.888107 female 2008
... ... ... ... ... ... ... ... ...
9999995 Gentoo Biscoe 49.994321 14.545704 221.270307 5448.948790 male 2009
9999996 Chinstrap Dream 51.568717 18.883271 196.283438 3749.222035 male 2007
9999997 Gentoo Biscoe 38.937558 13.634993 212.118822 4650.113320 female 2007
9999998 Adelie Torgersen 35.785542 17.751472 185.578798 3150.013148 female 2009
9999999 Adelie Biscoe 38.170442 20.282051 182.632278 3600.463322 male 2007
[10000000 rows x 8 columns]
import os
import pyarrow as pa
import pyarrow.feather

os.makedirs("scratch/", exist_ok=True)

# Write the same data in each format for comparison
df.to_csv("scratch/penguins-large.csv")
df.to_parquet("scratch/penguins-large.parquet")

pyarrow.feather.write_feather(
  pa.Table.from_pandas(df),
  "scratch/penguins-large.feather"
)
pyarrow.feather.write_feather(
  pa.Table.from_pandas(df.dropna()),
  "scratch/penguins-large_nona.feather"
)
## scratch/penguins-large.csv 1018.68 MB
## scratch/penguins-large.parquet 314.19 MB
## scratch/penguins-large.feather 489.14 MB
## scratch/penguins-large_nona.feather 509.24 MB
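The timings below come from %timeit runs reading these files back in; a hedged sketch of the kinds of calls being timed (the exact mapping of call to timing is not recoverable from the output alone):

import pandas as pd
import pyarrow.feather

pd.read_csv("scratch/penguins-large.csv")
pd.read_parquet("scratch/penguins-large.parquet")
pyarrow.feather.read_feather("scratch/penguins-large.feather")  # to a pandas DataFrame
pyarrow.feather.read_table("scratch/penguins-large.feather")    # to an Arrow Table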
## 5.2 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## 713 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## 359 ms ± 61.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## 213 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## 90.9 ms ± 528 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
## 921 ms ± 75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## 727 ms ± 41.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## 542 ms ± 6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## 5.21 s ± 82.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## 80.8 ms ± 619 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Polars is a lightning fast DataFrame library/in-memory query engine. Its embarrassingly parallel execution, cache efficient algorithms, and expressive API make it perfect for efficient data wrangling, data pipelines, snappy APIs, and so much more.
The goal of Polars is to provide a lightning fast DataFrame library that:
- Utilizes all available cores on your machine.
- Optimizes queries to reduce unneeded work/memory allocations.
- Handles datasets much larger than your available RAM.
- Has an API that is consistent and predictable.
- Has a strict schema (data-types should be known before running the query).
Polars is written in Rust, which gives it C/C++ performance and allows it to fully control performance-critical parts of the query engine.
- Polars does not have a multi-index/index
- Polars uses Apache Arrow arrays to represent data in memory, while Pandas uses NumPy arrays
- Polars has more support for parallel operations than Pandas
- Polars can lazily evaluate queries and apply query optimization (see the sketch below)
- Polars syntax is similar to, but distinct from, Pandas
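A minimal sketch of Polars' lazy API (column names taken from the penguins data above; group_by is spelled groupby in older Polars releases):

import polars as pl

query = (
  pl.scan_csv("scratch/penguins-large.csv")  # lazy -- nothing is read yet
  .filter(pl.col("species") == "Gentoo")
  .group_by("island")
  .agg(pl.col("body_mass_g").mean())
)

print(query.explain())  # inspect the optimized query plan
res = query.collect()   # execute the query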
Sta 663 - Spring 2023