pandera 0.24.0
A light-weight and flexible data validation and testing tool for statistical data objects.
pip install pandera
Requires Python: >=3.9
Dependencies
- packaging>=20.0
- pydantic
- typeguard
- typing_extensions
- typing_inspect>=0.6.0
- numpy>=1.24.4; extra == "pandas"
- pandas>=2.1.1; extra == "pandas"
- hypothesis>=6.92.7; extra == "strategies"
- scipy; extra == "hypotheses"
- pyyaml>=5.1; extra == "io"
- black; extra == "io"
- frictionless<=4.40.8; extra == "io"
- pandas-stubs; extra == "mypy"
- fastapi; extra == "fastapi"
- geopandas; extra == "geopandas"
- shapely; extra == "geopandas"
- pyspark[connect]>=3.2.0; extra == "pyspark"
- modin; extra == "modin"
- ray; extra == "modin"
- dask[dataframe]; extra == "modin"
- distributed; extra == "modin"
- modin; extra == "modin-ray"
- ray; extra == "modin-ray"
- modin; extra == "modin-dask"
- dask[dataframe]; extra == "modin-dask"
- distributed; extra == "modin-dask"
- dask[dataframe]; extra == "dask"
- distributed; extra == "dask"
- polars>=0.20.0; extra == "polars"
- hypothesis>=6.92.7; extra == "all"
- scipy; extra == "all"
- pyyaml>=5.1; extra == "all"
- black; extra == "all"
- frictionless<=4.40.8; extra == "all"
- pyspark[connect]>=3.2.0; extra == "all"
- modin; extra == "all"
- ray; extra == "all"
- dask[dataframe]; extra == "all"
- distributed; extra == "all"
- pandas-stubs; extra == "all"
- fastapi; extra == "all"
- geopandas; extra == "all"
- shapely; extra == "all"
- polars>=0.20.0; extra == "all"
The Open-source Framework for Validating DataFrame-like Objects
📊 🔎 ✅
Data validation for scientists, engineers, and analysts seeking correctness.
Pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects. The goal of Pandera is to make data processing pipelines more readable and robust with statistically typed dataframes.
Install
Pandera supports multiple dataframe libraries, including pandas, polars, pyspark, and more. To validate pandas DataFrames, install Pandera with the pandas extra:
With pip:
pip install 'pandera[pandas]'
With uv:
uv pip install 'pandera[pandas]'
With conda:
conda install -c conda-forge pandera-pandas
Get started
First, create a dataframe:
import pandas as pd
import pandera.pandas as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})
Validate the data using the object-based API:
# define a schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
    "column2": pa.Column(float, pa.Check.lt(10)),
    "column3": pa.Column(
        str,
        [
            pa.Check.isin([*"abc"]),
            pa.Check(lambda series: series.str.len() == 1),
        ]
    ),
})

print(schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
Or validate the data using the class-based API:
# define a schema
class Schema(pa.DataFrameModel):
    column1: int = pa.Field(ge=0)
    column2: float = pa.Field(lt=10)
    column3: str = pa.Field(isin=[*"abc"])

    @pa.check("column3")
    def custom_check(cls, series: pd.Series) -> pd.Series:
        return series.str.len() == 1

print(Schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
[!WARNING] Pandera v0.24.0 introduces the pandera.pandas module, which is now the (highly) recommended way of defining DataFrameSchemas and DataFrameModels for pandas data structures like DataFrames. Defining a dataframe schema from the top-level pandera module will produce a FutureWarning:
import pandera as pa
schema = pa.DataFrameSchema({"col": pa.Column(str)})
Update your import to:
import pandera.pandas as pa
And all of the rest of your pandera code should work. Using the top-level pandera module to access DataFrameSchema and the other pandera classes or functions will be deprecated in a future version.
Next steps
See the official documentation to learn more.