Getting Started¶
This guide will help you get started with anyschema and explore its core functionality.
Basic Usage¶
anyschema accepts specifications in multiple formats (and more to come, see [anyschema#11])
and converts them to dataframe schemas. Let's explore each approach and when to use it.
With Pydantic Models¶
The most common way to use anyschema is with Pydantic models:
from anyschema import AnySchema
from pydantic import BaseModel


class User(BaseModel):
    id: int
    username: str
    email: str
    is_active: bool


schema = AnySchema(spec=User)
Convert to different schema formats:
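The conversion methods used throughout this guide are `to_arrow()`, `to_polars()`, and `to_pandas()`. A minimal sketch that collects all three might look like the following. `convert_all` is a hypothetical helper, not part of anyschema, and calling `to_pandas()` without arguments is an assumption based on `dtype_backend` defaulting to `None`.

```python
from typing import Any


def convert_all(schema: Any) -> dict[str, Any]:
    """Collect the three output formats used in this guide.

    `schema` is expected to be an AnySchema instance. This helper is
    hypothetical and relies only on the three methods named below.
    """
    return {
        "arrow": schema.to_arrow(),    # pyarrow schema
        "polars": schema.to_polars(),  # polars schema
        "pandas": schema.to_pandas(),  # dict of column name -> dtype
    }
```

With anyschema and the target libraries installed, `convert_all(AnySchema(spec=User))` would return one entry per output format.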
With TypedDict¶
You can use TypedDict for a lightweight way to define typed structures:
from anyschema import AnySchema
from typing_extensions import TypedDict


class User(TypedDict):
    id: int
    username: str
    email: str
    is_active: bool


schema = AnySchema(spec=User)
print(schema.to_arrow())
With dataclasses¶
You can also use plain Python dataclasses:
from anyschema import AnySchema
from dataclasses import dataclass


@dataclass
class User:
    id: int
    username: str
    email: str
    is_active: bool


schema = AnySchema(spec=User)
print(schema.to_arrow())
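A TypedDict and a dataclass with the same annotations carry the same field information, which is what allows them to describe the same schema. The sketch below shows this using only the standard library. It is an illustration, not a description of how anyschema reads specs internally. It uses `typing.TypedDict` instead of `typing_extensions.TypedDict` to stay dependency-free, and the class names `UserDict` and `UserData` are chosen for this example.

```python
from dataclasses import dataclass
from typing import TypedDict, get_type_hints


class UserDict(TypedDict):
    id: int
    username: str
    email: str
    is_active: bool


@dataclass
class UserData:
    id: int
    username: str
    email: str
    is_active: bool


# Both class styles expose their fields as an ordered name -> type mapping.
typed_dict_fields = get_type_hints(UserDict)
dataclass_fields = get_type_hints(UserData)

print(typed_dict_fields == dataclass_fields)
print(list(dataclass_fields))
```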
With attrs classes¶
attrs provides a powerful way to write classes with less boilerplate:
from anyschema import AnySchema
from attrs import define


@define
class User:
    id: int
    username: str
    email: str
    is_active: bool


schema = AnySchema(spec=User)
print(schema.to_arrow())
With Python Mappings¶
You can also use plain Python mappings (such as dictionaries):
from anyschema import AnySchema

spec = {
    "id": int,
    "username": str,
    "email": str,
    "is_active": bool,
}

schema = AnySchema(spec=spec)
print(schema.to_arrow())
With Sequence of Tuples¶
Or use a sequence of (name, type) tuples:
from anyschema import AnySchema

spec = [
    ("id", int),
    ("username", str),
    ("email", str),
    ("is_active", bool),
]

schema = AnySchema(spec=spec)
print(schema.to_polars())
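The mapping form and the sequence-of-tuples form describe the same ordered list of fields. As a rough illustration, the sketch below reduces both to a single shape. `normalize_spec` is a hypothetical helper written for this example and is not part of anyschema's API.

```python
from collections.abc import Mapping, Sequence
from typing import Union

Spec = Union[Mapping[str, type], Sequence[tuple[str, type]]]


def normalize_spec(spec: Spec) -> list[tuple[str, type]]:
    """Return the spec as an ordered list of (name, type) pairs."""
    if isinstance(spec, Mapping):
        # Dictionaries preserve insertion order, so field order is kept.
        return list(spec.items())
    return [(name, tp) for name, tp in spec]


mapping_spec = {"id": int, "username": str, "email": str, "is_active": bool}
sequence_spec = [("id", int), ("username", str), ("email", str), ("is_active", bool)]

print(normalize_spec(mapping_spec) == normalize_spec(sequence_spec))
```

Either form can be used wherever the other fits. The choice mostly depends on whether the field list is already held as a dictionary or built up incrementally.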
Nested Types¶
You can use nested structures with Pydantic models, dataclasses, or TypedDict:
from anyschema import AnySchema
from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    country: str


class Person(BaseModel):
    name: str
    age: int
    addresses: list[Address]


schema = AnySchema(spec=Person)
pa_schema = schema.to_arrow()
print(pa_schema)
name: string
age: int64
addresses: list
As you can see, the addresses field, which holds a list of nested models, is represented in the schema as a list
of structs.
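To make the nesting explicit, here is a minimal stdlib-only sketch that walks a nested spec and renders it in a schema-like notation. It uses dataclasses rather than Pydantic models to stay dependency-free. `describe` is a hypothetical helper, and its output format is illustrative only: it is not what pyarrow prints and not how anyschema traverses types.

```python
from dataclasses import dataclass, is_dataclass
from typing import get_args, get_origin, get_type_hints


@dataclass
class Address:
    street: str
    city: str
    country: str


@dataclass
class Person:
    name: str
    age: int
    addresses: list[Address]


def describe(tp: object) -> str:
    """Render a type as a nested, schema-like string."""
    if is_dataclass(tp):
        # A class with fields becomes a struct of its described fields.
        hints = get_type_hints(tp)
        inner = ", ".join(f"{name}: {describe(t)}" for name, t in hints.items())
        return f"struct<{inner}>"
    if get_origin(tp) is list:
        # list[X] becomes a list of whatever X describes to.
        (item,) = get_args(tp)
        return f"list<{describe(item)}>"
    return getattr(tp, "__name__", str(tp))


print(describe(Person))
```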
Working with (Integer) Constraints¶
Constraints are processed by the AnnotatedTypesStep
parser step, which refines types based on their metadata. The following examples demonstrate how constraints are handled.
Pydantic's constrained integer types are automatically converted to appropriate unsigned or signed integers:
from anyschema import AnySchema
from pydantic import BaseModel, PositiveInt, NonNegativeInt


class Metrics(BaseModel):
    count: PositiveInt
    offset: NonNegativeInt
    delta: int


schema = AnySchema(spec=Metrics)
arrow_schema = schema.to_arrow()
print(arrow_schema)
Using Annotated Types¶
You can also use typing.Annotated with constraint metadata:
from typing import Annotated

from anyschema import AnySchema
from pydantic import BaseModel, Field


class Product(BaseModel):
    name: str
    price: Annotated[float, Field(gt=0)]  # Price must be positive
    quantity: Annotated[
        int, Field(ge=0, lt=100)
    ]  # Quantity must be non-negative, and say we limit it to <100


schema = AnySchema(spec=Product)
print(schema.to_polars())
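Pydantic defines `PositiveInt` and `NonNegativeInt` as `Annotated[int, ...]` aliases carrying `Gt(0)` and `Ge(0)` metadata, so both styles above reduce to inspecting annotation metadata. The sketch below shows the general idea of choosing between signed and unsigned integers from a lower bound. The `Ge` and `Gt` classes are hypothetical stand-ins defined here to avoid a dependency, and `integer_kind` is an illustrative rule, not the logic of AnnotatedTypesStep.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin


@dataclass(frozen=True)
class Ge:
    """Hypothetical stand-in for a "greater than or equal" constraint."""

    value: int


@dataclass(frozen=True)
class Gt:
    """Hypothetical stand-in for a "strictly greater than" constraint."""

    value: int


def integer_kind(tp: object) -> str:
    """Return "unsigned" when the metadata rules out negatives, else "signed"."""
    if get_origin(tp) is not Annotated:
        return "signed"
    _base, *metadata = get_args(tp)
    for constraint in metadata:
        if isinstance(constraint, Ge) and constraint.value >= 0:
            return "unsigned"
        # For integers, "> -1" is the same as ">= 0".
        if isinstance(constraint, Gt) and constraint.value >= -1:
            return "unsigned"
    return "signed"


print(integer_kind(Annotated[int, Gt(0)]))  # like PositiveInt
print(integer_kind(Annotated[int, Ge(0)]))  # like NonNegativeInt
print(integer_kind(int))                    # plain int
```

An upper bound such as `lt=100` could be used in the same way to pick a narrower integer width. Whether and how a given parser step does so is not covered by this sketch.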
Using Narwhals Directly¶
You can also work with Narwhals schemas directly (and pass them to AnySchema, which acts as a no-op in this case):
from narwhals.schema import Schema
import narwhals as nw

from anyschema import AnySchema

# Create a Narwhals schema
nw_schema = Schema(
    {
        "id": nw.Int64(),
        "name": nw.String(),
        "scores": nw.List(nw.Float64()),
    }
)

schema = AnySchema(spec=nw_schema)
Pandas output format¶
pandas schema
Unlike pyarrow and polars, pandas does not have a native schema representation. Therefore our output is a dictionary mapping column names to dtypes.
pandas dtype backends
pandas supports multiple dtype backends, which affect the nullability of types:
None (default): Uses standard NumPy dtypes (not nullable).
"numpy_nullable": Uses pandas nullable dtypes (e.g., Int64 instead of int64).
"pyarrow": Uses PyArrow-backed dtypes (better performance, native nullable support).
You can specify which backend to use via the dtype_backend parameter, either for all fields together, or for each
field individually.
Let's see it in practice:
from anyschema import AnySchema
from pydantic import BaseModel, PositiveInt, NonNegativeInt


class Metrics(BaseModel):
    count: PositiveInt
    offset: NonNegativeInt
    delta: int


schema = AnySchema(spec=Metrics)
pd_schema = schema.to_pandas(
    dtype_backend=(
        "pyarrow",  # `count` will be mapped to a pyarrow dtype
        "numpy_nullable",  # `offset` will be mapped to a pandas nullable numpy dtype
        None,  # `delta` will be mapped to the default pandas numpy dtype
    )
)
print(pd_schema)
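The two ways of passing `dtype_backend` (one value for all fields, or one value per field) can be pictured as a simple pairing of fields with backends. The sketch below is a stdlib-only illustration. `backend_per_field` is a hypothetical helper, and the length check is an assumption about reasonable behavior rather than a statement of what anyschema does.

```python
from collections.abc import Sequence
from typing import Optional, Union

Backend = Optional[str]  # "pyarrow", "numpy_nullable", or None


def backend_per_field(
    field_names: Sequence[str],
    dtype_backend: Union[Backend, Sequence[Backend]] = None,
) -> dict[str, Backend]:
    """Pair every field with a backend (hypothetical helper)."""
    if dtype_backend is None or isinstance(dtype_backend, str):
        # A single value applies to all fields.
        return {name: dtype_backend for name in field_names}
    if len(dtype_backend) != len(field_names):
        raise ValueError("expected one dtype_backend per field")
    # Otherwise backends are matched to fields by position.
    return dict(zip(field_names, dtype_backend))


fields = ["count", "offset", "delta"]
print(backend_per_field(fields, "pyarrow"))
print(backend_per_field(fields, ("pyarrow", "numpy_nullable", None)))
```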
Error Handling¶
anyschema will raise exceptions for unsupported types:
from anyschema import AnySchema

try:
    # This will fail - set is not supported
    schema = AnySchema(spec={"invalid": set})
    arrow_schema = schema.to_arrow()
except NotImplementedError as e:
    print(f"Error: {e}")
# Error: No parser in the pipeline could handle type: builtins.set
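The error message suggests a pipeline in which each parser step either handles a type or passes it on, with an exception raised when no step succeeds. The sketch below illustrates that general pattern with the standard library only. `parse_builtin` and `run_pipeline` are hypothetical names, the dtype strings are placeholders, and this is not anyschema's actual pipeline.

```python
from typing import Callable, Optional

# A step returns a dtype name, or None when it cannot handle the type.
ParserStep = Callable[[type], Optional[str]]


def parse_builtin(tp: type) -> Optional[str]:
    """Handle a few builtin types; return None for everything else."""
    return {int: "int64", str: "string", bool: "bool", float: "float64"}.get(tp)


def run_pipeline(tp: type, steps: list[ParserStep]) -> str:
    """Try each step in order and fail loudly if none can handle the type."""
    for step in steps:
        result = step(tp)
        if result is not None:
            return result
    raise NotImplementedError(
        f"No parser in the pipeline could handle type: {tp.__module__}.{tp.__qualname__}"
    )


print(run_pipeline(int, [parse_builtin]))
try:
    run_pipeline(set, [parse_builtin])
except NotImplementedError as e:
    print(f"Error: {e}")
```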
For Union types with more than two members (excluding None), an error is raised: