IP Extension Types

polars_iptools provides two Arrow extension types for storing IP addresses efficiently and preserving type information through Parquet and IPC round-trips.

| Type      | Storage           | Best for                                            |
|-----------|-------------------|-----------------------------------------------------|
| IPv4      | UInt32 (4 bytes)  | IPv4-only datasets                                  |
| IPAddress | Binary (16 bytes) | Mixed IPv4/IPv6, or any data written to Parquet/IPC |

Both types are registered automatically when you import polars_iptools.
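The storage formats in the table map directly onto Python's stdlib ipaddress module, which makes a useful mental model. This sketch shows the raw encodings only; the library's exact internal layout is an assumption here:

```python
import ipaddress

# An IPv4 address is exactly a 32-bit unsigned integer (hence UInt32 storage).
addr = ipaddress.IPv4Address("8.8.8.8")
print(int(addr))          # 134744072 == 0x08080808
print(addr.packed.hex())  # "08080808" -- 4 bytes

# An IPv6 address is 16 bytes, so a 16-byte Binary can hold any address.
addr6 = ipaddress.IPv6Address("2606:4700::1111")
print(len(addr6.packed))  # 16
```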

Reference

to_ipv4 — parse strings into the IPv4 type

import polars as pl
import polars_iptools as ip

df = pl.DataFrame({"ip": ["8.8.8.8", "192.168.1.1", "invalid"]})
df.with_columns(ip.to_ipv4("ip").alias("parsed"))
# shape: (3, 2)
# ┌─────────────┬─────────────┐
# │ ip          ┆ parsed      │
# │ ---         ┆ ---         │
# │ str         ┆ ipv4        │
# ╞═════════════╪═════════════╡
# │ 8.8.8.8     ┆ 8.8.8.8     │
# │ 192.168.1.1 ┆ 192.168.1.1 │
# │ invalid     ┆ null        │
# └─────────────┴─────────────┘
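The parse-or-null behavior can be approximated without the library using the stdlib ipaddress module. This is a sketch; `ip_to_u32` is a hypothetical helper, not part of polars_iptools:

```python
import ipaddress

def ip_to_u32(s):
    """Parse a dotted-quad string to its UInt32 value; None if invalid."""
    try:
        return int(ipaddress.IPv4Address(s))
    except ipaddress.AddressValueError:
        return None

print([ip_to_u32(s) for s in ["8.8.8.8", "192.168.1.1", "invalid"]])
# [134744072, 3232235777, None]
```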

to_address — unified IPv4 + IPv6 type

df = pl.DataFrame({"ip": ["8.8.8.8", "2606:4700::1111", "192.168.1.1"]})
df.with_columns(ip.to_address("ip").alias("parsed"))
# shape: (3, 2)
# ┌─────────────────┬─────────────────┐
# │ ip              ┆ parsed          │
# │ ---             ┆ ---             │
# │ str             ┆ ip_addr         │
# ╞═════════════════╪═════════════════╡
# │ 8.8.8.8         ┆ 8.8.8.8         │
# │ 2606:4700::1111 ┆ 2606:4700::1111 │
# │ 192.168.1.1     ┆ 192.168.1.1     │
# └─────────────────┴─────────────────┘
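One way a single 16-byte Binary can hold both families is to widen IPv4 addresses, for example into the IPv4-mapped IPv6 form (`::ffff:a.b.c.d`). Whether polars_iptools uses exactly this mapping is an assumption; the sketch only illustrates the idea:

```python
import ipaddress

def to_16_bytes(s):
    """Pack any IP string into 16 bytes; IPv4 is widened to IPv4-mapped IPv6.
    (Illustrative only -- the library's real encoding may differ.)"""
    a = ipaddress.ip_address(s)
    if isinstance(a, ipaddress.IPv4Address):
        a = ipaddress.IPv6Address(f"::ffff:{a}")
    return a.packed

for s in ["8.8.8.8", "2606:4700::1111"]:
    print(s, len(to_16_bytes(s)))  # both fit in 16 bytes
```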

to_string — convert back to canonical strings

df = pl.DataFrame({"ip": ["8.8.8.8", "2606:4700::1111"]})
df.with_columns(ip.to_address("ip").ip.to_string().alias("canonical"))
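"Canonical" here means the shortest standard text form (lowercase hex and zero-run compression for IPv6, per RFC 5952) — the same normalization Python's stdlib ipaddress applies, which presumably matches the library's output:

```python
import ipaddress

# Equivalent spellings normalize to one canonical string.
print(str(ipaddress.ip_address("2606:4700:0:0:0:0:0:1111")))  # 2606:4700::1111
print(str(ipaddress.ip_address("2606:4700::1111")))           # 2606:4700::1111
```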

.ip namespace

All conversion functions are also available through the .ip namespace, on both expressions and Series:

# Expression namespace
pl.col("src_ip").ip.to_address()
pl.col("src_ip").ip.to_string()

# Series namespace
df["src_ip"].ip.to_ipv4()

Extracting IPs from free text

logs = pl.DataFrame({
    "text": [
        "conn from 8.8.8.8 and defanged 192[.]168[.]1[.]1",
        "public 1.1.1.1 and private 10.0.0.1",
    ]
})

# All IPs (defanged handled automatically)
logs.with_columns(ip.extract_ips("text"))

# Public only
logs.with_columns(ip.extract_public_ips("text"))

# Private only
logs.with_columns(ip.extract_private_ips("text"))

# With individual filters
logs.with_columns(
    ip.extract_ips("text", ipv6=True, ignore_loopback=True)
)
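Under the hood, defanged-IP handling amounts to "re-fanging" `[.]` back to `.` before matching. A minimal stdlib sketch of the idea — the regex and refang rules here are simplified assumptions, not the library's actual patterns:

```python
import re

# Naive IPv4 shape; does not validate octet ranges (0-255)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_ips(text):
    """Re-fang '[.]' defanging, then match IPv4-shaped tokens."""
    refanged = text.replace("[.]", ".")
    return IPV4_RE.findall(refanged)

print(extract_ips("conn from 8.8.8.8 and defanged 192[.]168[.]1[.]1"))
# ['8.8.8.8', '192.168.1.1']
```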

End-to-end workflow

This example shows a realistic pipeline: parse raw log IPs into typed columns, enrich with ASN data, write to Parquet, then reload — with the IPAddress type preserved automatically.

import polars as pl
import polars_iptools as ip

# --- 1. Raw log data with mixed IPv4/IPv6 ---
logs = pl.DataFrame({
    "ts": ["2024-01-01", "2024-01-01", "2024-01-01"],
    "src_ip": ["8.8.8.8", "2606:4700::1111", "192.168.1.1"],
    "bytes": [1024, 512, 256],
})

# --- 2. Parse to typed columns ---
# IPAddress handles both IPv4 and IPv6; preserves through Parquet/IPC
enriched = logs.with_columns(
    ip.to_address("src_ip").alias("src_ip_typed")
)

# --- 3. Enrich with ASN (works directly on IPAddress columns) ---
enriched = enriched.with_columns(
    ip.geoip.asn(pl.col("src_ip_typed")).alias("asn")
)
# shape: (3, 5)
# ┌────────────┬─────────────────┬───────┬─────────────────┬───────────────────────┐
# │ ts         ┆ src_ip          ┆ bytes ┆ src_ip_typed    ┆ asn                   │
# │ ---        ┆ ---             ┆ ---   ┆ ---             ┆ ---                   │
# │ str        ┆ str             ┆ i64   ┆ ip_addr         ┆ str                   │
# ╞════════════╪═════════════════╪═══════╪═════════════════╪═══════════════════════╡
# │ 2024-01-01 ┆ 8.8.8.8         ┆ 1024  ┆ 8.8.8.8         ┆ AS15169 GOOGLE        │
# │ 2024-01-01 ┆ 2606:4700::1111 ┆ 512   ┆ 2606:4700::1111 ┆ AS13335 CLOUDFLARENET │
# │ 2024-01-01 ┆ 192.168.1.1     ┆ 256   ┆ 192.168.1.1     ┆                       │
# └────────────┴─────────────────┴───────┴─────────────────┴───────────────────────┘

# --- 4. Write to Parquet (extension type metadata is preserved) ---
enriched.write_parquet("logs_enriched.parquet")

# --- 5. Reload — src_ip_typed comes back as IPAddress, not Binary ---
reloaded = pl.read_parquet("logs_enriched.parquet")
print(reloaded.dtypes)
# [String, String, Int64, IPAddress, String]
#                          ^^^^^^^^^ type preserved!

# --- 6. Continue working with the typed column ---
reloaded.with_columns(
    pl.col("src_ip_typed").ip.to_string().alias("src_ip_str")
)

Why write typed columns to Parquet?

When you store raw IP strings, every read requires re-parsing. With IPAddress, the 16-byte binary is stored directly and the extension type name (polars_iptools.ip_address) is embedded in Parquet metadata — so re-reading restores the typed column automatically, with no extra .with_columns() needed.
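The cost difference is easy to see in miniature: a stored string must go back through the text parser on every read, while a stored packed value reconstructs an address directly from its bytes. A stdlib sketch of the round-trip:

```python
import ipaddress

# What Binary storage holds: the packed 16-byte value, not text
packed = ipaddress.ip_address("2606:4700::1111").packed

# Reading back is a direct byte-level construction -- no string parsing
restored = ipaddress.ip_address(packed)
print(restored)  # 2606:4700::1111
```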