BigQuery
Ingest any Google BigQuery query result into the workspace knowledge graph as typed nodes, with incremental delta loads and cross-domain edges to existing entities from other connectors.
Any SQL result → typed nodes + cross-domain edges into the workspace graph.
The BigQuery connector lets you point the workspace at any analytic table or query and ingest the result as typed nodes. It's the escape hatch for data that doesn't fit any of the purpose-built connectors — internal warehouses, dbt models, event tables, CRM mirrors — and the natural cross-domain bridge: the manifest's edge_mappings knob declares how each row links to entities that already exist in your graph from other connectors.
Authentication is via a tenant-supplied service-account JSON. Queries are dry-run first to forecast bytes billed and aborted when the estimate exceeds the configured cost cap, so a malformed query can't quietly drain your BigQuery budget.
What gets ingested
Each row of your query result becomes one node of the configured type. The default is `bigquery.row`; any free-form type works, and the convention is to prefix it with `bigquery.`:
| Source | Node type | Properties |
|---|---|---|
| Query row | <node_type> (your choice) | Every selected column, plus bigquery_external_id (from your external_id_column) for idempotent upserts |
name_column controls the human-readable display name; rows fall back to the external id when the configured column is null.
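The row-to-node mapping described above can be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: the function name `row_to_node`, the returned dict shape, and the choice to raise when the external id is missing are all assumptions made for the example.

```python
from typing import Any, Dict, Mapping


def row_to_node(
    row: Mapping[str, Any],
    *,
    node_type: str = "bigquery.row",
    external_id_column: str = "id",
    name_column: str = "",
) -> Dict[str, Any]:
    """Turn one query-result row into a node dict (hypothetical shape)."""
    external_id = row.get(external_id_column)
    if external_id is None:
        # Assumption: a row without an external id cannot be upserted idempotently.
        raise ValueError(f"row has no value for {external_id_column!r}")
    external_id = str(external_id)

    # Use the configured name column when it is set and non-null,
    # otherwise fall back to the external id.
    name = row.get(name_column) if name_column else None
    display_name = str(name) if name is not None else external_id

    # Every selected column becomes a property, plus the dedup key.
    properties = dict(row)
    properties["bigquery_external_id"] = external_id
    return {"type": node_type, "name": display_name, "properties": properties}
```

A row such as `{"id": 42, "title": None}` with `name_column="title"` would get the display name `"42"`, because the configured column is null.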
Cross-domain edges
The edge_mappings setting declares how each ingested row links to existing nodes in the workspace — regardless of which connector originally created them:
```json
{
  "edge_type": "paid_to",
  "column": "vendor_email",
  "target_node_type": "person",
  "match_property": "email",
  "match_strategy": "normalized_email"
}
```

For every row whose `vendor_email` column matches an existing `person` node's `email` property (via the chosen `match_strategy`), the connector writes a `paid_to` edge from the new BigQuery-row node to that person. `match_strategy` can be `exact`, `normalized_email`, or `normalized_name`.
Malformed mappings log and skip rather than aborting the run.
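To make the matching behavior concrete, here is a small in-memory sketch of how a mapping might be resolved against existing nodes. The normalization rules are assumptions (trim and lowercase for `normalized_email`, collapse whitespace and casefold for `normalized_name`), as are the node dict shape, the `resolve_edges` helper, and the default of `exact` when `match_strategy` is omitted; the connector's real rules may differ.

```python
import logging
from typing import Any, Callable, Dict, Iterable, List, Mapping

logger = logging.getLogger("bigquery_connector_sketch")

# Assumed normalizers; the connector's actual rules are not documented here.
STRATEGIES: Dict[str, Callable[[Any], str]] = {
    "exact": lambda v: str(v),
    "normalized_email": lambda v: str(v).strip().lower(),
    "normalized_name": lambda v: " ".join(str(v).split()).casefold(),
}
REQUIRED_KEYS = ("edge_type", "column", "target_node_type", "match_property")


def resolve_edges(
    row_node: Mapping[str, Any],
    mappings: Iterable[Any],
    existing_nodes: Iterable[Mapping[str, Any]],
) -> List[Dict[str, str]]:
    """Return the edges implied by each mapping, skipping malformed ones."""
    targets = list(existing_nodes)
    edges: List[Dict[str, str]] = []
    for mapping in mappings:
        if not isinstance(mapping, Mapping) or any(k not in mapping for k in REQUIRED_KEYS):
            logger.warning("skipping malformed edge mapping: %r", mapping)
            continue
        normalize = STRATEGIES.get(mapping.get("match_strategy", "exact"))
        if normalize is None:
            logger.warning("skipping mapping with unknown match_strategy: %r", mapping)
            continue
        value = row_node["properties"].get(mapping["column"])
        if value is None:
            continue  # nothing to match on for this row
        key = normalize(value)
        for target in targets:
            if target["type"] != mapping["target_node_type"]:
                continue
            candidate = target["properties"].get(mapping["match_property"])
            if candidate is not None and normalize(candidate) == key:
                edges.append({"type": mapping["edge_type"], "from": row_node["id"], "to": target["id"]})
    return edges
```

The linear scan over `targets` keeps the sketch short; a real implementation would more likely look targets up through an index on the normalized property.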
Real use cases
- CRM mirror: ingest your `crm.opportunities` table as `bigquery.opportunity` nodes, with `edge_mappings` pointing at `account_owner_email` → `person` (so every opportunity is edged to the AE who owns it) and `customer_domain` → `organization` (so opportunities cluster by account).
- Product event rollups: point at a dbt model that summarizes "weekly active workspaces per customer" and ingest each row as a `bigquery.usage_snapshot` node with edges to the corresponding `organization`. The agent can now answer "which customers are growing usage week-over-week?" as a graph traversal.
- Vendor → person bridge: pair with Plaid. Ingest a vendor CSV from BigQuery and stamp `vendor_of` edges between `bigquery.vendor` and existing `financial.merchant` nodes so every transaction inherits the vendor's organization context.
Settings
| Key | Type | Default | Description |
|---|---|---|---|
| `project_id` | string | — | GCP project the query bills to. Must be the service account's own project, or a project on which the service account has been granted BigQuery Job User. |
| `location` | string | US | BigQuery dataset location. Mismatches surface as location errors at job submit time. |
| `sql` | string | — | The query to ingest. Standard SQL only. The connector wraps it with the delta filter when one is configured. |
| `delta_column` | string | "" | Column tracking "what we've already ingested." Next sync filters for rows strictly greater than the stored cursor. |
| `delta_mode` | list | timestamp | One of timestamp / numeric / none. |
| `node_type` | string | bigquery.row | Free-form type tag. Convention: prefix with `bigquery.` so traversals can isolate warehouse-sourced nodes. |
| `name_column` | string | "" | Column whose value becomes the display name. Falls back to external id when blank. |
| `external_id_column` | string | id | Dedup key. Upserts on (workspace, node_type, properties.bigquery_external_id). |
| `edge_mappings` | string[] | [] | Cross-domain edge declarations (see above). |
| `max_rows_per_sync` | number | 100,000 | Hard cap on rows per run. |
| `cost_cap_bytes` | number | 1 GiB | Per-sync ceiling on bytes billed; dry-run aborts the job if the estimate exceeds this. |
| `use_query_cache` | boolean | true | Pass `use_query_cache=True` to BigQuery. Cached results hide upstream changes; turn it off when the source mutates within the cache TTL. |
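As an illustration of how `sql`, `delta_column`, `delta_mode`, and `max_rows_per_sync` could interact, the sketch below wraps a user query in a delta filter. The wrapping shape is an assumption, not the connector's documented SQL: the `wrap_with_delta` helper, the `src` alias, the `@cursor` named parameter, and the `ORDER BY` (added so that a row cap never skips rows below the next cursor) are all hypothetical.

```python
import re
from typing import Any, Dict, Optional, Tuple

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def wrap_with_delta(
    sql: str,
    *,
    delta_column: str = "",
    delta_mode: str = "timestamp",
    cursor: Optional[Any] = None,
    max_rows: int = 100_000,
) -> Tuple[str, Dict[str, Any]]:
    """Wrap a Standard SQL query with a strictly-greater-than delta filter."""
    if delta_mode not in ("timestamp", "numeric", "none"):
        raise ValueError(f"unknown delta_mode: {delta_mode!r}")
    use_delta = bool(delta_column) and delta_mode != "none"
    if use_delta and not _IDENTIFIER.match(delta_column):
        # Refuse anything that is not a plain column name before interpolating it.
        raise ValueError(f"invalid delta_column: {delta_column!r}")

    base = sql.strip().rstrip(";").rstrip()
    query = f"SELECT * FROM (\n{base}\n) AS src"
    params: Dict[str, Any] = {}
    if use_delta:
        if cursor is not None:
            # First sync has no cursor yet, so no filter is applied.
            query += f"\nWHERE src.`{delta_column}` > @cursor"
            params["cursor"] = cursor
        query += f"\nORDER BY src.`{delta_column}`"
    query += f"\nLIMIT {int(max_rows)}"
    return query, params
```

The cursor travels as a query parameter rather than being spliced into the SQL text, which avoids quoting differences between timestamp and numeric cursors.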
Authentication
Tenant-supplied service-account JSON. Paste it into the connection wizard; Oxagen encrypts it at rest. The connector requests only the bigquery.readonly scope.
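For readers wiring up something similar themselves, a read-only client can be built from a pasted key roughly as shown below. This is a sketch under assumptions: `parse_service_account_json` and `build_client` are hypothetical helpers, the up-front field validation is not something the connector is documented to do, and encryption at rest is out of scope. The `google-cloud-bigquery` and `google-auth` packages are imported lazily so the validation helper works without them.

```python
import json
from typing import Any, Dict

READONLY_SCOPE = "https://www.googleapis.com/auth/bigquery.readonly"
# Fields google-auth needs in order to sign token requests.
REQUIRED_FIELDS = ("client_email", "private_key", "token_uri")


def parse_service_account_json(raw: str) -> Dict[str, Any]:
    """Parse a pasted key and fail early if it is not a service-account key."""
    info = json.loads(raw)
    if not isinstance(info, dict) or info.get("type") != "service_account":
        raise ValueError("expected a service-account JSON key")
    missing = [field for field in REQUIRED_FIELDS if not info.get(field)]
    if missing:
        raise ValueError(f"service-account JSON is missing: {', '.join(missing)}")
    return info


def build_client(raw: str, project_id: str, location: str = "US"):
    """Build a BigQuery client restricted to the read-only scope."""
    from google.cloud import bigquery
    from google.oauth2 import service_account

    info = parse_service_account_json(raw)
    credentials = service_account.Credentials.from_service_account_info(
        info, scopes=[READONLY_SCOPE]
    )
    return bigquery.Client(project=project_id, credentials=credentials, location=location)
```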
Cost guardrails
Every sync dry-runs the query first via `client.query(..., job_config=QueryJobConfig(dry_run=True))` and reads `total_bytes_processed` from the estimate. If the estimate exceeds `cost_cap_bytes`, the run aborts with a clean error before any real bytes are billed. The same cap is also passed as `maximum_bytes_billed` to the real query, so even a wildly inaccurate dry-run estimate can't blow past the cap.
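The two-step guard can be sketched with the `google-cloud-bigquery` client as follows. The helper names `enforce_cost_cap` and `run_guarded_query` and the `CostCapExceeded` exception are hypothetical, and disabling the query cache for the dry run is an assumption made so the estimate reflects a full scan. The library import is deferred so the cap check can be exercised on its own.

```python
from typing import Any, Dict, List


class CostCapExceeded(RuntimeError):
    """Raised when the dry-run estimate is above the configured cap."""


def enforce_cost_cap(estimated_bytes: int, cost_cap_bytes: int) -> None:
    if estimated_bytes > cost_cap_bytes:
        raise CostCapExceeded(
            f"estimated {estimated_bytes} bytes exceeds cap of {cost_cap_bytes} bytes"
        )


def run_guarded_query(
    client: Any,
    sql: str,
    *,
    cost_cap_bytes: int = 1024**3,  # 1 GiB, matching the documented default
    use_query_cache: bool = True,
) -> List[Dict[str, Any]]:
    from google.cloud import bigquery

    # Step 1: dry run. Nothing is billed; BigQuery only returns the estimate.
    dry_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_job = client.query(sql, job_config=dry_config)
    enforce_cost_cap(dry_job.total_bytes_processed, cost_cap_bytes)

    # Step 2: real run, with the same cap enforced server-side as a backstop.
    real_config = bigquery.QueryJobConfig(
        maximum_bytes_billed=cost_cap_bytes,
        use_query_cache=use_query_cache,
    )
    rows = client.query(sql, job_config=real_config).result()
    return [dict(row.items()) for row in rows]
```

Because `maximum_bytes_billed` is enforced by BigQuery itself, the second step fails the job rather than billing past the cap even when the estimate was too low.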