tracing-upstream-lineage
上流のデータリネージを追跡し、データの発生源と履歴パスを分析して、「データはどこから来たのか」という問いに答えます。
npx skills add astronomer/agents --skill tracing-upstream-lineageBefore / After 効果比較
1 组データソースが不明な問題に直面すると、手動で上流の血縁関係を追跡するのは時間と労力がかかり、データの依存関係や影響を迅速に理解することが困難で、分析効率が低下します。
データのすべての上流ソースと変換パスを簡単に追跡し、データ血縁関係を迅速に理解し、データ品質とコンプライアンスを確保し、分析効率を向上させます。
Upstream Lineage: Sources
Trace the origins of data - answer "Where does this data come from?"
Lineage Investigation
Step 1: Identify the Target Type
Determine what we're tracing:
- Table: Trace what populates this table
- Column: Trace where this specific column comes from
- DAG: Trace what data sources this DAG reads from
Step 2: Find the Producing DAG
Tables are typically populated by Airflow DAGs. Find the connection:
-
Search DAGs by name: Use
af dags listand look for DAG names matching the table nameload_customers->customerstableetl_daily_orders->orderstable
-
Explore DAG source code: Use
af dags source <dag_id>to read the DAG definition- Look for INSERT, MERGE, CREATE TABLE statements
- Find the target table in the code
-
Check DAG tasks: Use
af tasks list <dag_id>to see what operations the DAG performs
On Astro
If you're running on Astro, the Lineage tab in the Astro UI provides visual lineage exploration across DAGs and datasets. Use it to quickly trace upstream dependencies without manually searching DAG source code.
On OSS Airflow
Use DAG source code and task logs to trace lineage (no built-in cross-DAG UI).
Step 3: Trace Data Sources
From the DAG code, identify source tables and systems:
SQL Sources (look for FROM clauses):
# In DAG code:
SELECT * FROM source_schema.source_table # <- This is an upstream source
External Sources (look for connection references):
S3Operator-> S3 bucket sourcePostgresOperator-> Postgres database sourceSalesforceOperator-> Salesforce API sourceHttpOperator-> REST API source
File Sources:
- CSV/Parquet files in object storage
- SFTP drops
- Local file paths
Step 4: Build the Lineage Chain
Recursively trace each source:
TARGET: analytics.orders_daily
^
+-- DAG: etl_daily_orders
^
+-- SOURCE: raw.orders (table)
| ^
| +-- DAG: ingest_orders
| ^
| +-- SOURCE: Salesforce API (external)
|
+-- SOURCE: dim.customers (table)
^
+-- DAG: load_customers
^
+-- SOURCE: PostgreSQL (external DB)
Step 5: Check Source Health
For each upstream source:
- Tables: Check freshness with the checking-freshness skill
- DAGs: Check recent run status with
af dags stats - External systems: Note connection info from DAG code
Lineage for Columns
When tracing a specific column:
- Find the column in the target table schema
- Search DAG source code for references to that column name
- Trace through transformations:
- Direct mappings:
source.col AS target_col - Transformations:
COALESCE(a.col, b.col) AS target_col - Aggregations:
SUM(detail.amount) AS total_amount
- Direct mappings:
Output: Lineage Report
Summary
One-line answer: "This table is populated by DAG X from sources Y and Z"
Lineage Diagram
[Salesforce] --> [raw.opportunities] --> [stg.opportunities] --> [fct.sales]
| |
DAG: ingest_sfdc DAG: transform_sales
Source Details
| Source | Type | Connection | Freshness | Owner |
|---|---|---|---|---|
| raw.orders | Table | Internal | 2h ago | data-team |
| Salesforce | API | salesforce_conn | Real-time | sales-ops |
Transformation Chain
Describe how data flows and transforms:
- Raw data lands in
raw.ordersvia Salesforce API sync - DAG
transform_orderscleans and dedupes intostg.orders - DAG
build_order_factsjoins with dimensions intofct.orders
Data Quality Implications
- Single points of failure?
- Stale upstream sources?
- Complex transformation chains that could break?
Related Skills
- Check source freshness: checking-freshness skill
- Debug source DAG: debugging-dags skill
- Trace downstream impacts: tracing-downstream-lineage skill
- Add manual lineage annotations: annotating-task-lineage skill
- Build custom lineage extractors: creating-openlineage-extractors skill
ユーザーレビュー (0)
レビューを書く
レビューなし
統計データ
ユーザー評価
この Skill を評価