# NFL-DATA
Sports data extraction, load, and transformation using Python and dbt.
## Project setup
Set up and activate a virtual environment with the following commands:

```shell
python -m venv venv

# Windows
./venv/Scripts/activate

# Linux
source venv/bin/activate
```
Install the required packages:

```shell
python -m pip install -r requirements.txt
```
## Environment variables
Secrets management is handled with Infisical. Install the Infisical CLI to inject secrets at runtime.
Otherwise, use a `.env` file to specify the required variables. See `.env.example`.
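When running without Infisical, required variables can be read from the process environment with a fail-fast check. A minimal sketch; `require_env` is an illustrative helper, not part of NFL-DATA, and the variable name is a placeholder rather than one from `.env.example`:

```python
import os

def require_env(name: str) -> str:
    """Return a required environment variable, or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Failing fast at startup surfaces a missing secret immediately instead of partway through a load.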
## Command reference
### NFL-DATA CLI
#### Infisical and secrets
Using Infisical for secrets management is recommended. Commands can be wrapped in an Infisical call:

```shell
infisical run --command "python app.py ..."
```
#### Load strategy
All commands have the `--load-strategy` (`-ls`) option to specify how NFL-DATA will load new JSON data into the warehouse:

- `replace` clears out all previously loaded JSON data before inserting.
- `add` loads JSON data without any additional checks. This can introduce multiple entries for the same object.
- `skip` only loads JSON data if no other data exists for a given object.
- `day_replace` replaces JSON data if the same object was previously loaded today.
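The semantics of the four strategies can be sketched as a small dispatcher. This is an illustration of the behavior described above, not NFL-DATA's actual implementation; the row shape and `loaded_on` field are assumptions:

```python
from datetime import date

def rows_to_write(strategy, new_rows, existing_rows, today=None):
    """Decide which new rows to insert, mutating existing_rows as needed.

    existing_rows: previously loaded rows for the same object, each a dict
    with a 'loaded_on' date (hypothetical schema for illustration).
    """
    today = today or date.today()
    if strategy == "replace":
        # Clear all previously loaded rows, then insert the new ones.
        existing_rows.clear()
        return new_rows
    if strategy == "add":
        # Insert unconditionally; duplicates for the same object are possible.
        return new_rows
    if strategy == "skip":
        # Only insert if nothing was loaded for this object before.
        return new_rows if not existing_rows else []
    if strategy == "day_replace":
        # Drop rows loaded today for this object, then insert.
        existing_rows[:] = [r for r in existing_rows if r["loaded_on"] != today]
        return new_rows
    raise ValueError(f"unknown load strategy: {strategy}")
```

`day_replace` is useful for idempotent daily jobs: re-running a load on the same day overwrites that day's data without touching earlier history.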
#### Modules
There are three available modules:

- `nfl` for NFL data, including extraction from ESPN's API and modeling.
- `nba` for NBA data, including extraction from ESPN's API and modeling.
- `sms` for Super Mario Sluggers statfile processing.
All commands and subcommands can be followed by the `--help` flag to describe commands, arguments, and options:

```shell
python app.py nfl load-game --help
```
### dbt
To use dbt commands, first change directory into the `dbt/` subdirectory:

```shell
cd dbt/
```
From here, all dbt commands can be accessed. Take a look at the dbt Command Reference page for details.
Running a full build models all raw data loaded since the last full build:

```shell
dbt build
```
## Core architecture
Data pipelines in this repository adopt the following basic structure:
```mermaid
graph TD;
    A@{ shape: cloud, label: "External API"}-->B[("S3-compatible storage")]-->|to landing zone table|C[("Database")];
    D@{ shape: processes, label: "Manual file upload"}-->B
    C-->E("dbt modeling")-->C;
```
Data is ingested from an external API or from a manual file upload into an S3-compatible storage service (AWS S3, MinIO, etc.) that acts as a caching layer. The raw data, typically JSON, is then loaded from S3 into a landing zone table in a database.
The landing zone table has the following structure:
| object_path | object_type | raw_data |
|---|---|---|
| Path to object in S3 | Semantic categorization for the object (e.g. `athlete`, `event`) | Raw data as stored in S3 |
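A landing-zone record for a cached S3 object might be assembled as below. This is a sketch under assumed names (`landing_row` and the sample path are hypothetical); the real loader and column types may differ:

```python
import json

def landing_row(object_path: str, object_type: str, payload: dict) -> dict:
    """Build one landing zone record: where the object lives in S3,
    what kind of object it is, and the raw JSON as cached in S3."""
    return {
        "object_path": object_path,
        "object_type": object_type,
        "raw_data": json.dumps(payload),
    }

# Example: a cached athlete object becomes one landing zone row.
row = landing_row("athletes/123.json", "athlete", {"id": 123})
```

Keeping `raw_data` as the unmodified JSON means all normalization happens downstream in dbt, so a model change never requires re-fetching from the API.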
By caching in S3 and loading the raw data into the database, it can be processed, normalized, and modeled in-database using dbt.