
I built an experimental orchestration language for reproducible data science called 'T'

Hey r/datascience,

I've been working on a side project called T (or tlang) for the past year or so, and I've just tagged the v0.51.2 "Sangoku" public beta. The short pitch: it's a small functional DSL for orchestrating polyglot data science pipelines, with Nix as a hard dependency.

What problem it's trying to solve

The "works on my machine" problem for data science is genuinely hard. R and Python projects accumulate dependency drift quietly until something breaks six months later, or on someone else's machine. `uv` for Python is great and {renv} helps in R-land, but neither crosses language boundaries cleanly, and neither pins system dependencies. Most orchestration tools are language-specific and take extra work to span multiple languages.

T's thesis is: what if reproducibility were mandatory by design? You can't run a T script without wrapping it in a pipeline {} block. Every node in that pipeline runs in its own Nix sandbox. DataFrames move between R, Python, and T via Apache Arrow IPC. Models move via PMML. The environment is a Nix flake, so it's bit-for-bit reproducible.

What it looks like

    p = pipeline {
      -- Native T node
      data = node(command = read_csv("data.csv") |> filter($age > 25))

      -- rn defines an R node; pyn() a Python node
      model_r = rn(
        -- Python or R code gets wrapped inside a <{}> block
        command = <{ lm(score ~ age, data = data) }>,
        serializer = ^pmml,
        deserializer = ^csv
      )

      -- Back to T for predictions (which could just as well have been
      -- done in another R node)
      predictions = node(
        command = data |> mutate($pred = predict(data, model_r)),
        deserializer = ^pmml
      )
    }

    build_pipeline(p)

The ^pmml, ^csv etc. are first-class serializers from a registry. They handle data interchange contracts between nodes so the pipeline builder can catch mismatches at build time rather than at runtime.
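To make the build-time check a bit more concrete, here is a rough Python sketch of the idea. It is not T's implementation: the `Node` class, the `REGISTRY` set, and `check_contracts` are hypothetical names invented for this illustration, and it assumes each node declares one output format plus the format it expects from each upstream node.

```python
from dataclasses import dataclass, field

# Hypothetical set of registered formats; T's real registry is not shown here.
REGISTRY = {"arrow", "csv", "pmml"}


@dataclass
class Node:
    name: str
    serializer: str = "arrow"  # format this node writes its output in
    # upstream node name -> format this node expects to read from it
    deserializers: dict = field(default_factory=dict)


def check_contracts(nodes):
    """Return a list of mismatch messages; an empty list means every edge agrees."""
    by_name = {n.name: n for n in nodes}
    problems = []
    for node in nodes:
        if node.serializer not in REGISTRY:
            problems.append(f"{node.name}: unknown serializer {node.serializer}")
        for upstream, expected in node.deserializers.items():
            if upstream not in by_name:
                problems.append(f"{node.name}: unknown upstream node {upstream}")
                continue
            produced = by_name[upstream].serializer
            if produced != expected:
                problems.append(
                    f"{upstream} -> {node.name}: writes {produced}, reader expects {expected}"
                )
    return problems


# One consistent pipeline and one with a mismatched edge.
good = [
    Node("data", serializer="csv"),
    Node("model_r", serializer="pmml", deserializers={"data": "csv"}),
    Node("predictions", deserializers={"data": "csv", "model_r": "pmml"}),
]
bad = [
    Node("data", serializer="csv"),
    Node("model_r", serializer="pmml", deserializers={"data": "arrow"}),
]

print(check_contracts(good))
print(check_contracts(bad))
```

The point of the sketch is only that the check needs nothing but the declared formats, so it can run before any node executes.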

What's in the language itself

  • Strictly functional: no loops, no mutable state, immutable by default (:= to reassign, rm() to delete)
  • Errors are values, not exceptions. |> short-circuits on errors; ?|> forwards them for recovery
  • NSE column syntax ($col) inside data verbs, heavily inspired by dplyr
  • Arrow-backed DataFrames, native CSV/Parquet/Feather I/O
  • A native PMML evaluator so you can train in Python or R and predict in T without a runtime dependency
  • A REPL for interactive exploration
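
For readers who haven't met errors-as-values before, the two pipe operators in the list above can be approximated in Python like this. It is a loose analogy under my own assumptions, not how T implements them: `Err`, `pipe`, `pipe_forward`, `safe_div`, and `recover` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Err:
    """An error carried as an ordinary value instead of being raised."""
    message: str


def pipe(value: Any, fn: Callable[[Any], Any]) -> Any:
    """Rough analogue of |> : if value is an Err, skip fn and pass the Err along."""
    if isinstance(value, Err):
        return value
    return fn(value)


def pipe_forward(value: Any, fn: Callable[[Any], Any]) -> Any:
    """Rough analogue of ?|> : always call fn, so it can inspect an Err and recover."""
    return fn(value)


def safe_div(x):
    # Returns an Err value rather than raising ZeroDivisionError.
    return Err("division by zero") if x == 0 else 10 / x


def recover(v):
    # Replace any error with a fallback value; pass normal values through.
    return 0.0 if isinstance(v, Err) else v


# The lambda never runs: the Err from safe_div short-circuits the rest of the chain.
result = pipe(pipe(0, safe_div), lambda v: v + 1)

# The forwarding pipe hands the Err to recover, which substitutes a default.
fixed = pipe_forward(result, recover)
```

Here `result` is still the `Err` value after the chain, and `fixed` is the fallback produced by `recover`.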

What it's missing

  • Users ;)
  • Julia support (but it's planned)

What I'm looking for

Honest feedback, especially:

  • Are there obvious workflow patterns that the pipeline model doesn't support?
  • Any rough edges in the installation or getting-started experience?

You can try it with:

    nix shell github:b-rodrigues/tlang
    t init --project my_test_project

(Requires Nix with flakes enabled — the Determinate Systems installer is the easiest path if you don't have it.)

Repo: https://github.com/b-rodrigues/tlang
Docs: https://tstats-project.org

Happy to answer questions here!

submitted by /u/brodrigues_co