QSIM — Orchestration & Sentiment Pipeline

Quant signal execution platform that combines model outputs with Reddit‑based sentiment, powered by Prefect ETL flows, LLM sentiment analysis, and cost‑efficient ClickHouse data pipelines.
Prefect · FastAPI · ClickHouse · ETL · Reddit · LLM
2024 - 2025

Overview

QSIM (Quant Signal Manager) is an internal execution tool for working with trading signals. It brings together:

  • Manually trained quant models
  • Social‑media sentiment signals (Reddit)
  • A fast data layer on ClickHouse

so researchers and downstream systems can query, combine, and act on enriched signals at scale.

My role was to design and implement the end‑to‑end data pipelines: from Reddit ingestion and LLM‑based sentiment analysis to asset tagging and storage in ClickHouse, plus the core libraries that make this data easy to consume.


Data Pipelines & Orchestration

  • Designed Prefect 2 flows for the entire Reddit → sentiment → signals pipeline.
  • Implemented ETL steps for:
    • fetching and normalizing Reddit posts and comments,
    • running LLM‑based sentiment classification,
    • performing asset recognition via keyword/tag rules,
    • writing clean, analytics‑ready records into ClickHouse.
  • Exposed these flows as reusable building blocks for different assets and universes, so new signal sets could be onboarded with minimal code.
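The asset-recognition step can be sketched as simple keyword/tag matching. This is a hypothetical illustration: the keyword table, ticker names, and function name are invented for the example and are not the actual QSIM rules.

```python
# Hypothetical sketch of the keyword/tag asset-recognition ETL step.
# The keyword map and tickers below are illustrative, not real QSIM rules.

ASSET_KEYWORDS = {
    "TSLA": {"tesla", "tsla"},
    "GME": {"gamestop", "gme"},
    "BTC": {"bitcoin", "btc"},
}

def tag_assets(text: str) -> set[str]:
    """Return the set of asset tickers whose keywords appear in the text."""
    words = set(text.lower().split())
    return {ticker for ticker, keywords in ASSET_KEYWORDS.items()
            if words & keywords}
```

A step like this runs after sentiment classification, so each record lands in ClickHouse already tagged with the assets it mentions.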

Cost‑Efficient Reddit Parsing

One of the main problems in earlier systems was the reliance on large proxy pools to collect Reddit data, which was expensive and brittle.

I designed and implemented a new Reddit parsing system that eliminated proxy costs entirely:

  • Switched from heavy proxy usage to legitimate bot accounts:
    • set up multiple Reddit bot accounts that respect Reddit's API policies,
    • used their combined rate limits instead of rotating proxies.
  • Implemented a custom lightweight load balancer that:
    • tracks each account’s current usage and limits,
    • decides which account should handle the next request,
    • avoids hitting per‑account rate limits while keeping throughput high.
  • Result:
    • $0 proxy spend,
    • more predictable behavior and fewer parsing failures,
    • simpler infrastructure to operate.
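The balancer logic above can be sketched as a small class that tracks each account's requests in a sliding window and routes the next request to the account with the most remaining quota. This is a minimal sketch under assumed parameters: the class name, window size, and per-window limit are illustrative, not Reddit's actual limits or the production implementation.

```python
import time

class AccountBalancer:
    """Illustrative multi-account load balancer: pick the bot account
    with the most remaining quota in a sliding rate-limit window.
    Limit and window values here are hypothetical."""

    def __init__(self, accounts, limit_per_window=60, window_seconds=60):
        self.limit = limit_per_window
        self.window = window_seconds
        # account name -> timestamps of requests inside the current window
        self.usage = {name: [] for name in accounts}

    def _remaining(self, name, now):
        # Drop timestamps that have aged out of the sliding window.
        self.usage[name] = [t for t in self.usage[name] if now - t < self.window]
        return self.limit - len(self.usage[name])

    def acquire(self, now=None):
        """Return the account that should handle the next request,
        or None if every account is currently at its limit."""
        now = time.monotonic() if now is None else now
        name = max(self.usage, key=lambda n: self._remaining(n, now))
        if self._remaining(name, now) <= 0:
            return None
        self.usage[name].append(now)
        return name
```

Because accounts are chosen by remaining quota rather than round-robin, a temporarily throttled account is naturally skipped until its window clears.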

Multi‑Repo Core & Data Libraries

QSIM was split into several repositories:

  • qsim-core — shared domain logic and interfaces
  • qsim-executor — execution tooling and orchestration
  • qsim-data — data access, models, and ETL utilities

I was responsible for the core and data modules and for making them integrate cleanly across repos:

  • Set up a multi‑repo library layout where qsim-core is imported by both qsim-executor and qsim-data.
  • Defined clear, reusable abstractions for:
    • signal definitions and metadata,
    • dataset schemas and transformations,
    • shared utilities used by flows and services.

High‑Performance Data Access (ClickHouse + SQLAlchemy)

A key goal was to make it easy for downstream tools and notebooks to work with large volumes of signal data.

  • Implemented data access interfaces using:
    • sqlalchemy for relational sources,
    • clickhouse-connect for ClickHouse.
  • Exposed simple methods to retrieve:
    • raw or aggregated signals,
    • time‑window slices,
    • joined model + sentiment data as Pandas DataFrames or PyArrow Tables, ready for research and backtesting.
  • Tuned schemas and queries so large result sets could be fetched efficiently without manual boilerplate in every project.

Impact

  • Cost Optimization: Eliminated proxy costs entirely by moving to a multi‑account Reddit bot strategy with a custom load balancer.
  • Stronger Signals: Reliable sentiment and asset tagging pipeline feeding ClickHouse with clean, queryable data that can be combined with quant model outputs.
  • Developer Experience: Consistent, multi‑repo core/data libraries and simple data access APIs made it much easier to build new flows, signals, and research tooling on top of QSIM.