| Abstract: |
Transportation agencies increasingly generate massive streams of operational, ridership, and sensor data, yet the technical effort and human resource costs required to process and analyse these heterogeneous sources often exceed the capacity of small and medium-sized city operators. To address this gap, this paper presents the first steps of ongoing work towards an open-source, modular, three-layer framework that (i) ingests raw datasets while preserving their native formats, (ii) enforces schemas, supports versioning, and performs incremental transformations to produce clean, curated, queryable tables, and (iii) enables interoperable access for downstream modelling, visualisation, and decision-support workflows through SQL engines and programmatic client libraries. The pipeline builds on widely adopted open-source components, including object storage, distributed processing engines, and transactional table formats, and is orchestrated using standard cloud deployment practices, so it can be set up quickly on commodity cloud or on-premises hardware. A real-world evaluation on ten years of public transport data from a Spanish municipal agency, combining Automated Fare Collection and geospatial sources, demonstrates that the framework can support advanced passenger behaviour analytics, including passenger profiling, while remaining scalable, cost-effective, and easy to operate. By providing a reusable, standards-based stack, the framework lowers implementation barriers, mitigates vendor lock-in, and helps agencies turn raw feeds into actionable analytics, fostering data-driven decision-making for improved service quality, resource optimisation, and sustainable public transport planning. |