Investigraph
Research and implementation of an ETL process for a curated and up-to-date public and open-source data catalog of frequently used datasets in investigative journalism.
Abstract
The result is an ETL framework that lets research teams build their own data catalog as easily as possible, with little coding, and incrementally update the individual datasets (e.g., through automated web scraping). Data import and updates should be possible without programming knowledge, through a frontend. However, the first step of an ETL pipeline (extraction) may still require some coding for a given dataset, because each source is different and may need special parsing. A utility library that provides adapters for common data inputs (JSON, CSV, web APIs) will partially address this.
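To illustrate the adapter idea for the extraction step, here is a minimal sketch using only the Python standard library. It is not the investigraph API: the names `extract_csv`, `extract_json`, `ADAPTERS`, and `extract` are hypothetical and chosen for illustration only. Each adapter turns raw source text into a stream of uniform records (dictionaries with string values) that later pipeline steps could consume.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator

# A uniform record: every adapter yields dicts with string values.
Record = Dict[str, str]


def extract_csv(text: str) -> Iterator[Record]:
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))


def extract_json(text: str) -> Iterator[Record]:
    """Yield one dict per object of a top-level JSON array."""
    for item in json.loads(text):
        yield {key: str(value) for key, value in item.items()}


# Hypothetical registry mapping an input format to its adapter.
ADAPTERS: Dict[str, Callable[[str], Iterator[Record]]] = {
    "csv": extract_csv,
    "json": extract_json,
}


def extract(fmt: str, text: str) -> Iterator[Record]:
    """Dispatch to the adapter registered for `fmt`."""
    try:
        adapter = ADAPTERS[fmt]
    except KeyError:
        raise ValueError(f"no adapter for format {fmt!r}") from None
    return adapter(text)
```

Under these assumptions, a web API source would only need to fetch the response body and pass it to the JSON adapter, and a source with unusual parsing needs would register its own function in the same way.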
Value for investigative research teams
- Standardized process to convert different datasets into a uniform and thus comparable format
- Control of this process for non-technical people
- Creation of your own (internal) data catalog
- Regular, automatic updates of the data
- A growing community that makes more and more datasets accessible
- Access to a public (open source) data catalog operated by investigativedata.io
GitHub repositories
- investigraph - The meta repo from which this page is rendered
- investigraph-etl - The main codebase for the ETL pipeline framework, based on prefect.io
- investigraph-datasets - Example dataset configurations
- investigraph-site - Landing page for investigraph (Next.js app)
- investigraph-api - Public API instance to use as a test playground
- runpandarun - A simple interface written in Python for reproducible I/O workflows around tabular data via pandas
- ftmq - An attempt at a followthemoney query DSL
- ftmstore-fastapi - Lightweight API that exposes an ftm store via a public endpoint