
Research and implementation of an ETL process for a curated, up-to-date, public, open-source data catalog of datasets frequently used in investigative journalism.

Head over to the tutorial


The result is an ETL framework that lets research teams build their own data catalog with as little coding as possible and incrementally update the individual datasets (e.g., through automated web scraping). The import and update process should be usable without programming knowledge, via a frontend. However, the first step (extraction) of an ETL pipeline may still require some coding for a given dataset, since each source is individual and may need special parsing. This is partially addressed by a util library that provides adapters for common data inputs (json, csv, web-api).
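To illustrate what such an input adapter could look like, here is a minimal sketch using only the Python standard library. The function name `extract_records` and its signature are illustrative assumptions for this example, not investigraph's actual API; a real adapter would also cover web-api sources and streaming.

```python
import csv
import io
import json


def extract_records(raw: str, fmt: str) -> list[dict]:
    """Parse raw source data into a uniform list of record dicts.

    `fmt` selects a parser for a common input format; individual
    sources may still need custom parsing on top of this.
    NOTE: illustrative sketch, not investigraph's real interface.
    """
    if fmt == "csv":
        # Each CSV row becomes a dict keyed by the header row.
        return list(csv.DictReader(io.StringIO(raw)))
    if fmt == "json":
        data = json.loads(raw)
        # Normalize a single object into a one-element list.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {fmt}")


rows = extract_records("name,country\nACME Corp,us\n", "csv")
print(rows)  # [{'name': 'ACME Corp', 'country': 'us'}]
```

Dispatching on a declared format keeps the per-source code small: only sources that deviate from these common shapes need bespoke extraction logic.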

Value for investigative research teams

  • Standardized process to convert different datasets into a uniform and thus comparable format
  • Control of this process for non-technical users
  • Creation of their own (internal) data catalog
  • Regular, automatic updates of the data
  • A growing community that makes more and more datasets accessible
  • Access to a public (open-source) data catalog operated by

GitHub repositories

Supported by

Media Tech Lab Bayern batch #3