Lensing the Climate Policy Landscape
From March 22-26 we participated in the Climate Hackathon. Packed into just five days, it was an overload of work, but we collaborated with two devs we met via the DevPost network. It was fruitful: together we came up with a workable product that would surface insights for researchers.
It was a ‘hackathon for good’, astutely organized by Stratiteq, a data analytics company in Sweden, and sponsored by Microsoft Western Europe. With the subtitle “Hack the climate - for a sustainable world!” the program listed challenges, submitted by non-profits looking for solutions, divided into Carbon, Ecosystem, Waste, and Water categories. One project caught our interest:
Climate Policy Radar - Digging deep into climate policy: This challenge involves developing solutions to querying full text documents in the global climate legislation database.
The focus on legal texts from around the world overlaps strongly with our own work, and we regard it as fundamentally important. This text is from our submission:
Laws are the software of society. They enable it to function. Laws are code, written by programmers (aka lawyers and legislators).
On a global level there’s a lot of duplication in legal texts across countries. There’s also a lot of embedded knowledge. That can be useful, especially in emerging legal branches such as climate policy.
We want citizens to be able to reuse work that has already been created and proven. In order to do that we need to extract the knowledge from legal policies and render it in human readable form. As our challenge mentor says, we want to provide the evidence “for evidence-based policy making”.
We named our project PoliGrok.
What it does
PoliGrok is a tool to analyze a collection of policy documents. As material we were provided with a couple of thousand policies, from countries around the world, in various languages. Our task was to dig into the full text of the documents and find answers to researchers’ questions.
Climate Policy Radar (CPR) had located thousands of policy documents. A team of editors had laboriously created summaries and added metadata, and this was stored in a spreadsheet that contained links to the docs. There was no database. The docs were scattered around countries’ government websites in PDF and HTML format. In order to index and analyze them, the files had to be retrieved. That required infrastructure.
How it coalesced
As we’re not DevOps types, we initially looked for a collaborator on the systems level. We found Lasse, in Roskilde, Denmark, who was motivated and experienced with the Microsoft stack and Azure. Since Microsoft was the main sponsor, anything we built could be hosted courtesy of an Azure pass. He and Abhi, connecting from Hong Kong, bent their heads over getting Ambar, our chosen search engine, up and running. It was not simple.
On Monday night, via the Discord, another hacker contacted us: Laurence, a data scientist and full-stack dev based in London, who was also eager to take on the CPR challenge. We invited him on board. Then things started falling into place.
Laurence wrote a ‘quick’ Python script to fetch all the PDF files. We figured those would be sufficient for our Proof of Concept so we left scraping HTML web pages for another day.
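A minimal sketch of that kind of fetch script, assuming the metadata spreadsheet is exported as a CSV with a `document_url` column (the column name and file layout are our illustration here, not CPR’s actual schema):

```python
import csv
import pathlib
import urllib.request

def fetch_pdfs(csv_path: str, out_dir: str = "pdfs") -> None:
    """Download every PDF linked in the metadata CSV into out_dir."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            url = row.get("document_url", "")
            if not url.lower().endswith(".pdf"):
                continue  # leave HTML pages for another day
            try:
                urllib.request.urlretrieve(url, out / f"doc_{i}.pdf")
            except OSError as err:
                print(f"skipped {url}: {err}")
```

Skipping anything that isn’t a PDF keeps the proof of concept simple, and logging failures instead of crashing matters when the links point at a few thousand government websites of varying reliability.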
Meanwhile Lasse had set up multiple instances on Azure, doing trial and error on endless configurations, but none of them worked. On Wednesday evening, Laurence figured out the right combination of ports that had to be open for everything to run. With a collective sigh of relief, the retrieved docs were uploaded, the search engine indexed through the night, and on Thursday morning we had a working solution.
In our discussions about what to build on top of the search engine, one idea caught everybody’s fancy: Abhi suggested providing a list of docs that are related to each other and laying them out in a timeline. Typically, laws are implemented in stages, e.g. a strategy/plan, a directive, an act, an amendment, a ratification, and/or an announcement. We all agreed that a timeline would be informative.
In our Signal chat Lasse said, “It could also be a graph with a directive maybe as the top node”. That clicked.
Everybody is aware that these climate policies influence each other across the globe. All laws have antecedents and legal precedents. If we could map these relations within the data, we could render a graph for a document that resembles a family tree. Now that would be illuminating!
These documents can be logically represented as a graph (a Directed Acyclic Graph, to be specific) either by time (as shown in our demo) or with other contextual relationships. The logical representation of these documents can also be stored in a graph database – file references will be the nodes, the relationships will be the edges, and the dates can be attributes.
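As an illustration of that representation, here is a small sketch of how such a document DAG could be modeled in memory and flattened into the date-ordered timeline shown in our demo. The file names, dates, and stage relationships are hypothetical examples, not real entries from the CPR collection:

```python
from dataclasses import dataclass, field

@dataclass
class DocNode:
    file_ref: str   # node identity: a reference to the stored file
    date: str       # attribute: ISO date of the document
    children: list = field(default_factory=list)  # edges to later stages

def timeline(root: DocNode) -> list:
    """Flatten the DAG into a date-ordered list for a timeline view."""
    seen, out = set(), []
    def walk(node: DocNode) -> None:
        if node.file_ref in seen:
            return  # a node can be reachable via several edges
        seen.add(node.file_ref)
        out.append((node.date, node.file_ref))
        for child in node.children:
            walk(child)
    walk(root)
    return sorted(out)  # ISO dates sort chronologically as strings

# A hypothetical chain of stages: strategy -> act -> amendment
strategy = DocNode("climate_strategy.pdf", "2015-03-01")
act = DocNode("climate_act.pdf", "2017-06-15")
amendment = DocNode("act_amendment.pdf", "2019-11-02")
strategy.children.append(act)
act.children.append(amendment)
```

In a graph database the same shape would hold: file references as nodes, the stage relationships as edges, and the dates as node attributes.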
Adding a graph database into the stack was not something we could do in a day. Kit mocked up a user interface that, after a search, displayed a timeline for a law showing its stages. We used this in the demo video and explained how visually mapping the relations for a doc would provide insights for researchers.
Friday was the deadline day and all the teams, comprising 420 registered participants from 51 different countries, had to submit their solution before 17:00 CET.
In the morning we collaboratively edited the submission text on a Pirate Pad, wrote a screenplay, and gathered remotely at noon to make the required 3 minute demo video. We submitted before the deadline and then took a well deserved weekend off. On Sunday Stratiteq would broadcast the final stream where the winning projects would be announced.
Our submission didn’t win a prize, but it was gratifying that it synced with the other groups’ efforts on the Climate Policy Radar challenge. All the teams perceived the interrelations between the docs as a consequential aspect of the policy collection, and their solutions addressed that in various ways.
Lastly, we had a great time hacking together. It was intense work but a spirited and fruitful collaboration.
What’s next for PoliGrok
PoliGrok forms the bedrock for analyzing these documents. Using complex PDF extraction and OCR techniques, it makes these PDFs and images machine readable, easily searchable, and ready for annotation with tags. This throws the doors open for further analysis, both by humans and by machines using advanced ML and NLP techniques.
We believe this project, in a future form, could fight greenwashing not only at the communication level, but all the way down to the implementation of laws. Next steps:
Additional metadata and tags for each document
More documents, including the HTML pages
An expanded UI with the timeline/graph feature
An actionable feed to the end user about processing errors, so that problematic documents can be addressed
At some point we’ll contact Climate Policy Radar to check in and see how it’s going. Perhaps we can contribute more ideas, tools, or hacks.