Measures for Justice Releases a Powerful Data Collection Tool
Oakland, California (May 31, 2018) – Today, Measures for Justice (MFJ) released an open source software tool for extracting data from PDFs. Code-named Textricator, the tool frees data trapped inside PDFs.
Measures for Justice has developed Textricator over the last two years and has used it to extract tens of thousands of pages of data. Textricator doesn’t require programming skills; rather, the user describes the structure of the PDF and Textricator handles the rest.
Textricator can process just about any text-based PDF format: not just tables, but also complex reports with wrapping text and detail sections, such as those generated by tools like Crystal Reports. You tell Textricator the attributes of the fields you want to collect, and it chomps through the document, collecting and writing out your records.
Measures for Justice works to bring transparency to the criminal justice system by collecting existing county-level criminal justice data from arrest to post-conviction, and publishing them on a free, online Data Portal. Textricator is an essential part of that process.
“At times we’ve needed to collect data from PDFs when alternative sources weren’t available. We evaluated great open source solutions like Tabula, but it just couldn’t handle the structure of some of the PDFs we needed to scrape. So we built Textricator and it has been incredibly valuable for us. It’s both flexible and powerful and has cut the time we spend to process large datasets from days to hours,” said Andrew Branch, Director of Technology for Measures for Justice.
MFJ’s Data Evangelist, Steve Spiker, and Senior Developer, Stephen Byrne, announced Textricator at today’s Code for America Summit. MFJ is committed to transparency and knowledge sharing, which includes making its software available to anyone trying to free and share data publicly.
Textricator is an open source project available on GitHub at measuresforjustice/textricator and is released under the GNU Affero General Public License Version 3.