SSDN Improves Metadata Harvest with Upgrade to Apache Airflow

By Matthew Miguez, Metadata Librarian, FSU Libraries

The Sunshine State Digital Network is proud to announce a new harvest system. The system is a combination of two tools working together: Apache Airflow and manatus.

  • Apache Airflow is an open-source system for managing and monitoring data pipelines.
  • manatus is a configurable metadata harvester and mapper written in Python and maintained by SSDN. It is named after Florida’s most iconic and docile marine mammal, the manatee.
The Airflow dashboard

SSDN began seeing a need for a more robust and flexible harvest process in 2019. The scale and variety of metadata contributed by partners were outpacing our small team’s capacity to process it.

One issue was that the previous harvester was designed to run as one monolithic process. Each partner was harvested and transformed one at a time, in one go. Any issues arising during the process required applying fixes and restarting the harvest from the beginning. Each harvest could take anywhere from 12 to 36 hours of machine time, not including testing, troubleshooting, and bug fixes.

Another issue was that the functions for mapping metadata were rigid and verbose. Mapping from the common library standards (MODS, DC, QDC) was largely a one-size-fits-all affair. Requests for customization or adjustments to the metadata transformation proved difficult to fulfill, due to the size and inflexibility of the maps.

Finally, a high level of technical expertise was required to access the harvest system and initiate the harvest process. This put a hard limit on the number of people who could take part in the process.

Given these challenges, SSDN began the process of looking for a new aggregation platform. We were inspired by PA Digital’s move to Apache Airflow, and decided to follow suit.

We started this process by rewriting and renaming our harvesting utility. The rewritten utility, manatus, handles all of the data tasks: pulling in and transforming our partners’ metadata. We invested the time to build the underlying library on an object-oriented design. This makes adjusting the behavior of harvest functions and transformation maps much simpler to implement.
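To illustrate why an object-oriented design makes per-partner adjustments simpler, here is a minimal sketch in Python. The class and method names (DCMap, PartnerMap, title, rights, transform) and the sample record are hypothetical and are not taken from the manatus codebase. The point is only that a partner-specific map can override a single field while inheriting everything else.

```python
class DCMap:
    """Hypothetical base map for simple Dublin Core records (plain dicts here)."""

    def __init__(self, record):
        self.record = record

    def title(self):
        return self.record.get("title", "").strip()

    def rights(self):
        return self.record.get("rights", "").strip()

    def transform(self):
        # Collect the output of every field method into one mapped record.
        return {"title": self.title(), "rights": self.rights()}


class PartnerMap(DCMap):
    """Hypothetical partner customization: only the rights field changes."""

    def rights(self):
        # Fall back to a default statement when the partner supplies none.
        return super().rights() or "Rights undetermined"


record = {"title": "  Sample photograph "}  # illustrative data only
mapped = PartnerMap(record).transform()
print(mapped)
```

With this kind of structure, a customization request touches one small method in one subclass rather than a large shared mapping function.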

Airflow serves as a launcher and monitor for the manatus processes as well as other necessary steps—updating configurations, cleaning up before and after harvests, and any additional enhancement processes that need to be run.

Airflow’s architecture encourages tasks to be made as small and as atomic as possible, which helps in identifying and troubleshooting problems. Tasks that aren’t dependent on one another can be run in parallel, saving time. The new manatus harvests managed by Airflow are averaging about an hour and a half of machine time to run. Additionally, tasks in Airflow can fail independently without affecting other tasks in the data pipeline; fixes can be applied and a task can be restarted. Finally, Airflow ships with a web-accessible dashboard. Any team member can initiate and monitor a harvest. Errors are clearly communicated and easy to pass along to our developers.
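The benefit of small, independent tasks can be sketched in plain Python. This is not Airflow’s API and not our actual pipeline; it uses only the standard library, and the function names (harvest, run_all) and partner names are hypothetical. It shows the general idea: independent tasks run in parallel, one failure does not stop the others, and only the failed task needs to be rerun after a fix.

```python
from concurrent.futures import ThreadPoolExecutor


def harvest(partner, broken=frozenset()):
    """Hypothetical stand-in for one small harvest task."""
    if partner in broken:
        raise RuntimeError(f"harvest failed for {partner}")
    return f"{partner}: harvested"


def run_all(partners, broken=frozenset()):
    """Run each partner's task independently; collect successes and failures."""
    done, failed = {}, {}
    with ThreadPoolExecutor() as pool:
        futures = {p: pool.submit(harvest, p, broken) for p in partners}
    # Leaving the with-block waits for every task to finish.
    for partner, future in futures.items():
        error = future.exception()
        if error is None:
            done[partner] = future.result()
        else:
            failed[partner] = str(error)
    return done, failed


partners = ["partner_a", "partner_b", "partner_c"]
done, failed = run_all(partners, broken={"partner_b"})
# After a fix, only the failed task is rerun, not the whole harvest.
retried, still_failed = run_all(list(failed))
done.update(retried)
```

In the old monolithic process, the failure for the second partner would have meant restarting everything from the beginning.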

We have one harvest in the bag, and are working through our documentation. We’re excited to devote the time saved to other forthcoming projects.

Relevant repositories
