<!DOCTYPE html>
<html>
<head>
<title>Beyond Open Data</title>
<meta charset="utf-8">
<link rel="stylesheet" href="assets/style.css">
</head>
<body>
<img src="assets/images/matt-slow.gif" id="matt">
<textarea id="source">
---
class: center, middle
# Beyond Open Data
## tinyurl.com/beyond-open-data
Shawn Averkamp (@saverkamp)
Ashley Blewer (@ablwr)
(Matt Miller) (@thisismmiller)
???
Hi Everyone! I'm Shawn and this is Ashley, and here in the corner is Matt Miller, who couldn't join us in person today because he had to go talk about open data in Australia. So we all used to work together at a big public library in New York and because we really liked working together, we decided we wanted to propose a project for Code4Lib. This library had done some really innovative work over the years in creating and releasing open data sets and we had each played some role in creating open data, publishing it, documenting it, and even reusing and remixing it. But at the end of those projects, the data was left live on third-party servers, so we thought that for this project maybe we would try packaging it all up and depositing it at the Internet Archive or somewhere else more secure and discoverable. But as we started talking about the work of packaging and saving it, we started thinking about how few people have actually _used_ open data.
---
class: center, middle
# Open Data: WHO CARES?
# ❤️
???
(Shawn)
and we started feeling like “who cares?” Why bother? Why do we all publish this stuff? Sure it makes for a good press release, and it's a noble enough aim to require it for a grant, but I mean, how many of you have done anything with open data? How many of you want to but don’t know how to start? How many of you don’t know if you want to because you don’t know what open datasets are available or what’s in them? So that really left us at a loss for what to do for this talk. But you all voted for us, so we had to come up with some reasons to care, and here’s what we got.
---
class: center, middle
# .center[ 😔😔😔 ]
# .center[Why care?]
# .center[ 😔😔😔 ]
???
So why do this? I don’t think we have a good answer yet to the why, but we know that as a profession, we haven’t given open data a fair chance for reuse because we haven’t given it the care and thought that we give our other library projects, like websites, catalogs, digitization, digital humanities projects, or digital preservation. But I can tell you why people don't care.
---
# Barriers to entry
## * JSON? XML? MARC??? No way!
## * What even is MODS?
## * What am I looking at? Where did this come from? Why should I care about this?
???
1) Many of our data formats are hard to work with. If it’s not in CSV and can’t be opened in a spreadsheet, most people aren’t going to be able to explore it or work with it.
2) Our library standards are incomprehensible to many people outside libraries (and to many people inside libraries).
3) There’s not enough context around collection datasets, or documentation to help people interpret them.
---
# Who cares?
## Librarians 😼
## .right[ 💻 Developers ]
## People who are both 👾
## .right[ 🐶 People who are neither ]
???
(Ashley)
So, who cares? Or a better question to ask, then, is:
Who should care? Me, you, us, we care, sure, but it is also our stuff and our field, and we are dedicated to these things as a core value.
---
# Who cares?
## Creators 🎨
## .right[ ☺️ Consumers ]
## Educators ✏️
## .right[ 🎯 Public ]
???
(Ashley)
But who would care? Other people who also care or should care are institutions (yours and others), researchers, other librarians (again, yours and others), Digital Humanities people, educators and their students, and general consumers, people who just want to explore. For example, Hackathon participants who want to engage in a project and have fun writing code in a group but need something solid and practical to work on during a short period of time. They can benefit from this data. I think it's important to have an understanding of the types of people who might be interested in your data, while staying aware that you can't know all its potential use cases.
---
class: center, middle
# Experiments
???
(Ashley) So considering all of these reasons not to care, for this project we decided we wanted to experiment with some open datasets to learn more about how to get people to care, including ourselves.
---
# Experiments #1: LC/MARC
# Data creator + consumer

???
(Ashley)
I'll start with a project Matt worked on, so I'll be representing him and his work here. Matt wanted to work with Library of Congress's MARC data and think about how it could be released in a more broadly user-friendly data format. With the data available not as raw MARC but as a table, database, or CSV, people could understand and explore the dataset more easily.
---
# Matt is great


???
(Ashley)
BTW Matt is great, he's been doing a lot of interesting work with this data over the past few months since it was released, thank you Library of Congress for putting this data out there.
---
# Matt is great, part 2

???
(Ashley)
Some more stuff you can find online.
---
class: middle, center

???
(Ashley)
I feel like I didn't initially appreciate the work Matt had been doing in this department until I went to check out some of the public data myself and ran into all these dark UX patterns pushing me to buy the MARC for $20k, and only at the bottom is there a link to the data, split into 41 chunks -- not easy to access and with no context.
---
class: middle, center
Installing Datasette:

???
(Ashley)
Anyway, after doing this mapping work (available online, link later), he used Datasette, a tool for publishing and exploring SQLite databases.
---
class: middle, center

???
(Ashley)
Here's a gif of the result. The results are functional, but Matt said the source data, at 10 million records, is maybe a little too big for proper use of this application. MARC data is complicated, but if you think about building your own database for a specific use case and releasing that as a data package, it makes things much more manageable. Also, he says, this is a super cheap way of building an API. If you put indexes on everything you want and stay away from full-text search, you could totally set up some simple API endpoints for people to use.
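The pipeline Matt actually ran isn't shown here, but the idea of a cheap, indexed SQLite database that Datasette can serve might look like this sketch (the table, columns, and sample records are invented for illustration, not the real LC/MARC mapping):

```python
import sqlite3

# Build a small SQLite database of bibliographic records.
# In a real project you'd write to a file so Datasette can serve it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author TEXT,
        year INTEGER
    )
""")
conn.executemany(
    "INSERT INTO records (title, author, year) VALUES (?, ?, ?)",
    [
        ("Moby-Dick", "Melville, Herman", 1851),
        ("Leaves of Grass", "Whitman, Walt", 1855),
    ],
)
# Index the columns people are likely to query -- this is what keeps
# a Datasette instance fast without resorting to full-text search.
conn.execute("CREATE INDEX idx_records_author ON records (author)")
conn.execute("CREATE INDEX idx_records_year ON records (year)")
conn.commit()

rows = conn.execute(
    "SELECT title FROM records WHERE year < 1855"
).fetchall()
print(rows)  # → [('Moby-Dick',)]
```

With the database written to disk as, say, `records.db`, running `datasette records.db` would give you a browsable UI plus JSON endpoints over each table (see the Datasette docs for the exact URL and filter syntax).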
---
# Experiments #2: PMA
# Data consumer
.center[]
???
(Ashley)
OK, my turn. This study is a general attempt at finding data to build a small thing on top of. I took the lazy way out and came across the Philadelphia Museum of Art's ["hackathon"](https://hackathon.philamuseum.org/) page, which has been set up as a public-facing, experimental API. I appreciated the upfront-ness about this page, where it explicitly states that this is an experiment and that they are still in the early stages of infrastructure-building. "We're sharing our work-in-progress version of what could possibly become our collection data access layer with you." This is nice.
---
class: middle, center

???
(Ashley)
The API framework comes with an explorer tool so I can test out API calls right in my browser without having to set things up and deal with raw JSON data, reducing clumsiness and letting me get what I want more quickly, which is an understanding of what kind of data I can see and how I can grab it. What's on display right now is what I initially used to grab random images, but later I saw another API endpoint that retrieves data about everything currently on display at the museum (or at least at the time this API was created). This is cool; without this explorer or auto-generated documentation, I might not have known that endpoint existed.
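The general pattern the explorer teaches you -- call a JSON endpoint, pull out the fields you want -- can be sketched like this. The endpoint path, parameters, and response shape below are made up for illustration and are not PMA's actual API; a canned payload stands in for the live HTTP call:

```python
import json
from urllib.parse import urlencode

# Build the kind of request URL an API explorer helps you discover.
# This hostname, path, and parameter set are hypothetical.
base = "https://api.example-museum.org/collection/onview"
url = base + "?" + urlencode({"page": 1, "size": 2})

# A canned response standing in for requests.get(url).json()
response_body = json.loads("""
{
  "data": [
    {"title": "Painting A", "image_url": "https://img.example.org/a.jpg"},
    {"title": "Painting B", "image_url": "https://img.example.org/b.jpg"}
  ]
}
""")

# Pull out just the image URLs -- the raw material for a small project.
images = [item["image_url"] for item in response_body["data"]]
print(images)
```

Once you can reliably get a list like `images` out of an API, building a toy on top of it is mostly front-end work.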
So to celebrate how nice this open access to data was, with a quick and easy-to-use API, and how much I appreciate the Philadelphia Museum of Art for making this stuff available to me, as a developer, ...
---
class: middle, center
[http://bits.ashleyblewer.com/smash-art/](http://bits.ashleyblewer.com/smash-art/)

???
(Ashley)
I made this game about smashing up all the art currently on display in the museum.
The premise of the game is simple (some could say... contrived). You are an Eagles avatar and the goal of the game is to smash up all the art in the museum because you are *extremely* excited about the Super Bowl.
---
class: middle, center

---
class: middle, center

---
class: middle, center

???
(Ashley)
So this is an example of the kind of things people could do with YOUR data. Um, but for real, when I was thinking about this talk and building on data, I immediately went with the thing that provided me with the quickest and most understandable access. For me, it's an API. For others, it might be a CSV.
I'm gonna pass this back over to Shawn to talk about the work she and others have done on the data caretaker front.
---
# Experiments #3: Data caretaker
---
class: middle, center
## NYPL Digital Collections Data Packager

???
I'm a metadata librarian at the New York Public Library and I use our Digital Collections API a lot. I know lots of other people don't. I think mainly because it's hard to understand how to get at what you want, and we use MODS XML to describe our items, which is just really hard to understand. It is great that we offer this data openly to the public, but it is definitely not a model for how to publish open data.
---
class: middle, center

???
We also get a lot of requests for pulling down metadata or images for a specific collection, which is not straightforward at all, so I wanted to write a Python script that would take a collection UUID, fetch the metadata and some image details for all of the items in that collection, and then map and flatten that to a CSV table schema,
---
class: middle, center

???
so that users of any skill level would be able to quickly get a sense of what's there and explore the data using existing spreadsheet tools. My goal with this was not so much to produce a finished product for sharing our collection data (I mean, you can use my script--it ain't pretty, but it works), but to learn from the process about what is __lost__ in mapping from MODS to a flat schema and how to make good decisions around creating a generalizable but responsible mapping that preserves as much meaning and context from the original metadata as possible.
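The real script maps MODS XML, which isn't reproduced here; this sketch just shows the core flattening move -- nested, repeatable metadata collapsed into one CSV row per item. The field names and the pipe delimiter are assumptions for illustration, not NYPL's actual schema:

```python
import csv
import io

# Nested item metadata, roughly the shape a collections API returns.
items = [
    {
        "uuid": "abc-123",
        "title": "View of the harbor",
        "subjects": [{"term": "Harbors"}, {"term": "New York (N.Y.)"}],
    },
    {
        "uuid": "def-456",
        "title": "Street scene",
        "subjects": [{"term": "Streets"}],
    },
]

def flatten(item):
    # Repeatable fields get joined with a delimiter so each item
    # becomes exactly one spreadsheet row -- lossy, but explorable.
    return {
        "uuid": item["uuid"],
        "title": item["title"],
        "subjects": "|".join(s["term"] for s in item["subjects"]),
    }

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["uuid", "title", "subjects"])
writer.writeheader()
writer.writerows(flatten(i) for i in items)
print(buf.getvalue())
```

The join-with-a-delimiter step is exactly where meaning gets lost (ordering, qualifiers, authority links), which is why the mapping decisions deserve documentation.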
---
class: middle, center

`datapackage.json`
???
So this script also creates what's called a Data Package and generates a JSON file that describes the collection and the data and outlines the schema in a machine-readable way. I got really into the Frictionless Data initiative a little while back while researching a talk and helping another former colleague wind down a data-intensive grant-funded project.
---
class: middle, center

[http://frictionlessdata.io](http://frictionlessdata.io)
???
Frictionless Data is an initiative and set of specifications maintained by Open Knowledge International. It's basically a container format for describing a dataset or a bunch of data files in a machine-actionable way. If you're familiar with Docker for software, it's kind of like that, but for data. It's not the only data package specification out there, but what I like about it is that there's a growing community around it and a ton of tools for creating data packages, reading them, and analyzing them. I also like that there's a basic schema for interoperability purposes, but it's extensible, so I think there's a real opportunity for creating data package schemas to suit library and cultural heritage needs.
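A minimal `datapackage.json` descriptor of the kind such a script emits might look like this sketch. The package name, resource, and field names are invented for illustration; consult the Frictionless Data Package spec for the full set of properties:

```python
import json

# A minimal Frictionless Data descriptor for one CSV resource.
# All names here are hypothetical examples.
descriptor = {
    "name": "example-collection",
    "title": "Example Digital Collection",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "items",
            "path": "items.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "uuid", "type": "string"},
                    {"name": "title", "type": "string"},
                    {"name": "subjects", "type": "string"},
                ]
            },
        }
    ],
}

# Serialize next to the data file so tools can discover it.
datapackage_json = json.dumps(descriptor, indent=2)
print(datapackage_json)
```

Because the descriptor travels with the data, any Frictionless-aware tool can validate the CSV against the declared schema without extra setup.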
---
class: middle, center


[http://spacetime.nypl.org](http://spacetime.nypl.org)
???
So after helping set up data packages for our SpaceTime Directory datasets, I wanted to try packaging up all of the other NYPL datasets we've created over the years for the projects that are no longer being maintained.
---
class: middle, center

[http://menus.nypl.org](http://menus.nypl.org)
???
I started with the datasets from our What's on the Menu crowdsourcing project because I thought it would be easy, but it turned out to be a huge archaeology project and taught me a lot about what kind of documentation data producers could be keeping along the way to put us in a better place for publishing and sunsetting projects.
---
class: middle, center

[http://curatingmenus.org](http://curatingmenus.org)
???
Because so many people from the project were no longer around, I ended up talking to Katie Rawson and Trevor Munoz, who had documented all of their amazing investigations into deciphering and using this data, to get their insights into what they know about it and what they would have liked to see published alongside the data to better understand it. I also relied heavily on the data dictionary they created, which gave me the idea to add source properties to my data package schemas to show the source of each field or the mapping.
A few things I learned from these conversations and experiences were that we should be getting librarians and curators more involved in helping provide information about the collections themselves, what's been digitized, what's missing, and even what ideas they have for research areas where these datasets might be valuable. One thing Katie and Trevor shared with me was that there's a real barrier to scholars using collections-as-data because it's often hard to find datasets that are a good fit for purpose, and if you can help make this more evident in your data documentation or make your dataset easier to explore, you have a better shot at finding the right users for your data.
---
class: middle, center
# PUT AN OPEN LICENSE ON IT
???
One of the biggest takeaways, though, is that even with the best intentions, you're probably not going to have the time or resources to add all of this documentation, so put an open license on your data, so that at least someone else who cares about it can reformat it, repackage it, and share it.
---
class: middle, center
## https://github.com/saverkamp/beyond-open-data
.center[]
???
(Ashley)
So there's a lot more we can say about this, and we did, and put it all in this GitHub repo. We put together a data care guide for creating datasets that more people with different skill levels can use and that will work with existing applications and tools. It can help you when releasing datasets, or help you advocate within your institution while planning data packages.
---
# Guide
<div id="grid9">
<div><h2>Consistency</h2></div>
<div><h2>Context</h2></div>
<div><h2>Licensing</h2></div>
<div><h2>Planning</h2></div>
<div><h2>Portability</h2></div>
<div><h2>Publicity</h2></div>
<div><h2>Redundancy</h2></div>
<div><h2>Reproducibility</h2></div>
<div><h2>Simplicity</h2></div>
</div>
???
(Ashley)
This guide goes into the above issues of who cares about data packages and why you should care about them. It also gives recommendations based on the following NINE categories: Consistency, Context, Licensing, Planning, Portability, Publicity, Redundancy, Reproducibility, and Simplicity.
---
???
We also included much more information about the experiments we talked about today and we also included some Frictionless Data packages so you can get inspired to create your own. And we hope you do! If you're really into this idea, please feel free to contribute to this repo via a pull request, start a conversation in the Issues tracker, or get in touch if you want to join as a contributor. We don't know how much we'll be doing with this repo, but we'd love to start talking about this stuff more in this community, so we hope you start caring and sharing with us!
---
class: middle, center
# Thank you!
### https://github.com/saverkamp/beyond-open-data
---
</textarea>
<script src="https://remarkjs.com/downloads/remark-latest.min.js">
</script>
<script>
var slideshow = remark.create({ ratio: '16:9'});
</script>
</body>
</html>