Submit a Dataset
The simplest way to submit data is to use our online submission form. Provide the requested information and hit Submit; we’ll review the submission and, if it’s suitable, eventually add it to the site.
Our overall goal is to provide a collection of useful, real-world datasets from a variety of application areas that can be used as in-class examples, homework assignments, or course projects. The repository can always use new datasets, so we welcome submissions. Datasets should meet a few requirements:
- The data must be publicly shareable. There should not be any licensing restrictions that prevent us from sharing it, or any human subjects ethics concerns or other limitations. Look for datasets marked as being in the public domain or with licenses like the Creative Commons Attribution (CC-BY) license.
- The data should be in a standard, easy-to-use format, like CSV. If you had to clean the data from its original form, include the R script you used for the cleaning with your submission.
- There should be good motivation for interesting analyses of the data that would be appropriate for a course.
- The data should be less than 100 MB, as large sizes pose technical problems and are inconvenient for students. Files larger than a few megabytes should be compressed when practical. R (read.csv()) and Python (pandas.read_csv()) can read .csv.gz files directly, so gzip compression is a good choice for larger files.
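As a rough illustration of the size guidance above, the following Python sketch gzip-compresses a CSV file and checks whether the result stays under the 100 MB limit. It uses only the standard library. The function names compress_csv and within_size_limit are hypothetical and not part of the repository's tooling, and treating 1 MB as 10**6 bytes is an assumption.

```python
import gzip
import shutil
from pathlib import Path

# 100 MB limit from the requirements above; reading "MB" as 10**6 bytes
# is an assumption of this sketch.
MAX_BYTES = 100 * 1000 * 1000


def compress_csv(csv_path):
    """Write a gzip-compressed copy of csv_path next to it and return its path."""
    src = Path(csv_path)
    dest = src.with_name(src.name + ".gz")  # e.g. mydata.csv -> mydata.csv.gz
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream the bytes, so large files are fine
    return dest


def within_size_limit(path, max_bytes=MAX_BYTES):
    """Return True if the file at path is smaller than max_bytes."""
    return Path(path).stat().st_size < max_bytes
```

For example, compress_csv("mydata.csv") (a hypothetical file name) would write mydata.csv.gz alongside the original, which read.csv() or pandas.read_csv() should then be able to open directly.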
If you’re familiar with R Markdown or Quarto, you can instead prepare the data description page yourself from a template. Read on for detailed instructions.
Getting Started with Quarto
This website is built with Quarto, which automatically renders the source files into web pages. Data description pages can be made directly from a template file we provide.
You can get the template in two ways:
- Copy the _dataset-template.qmd file from our GitHub repository and save it on your computer. Once you’re done, you can email us the file and the data.
- Fork our GitHub repository into your own GitHub account and edit it like any other Git repository. Once you’re done, you can submit a pull request.
The Dataset Template
The dataset template is a Quarto file; Quarto is much like R Markdown and is supported by recent versions of RStudio. Quarto files can contain R code, just like R Markdown, so your data description can include graphics and tables generated by R code embedded in the file, if you think this would be helpful to illustrate important features of the data. As you fill out the template, you can use RStudio’s “Render” button to see a preview of the finished page.
The template asks for:
- Basic metadata, such as the statistical methods applicable to the data, a short title, and your name. (All submissions are credited with the name of the submitter.)
- A description of the problem. This should be sufficient to motivate the analysis and help students understand the setting.
- A description of the data and the variables. It should be clear what each row of data represents and what all variables mean. Units of measure should be included whenever possible.
- References to the original source. If the source is an academic paper and the data is deposited in a third-party repository (such as Zenodo, Figshare, or Dryad), include references to both the paper and the archived dataset. When available, include the DOIs of the references.
The template is meant to prevent a common problem with course datasets: they get passed down from instructor to instructor, and eventually all information about the original source is lost. The dataset may be presented to students without context or important details (like units), and the instructor may not be able to find the original data or references to answer questions about the data. By providing sufficient detail, we can make datasets reusable for years to come.
Submitting
Once you’ve filled in the template, you can email the template file, the data, and any cleaning scripts to the repository editor (currently Alex Reinhart).
Alternatively, submit a pull request to our repository on GitHub. Save the .qmd file with a meaningful name and place it in one of the top-level category directories (like medicine/ or astronomy/). Place the data files in data/. Double-check that you’ve committed everything (and nothing extraneous) and open a pull request as usual.
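Before opening the pull request, a quick local check of the layout described above can catch misplaced files. The sketch below shows one possible way to do this in Python. The function check_submission and the example paths are hypothetical, and the repository itself may not run any such check.

```python
from pathlib import Path


def check_submission(repo_root, qmd_path, data_files):
    """Return a list of layout problems (an empty list means none were found).

    qmd_path and data_files are paths relative to repo_root.
    """
    root = Path(repo_root)
    problems = []

    qmd = Path(qmd_path)
    # The description page should sit directly inside a top-level category
    # directory, e.g. medicine/example.qmd.
    if qmd.suffix != ".qmd" or len(qmd.parts) != 2:
        problems.append(f"{qmd} should be a .qmd file in a top-level category directory")
    if not (root / qmd).is_file():
        problems.append(f"{qmd} does not exist")

    for name in data_files:
        data = Path(name)
        # Data files belong under data/.
        if data.parts[:1] != ("data",):
            problems.append(f"{data} should be placed in data/")
        elif not (root / data).is_file():
            problems.append(f"{data} does not exist")

    return problems
```

Calling check_submission(".", "medicine/example.qmd", ["data/example.csv.gz"]) from the root of your fork returns an empty list when both files are where the instructions say they should be. It does not replace reviewing the output of git status for extraneous files.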