If you are planning to publish research in which you analyzed data that are (1) derived from NEON data products, (2) derived from NEON samples, (3) collected through the NEON Assignable Assets Program, or (4) collected at a NEON site, then we recommend you continue reading this document. We include some best practices and practical tips for getting credit, not only for your research but also for your data and scientific workflow! In addition, a Data Management Plan (DMP) is an essential planning tool for any research project and is often required by funding agencies. This page can help you decide what to include in your DMP.
Why should you publish your data?
- Get credit for your work by making your data citable with a DOI.
- Provide primary data for synthesis science and gain more exposure for your work.
- Preserve your data in a repository, which will provide more stability, permanence, and provenance tracking than your local archive solution.
Why do we need best practices for publishing data?
Transparency and trust in research outputs are crucial for science! Our goal is to promote FAIR (Findable, Accessible, Interoperable, and Reusable) data principles by recommending data management and publication best practices for NEON data users. Source datasets, as well as derived data products and processing code related to NEON, should be easily discoverable and accessible to everyone in the user community. Therefore, we recommend that any project using NEON-derived inputs manage its data, code, and outputs carefully and publish them so that others can access them in the future.
Publishing outputs derived from NEON data or samples
When you use NEON data (including data about samples), you’ll likely build an analytical workflow (Stoudt et al. 2021, Brun et al. 2020). Workflows range from a documented process in a spreadsheet to a custom-coded application. The inputs may include provisional or released NEON data as well as data from sources external to NEON. The outputs will include your final derived datasets and may also include intermediate datasets created during your analysis that are needed to keep the analysis reproducible.
There are many resources for learning how to manage these products before, during, and after research and how to publish in a FAIR data repository, including:
- The Carpentries
- Environmental Data Initiative: Five Phases of Data Publishing and Data Package Best Practices
- DataONE Training
- American Geophysical Union (AGU) Publication Resources
What should you publish?
Final data outputs. For most NEON-derived outputs, we recommend using text-based files in a non-proprietary format (e.g., .txt, .csv, .json formats) or file types that are standard within the community of practice (e.g., remote sensing), with appropriate metadata. For larger data files, see the Large Data Packages section under Special Considerations for Specific Data Types below.
Source data or references. Many datasets downloaded from NEON or other repositories are associated with a Digital Object Identifier (DOI) that will permanently resolve to an archived copy of the data. Here, the DOI may be supplied in place of the actual data. For other datasets that do not have a DOI or other permanent, resolvable ID, you may wish to include the source data with your data package. For example, NEON data published after the most recent release are provisional and may be corrected at any time until they are included in an annual release. Pre-corrected, provisional NEON data may not remain available to the public after correction; thus, to maintain reproducibility, it is necessary to archive the provisional source data used in your analysis or workflow.
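One way to make an archived copy of provisional source data verifiable is to store a checksum manifest alongside it. The following is a minimal sketch, not a NEON or EDI tool: the function names, the manifest columns, and the choice of SHA-256 are assumptions made for illustration. It assumes the provisional files have already been downloaded into a local folder and that the manifest is written outside that folder.

```python
import csv
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest_path: Path) -> list[dict]:
    """Record every file under data_dir with its size, checksum, and date.

    manifest_path should sit outside data_dir so the manifest does not
    end up listing itself on a later run.
    """
    fields = ["file", "bytes", "sha256", "recorded_on"]
    rows = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rows.append({
            "file": path.relative_to(data_dir).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
            "recorded_on": date.today().isoformat(),
        })
    with manifest_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

For example, `write_manifest(Path("provisional_inputs"), Path("provisional_manifest.csv"))` (both paths hypothetical) would produce a plain-text table that can be included in the data package next to the archived provisional files.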
Workflow(s). In addition to source data, we recommend that any workflows (e.g., code, containers) necessary to reproduce an analysis be archived and made readily available to other researchers in a manner that meets the ethical considerations and norms of scientific research. Provide a license, preferably an open source license, for code that you have generated, and reference any proprietary code or workflows that you used.
Choosing a repository
We recognize that no single repository will meet all the needs of everyone in the NEON data user community. However, we recommend choosing a community-accepted domain repository that provides a high standard of metadata support and, therefore, FAIR data. A domain repository implements specific search capabilities that make your data Findable and Accessible. Rich science metadata and community-supported data standards help ensure that data are Interoperable and Reusable and reduce the chances of misuse and misinterpretation.
We have partnered with the Environmental Data Initiative (EDI), where data submission is free and supported by the National Science Foundation, to serve as a repository that is suitable for many research projects that leverage NEON data. EDI provides support to ensure that your data are well documented and formatted. EDI accommodates the inclusion of processing code in published packages, makes datasets citable by providing a DOI, and can handle large datasets. More information is available below.
Other repositories may also be suitable for your data depending on the research subject.
Specialized community-supported repositories (this is not an exhaustive list):
- Knowledge Network for Biocomplexity (KNB)
- National Center for Biotechnology Information (NCBI)
- NCBI Sequence Read Archive (SRA) for large sequence files
- Arctic Data Center
- U.S. Department of Energy (DOE)
- NASA’s Oak Ridge National Laboratory Distributed Active Archive Center for Biogeochemical Dynamics (ORNL DAAC) - particularly for research supported by NASA. Other DAACs may also be applicable.
Data repositories vary in their support of standard metadata, or information about datasets. The most basic systems allow for adding appropriate keywords to a project. More FAIR systems allow adding structured, machine-readable information (e.g., Ecological Metadata Language [EML]). To support reproducibility, we recommend that your published metadata include, at minimum:
- Customary information about your project, including the name(s) and contact information of the author(s) and any other associated people, dataset title, date ranges, project information, and specifications about the data files associated with the project
- A thorough description of the analytical process used
- Keywords, preferably drawn from a community-supported controlled vocabulary or ontology, that will help others discover your dataset
- DOIs associated with any NEON data releases that were used
- Citations for any provisional NEON data files that were used
- DOIs that point to other data that were used or the actual data files
To improve the findability of your data, we recommend including these specific pieces of information in the metadata:
- NEON-specific keywords:
  - NEON data product ID(s) (e.g., “DP1.00001.001”)
  - Locations: NEON four-letter site code(s), e.g., "ABBY", "BART". In the next section we provide tips on how to import site-specific metadata into EDI's ezEML tool.
  - Variables: It may be helpful to retain the variable naming conventions that NEON uses, to make it easier for users to trace processing from source to derived data, unless one of the goals is to convert naming to another standard convention such as the Semantic Web for Earth and Environmental Terminology (SWEET) or Environment Ontology (ENVO) ontologies.
- NEON’s Research Organization Registry (ROR) ID (04j43p132)
- DOIs for any source data, including released NEON data products or prototype data
- Applicable grant identification numbers
- Accession numbers and/or unique identifiers for samples, including those located in the Biorepository and genomic sequences. These may need to be handled as an inventory table in the data section rather than the metadata if too numerous (for more information, read detailed discussions of linking to data in other repositories).
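Before entering NEON-specific keywords, it can help to check that product IDs and site codes are well formed. The sketch below is illustrative only: the two patterns are inferred from the examples in this guide (such as “DP1.00001.001” and "ABBY") and are assumptions rather than an official NEON specification, and the function name is hypothetical. It checks only the shape of each entry, not whether the product or site actually exists.

```python
import re

# Shapes inferred from the examples in this guide; these are assumptions,
# not an official NEON specification.
PRODUCT_ID = re.compile(r"DP\d\.\d{5}\.\d{3}")
SITE_CODE = re.compile(r"[A-Z]{4}")


def neon_keywords(product_ids, site_codes):
    """Return a de-duplicated keyword list, raising on malformed entries."""
    bad = [p for p in product_ids if not PRODUCT_ID.fullmatch(p)]
    bad += [s for s in site_codes if not SITE_CODE.fullmatch(s)]
    if bad:
        raise ValueError(f"Unexpected format: {bad}")
    # dict.fromkeys keeps first-seen order while dropping duplicates
    return list(dict.fromkeys([*product_ids, *site_codes]))


print(neon_keywords(["DP1.20097.001"], ["ABBY", "BART", "ABBY"]))
# prints ['DP1.20097.001', 'ABBY', 'BART']
```

A stray character, such as a trailing period in "BART.", raises a `ValueError` instead of silently ending up in the published metadata.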
Publishing with the Environmental Data Initiative (EDI)
Before submitting your data for publication with EDI, please review this guidance about the five phases of data publishing, which are: (1) Organize, (2) Clean, (3) Describe, (4) Upload, and (5) Cite.
For step 3, Describe: describe your data using the Ecological Metadata Language (EML). Tools to help you generate EML metadata are described in detail here. For a data package derived from NEON source data, we recommend the following steps:
- Load NEON-specific EML
- Work through the sections in the ezEML editor. Here are some recommendations:
  - The title for your data package should be descriptive and, when appropriate, indicate that it was derived from a NEON data product, e.g., “Dissolved greenhouse gas concentrations derived from the NEON dissolved gases in surface water data product (DP1.20097.001).”
  - Upload your data tables to the ezEML tool and be sure to edit the column properties to define the fields in each table.
  - NEON should be listed in the “Associated Parties” section as a “data provider,” and the entry should include NEON’s ROR ID (04j43p132). This is pre-populated in the template. Please add additional data providers in this section if/when appropriate.
  - The abstract should include:
    - A description of the data contained in the dataset, the NEON data products that serve as source data, and any other data sources.
    - An acknowledgement statement, such as: “NEON is sponsored by the National Science Foundation (NSF) and operated under cooperative agreement by Battelle. This material is based in part upon work supported by NSF through the NEON Program.”
    - A list of NEON source data product citations (find guidance on how to cite NEON data here). EDI staff will use these to properly track data provenance. This list will not be displayed in the final abstract. Example:
      NEON (National Ecological Observatory Network). Dissolved gases in surface water (DP1.20097.001). https://data.neonscience.org (accessed April 6, 2021) https://data.neonscience.org/data-products/DP1.20097.001
  - The geographic coverage is pre-populated in the NEON template with all NEON sites. Please “remove” any NEON sites not included in your data package.
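If a package draws on several NEON data products, the list of source citations for the abstract could be assembled programmatically. The sketch below simply mirrors the layout of the example citation given above; the function name and parameters are hypothetical, and NEON's own citation guidance should take precedence, particularly for released data that carry a DOI.

```python
from datetime import date

# Month names are spelled out so the output does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)
PORTAL = "https://data.neonscience.org"


def neon_citation(product_name: str, product_id: str, accessed: date) -> str:
    """Format a citation following the example layout in this guide."""
    accessed_text = f"{MONTHS[accessed.month - 1]} {accessed.day}, {accessed.year}"
    return (
        "NEON (National Ecological Observatory Network). "
        f"{product_name} ({product_id}). {PORTAL} "
        f"(accessed {accessed_text}) {PORTAL}/data-products/{product_id}"
    )


print(neon_citation("Dissolved gases in surface water", "DP1.20097.001", date(2021, 4, 6)))
```

Called with the product name, ID, and access date from the example, this reproduces the example citation string.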
For step 4, Upload: When your ezEML package is complete, please select “check metadata” and work through any outstanding issues identified with your data package. Then select “submit to EDI.” This sends the package to an EDI data manager for review, and you will receive an email with either a “proof” data package or questions.
Finally, for step 5, Cite: cite your data package once it is published with a DOI!
Some of these practices may be applicable to other repositories that use EML, including KNB and the Arctic Data Center (ADC).
Special Considerations for Specific Data Types
Data Collected Through NEON’s Assignable Assets Program
Investigators who use NEON Assignable Assets (AA), including Observational Sampling Infrastructure (OSI), Sensor Infrastructure (SI), the Airborne Observation Platform (AOP), the Mobile Deployment Platform (MDP), or Excess Samples, or who collect data at NEON sites, are encouraged to make their research data freely and openly available as soon as possible, or within two years of completion of the NEON Assignable Asset data collection. The data should be published in a public-access-compliant repository or in accordance with funding agency requirements. Please cite the public repository where the data are archived and acknowledge the data collection through the NEON Assignable Assets Program as described in NEON’s Acknowledgement and Citing Guidelines. The request form for AA resources asks each lead investigator to describe plans for data collection, storage, and publication.
Data Associated with NEON Biorepository Samples
The NEON Biorepository data portal is capable of hosting or linking to many forms of sample-associated data, including images, species determinations, genetic sequences, and trait data. Researchers using NEON Biorepository samples are strongly encouraged to work with NEON Biorepository bioinformaticians and become portal managers in order to disseminate their data to the public. Once linked to a sample occurrence record in the NEON Biorepository data portal, these data are then published to biodiversity data aggregators, such as the Global Biodiversity Information Facility (GBIF) portal, where citations of sample record usage are recorded. All publications resulting from work on NEON Biorepository samples can be published within the relevant sample occurrence records as associated references, and lists of samples used in a published work can be highlighted as a special dataset within the NEON Biorepository data portal. Special datasets and occurrence records therein can be annotated or expanded at any time, but static copies can be downloaded as Darwin Core Archives.
Large Data Packages
Data packages that are large enough to be difficult to download from a repository may require special considerations. These could include data derived from AOP data products, data collected using the AOP and/or MDPs available through the NEON Assignable Assets program, spatial data products that are trained on or interpolate NEON data, and outputs of models that are assimilating NEON observations. To ensure long-term data integrity, a repository may recommend that these types of data outputs are archived using "cold storage," where multiple copies of the data are created on multiple physical hard drives that are then archived in multiple physical locations or on cloud servers.
We also recommend making such datasets available via discipline-specific data repositories, where data may be more accessible but longevity may not be guaranteed.
- Raster data, including spectrophotometric, camera, and other remote imagery datasets, can be submitted to Google Earth Engine (GEE). Users should be aware of any restrictions that might be placed on data use now or in the future.
- Lidar data packages can be submitted to OpenTopography.
- Large datasets can be archived with EDI, where the metadata associated with large data packages will be made discoverable on the EDI data portal, similar to other NEON-related data products that are archived with EDI. Investigators will need to provide their own hard drives and must consider that cost when creating the budget for their project.
- Using more than one option concurrently may be of value. For example, a PI may wish to publish with GEE to make data easily discoverable and accessible, but also archive with EDI using cold storage to ensure data longevity, integrity, and accessibility beyond what is promised via the GEE platform.
Forecasting ecological responses to environmental or biological drivers can be data intensive, as large amounts of data are ingested and analyzed frequently. The Ecological Forecasting Initiative, a grassroots consortium aimed at building and supporting an interdisciplinary community of practice around near-term (daily to decadal) ecological forecasts, has been developing data management guidelines specific to forecasting.
What else would you like to learn? Please contact us with any questions.