The state of the data in citizen science

Citizen science has contributed to biodiversity research and monitoring for hundreds of years. Still, the recent increase in scale, scope, diversity and number of citizen science projects highlights the challenge of designing and implementing good practices around data collection and data curation. The Committee on Data for Science and Technology of the International Council for Science (ICSU-CODATA) and the World Data System (WDS) recently founded a joint Task Group to understand and support good practices for citizen science data validation, data cleaning and curation, and short- and long-term data management. Research projects conducted by the ICSU-CODATA-WDS Task Group include the development of an initial typology of citizen science data generating tasks, and an exploratory landscape analysis of the state of the data in citizen science. The landscape analysis found that citizen science projects use a wide range of strategies for data validation at numerous stages of the scientific research process. In comparison, practices for data documentation, curation, and long-term management are less advanced. This may limit data discovery and re-use. This work complements the planned and ongoing efforts of the TDWG Citizen Science Interest Group to advance biodiversity informatics for citizen science. Presenting research on the state of the data in citizen science can promote cross-pollination between the ICSU-CODATA-WDS Task Group and the biodiversity community, and encourage researchers and practitioners to work together to advance citizen science data quality, standards, and interoperability.
© Bowser A, Cooper C. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction
Citizen science refers to a spectrum of activities where scientists and members of the public collaborate in scientific work. While conversations to more concretely define and bound "citizen science" are underway (Eitzel et al. 2017), we consider citizen science inclusive of projects across domains and scales, to include both local, place-based initiatives and broader, crowdsourcing solutions. Though the phrase citizen science entered the vernacular in the mid-1990s (Bonney 1996; Irwin 1995), members of the lay public have been involved in science for centuries. Driving and enabling factors for the current proliferation of activities include the rise of the Internet; increased smartphone penetration along with the spread of other information and communication technologies (ICT); recognition from scientists that involving volunteers can support and augment their work; funder requirements for public engagement or outreach; and the rapid increase in global education (Silvertown 2009; Cooper 2016). Millions of people now contribute to citizen science each year. SciStarter, a United States (US)-based directory of citizen science projects and related activities, recorded an average of 30 projects added per month during the course of 2018.
Educators in formal and informal settings introduce citizen science with the goal of enhancing topical knowledge and public understanding of science (Bonney et al. 2016). Scientists in academic institutions incorporate citizen science into their research programs, with bibliometric analysis demonstrating the exponential growth of publications referencing citizen science in recent years (Follett and Strezov 2015). Citizen science is also enjoying increased attention on the policy level, as seen in Europe and the US (Nascimento et al. 2018). Members of professional and public communities engage in diverse citizen science activities for a wide range of reasons. Some seek to advance the research enterprise, for example, by enabling data collection on scales and resolutions not possible through professional activities alone (Cooper et al. 2012). Others seek to bridge the science-society gap by making professional researchers and citizens more accountable to each other (Irwin 1995).
The growth and formalization of citizen science is supported by professional associations based in Australia, Europe, and the US, as well as emerging associations in Asia, South America, and Africa. These organizations provide convening power, and help collect and distribute best practices on the science of citizen science, including through conferences and a peer-reviewed journal (Storksdieck et al. 2016). As further evidence for global reach, the Citizen Science Global Partnership was launched in collaboration with the United Nations Environment Programme as a network-of-networks supporting global coordination and linking citizen science to the UN Sustainable Development Goals (SDGs). Beyond the establishment of new organizations, existing governments and NGOs are developing resources for their employees, grantees, and partners to conduct citizen science. For example, the US Federal Government and partners launched the CitizenScience.gov platform in 2016, which included a toolkit, a catalogue of federal citizen science projects, and a community page (Nascimento et al. 2018).
One common theme across these citizen science initiatives is the central importance of data collected or generated by the efforts of volunteers who are not typically from scientific professions. As the common denominator in nearly all citizen science projects, data are the foundation of citizen science: Without proper handling of such data, projects will have limited success. However, the potential to generate knowledge through primary research and the reuse of data, and to inform evidence-based decision-making, will be limited if the field does not further advance norms around high-quality data collection and management. Several researchers have offered case studies of individual citizen science projects that excel at various aspects of data collection, management, and use. These case studies generally document effective practices within a specific project, and sometimes offer more generalized recommendations in areas including avian presence and distribution (Sullivan et al. 2017), marine debris (van der Velde et al. 2017), urban tree inventories (Roman et al. 2016), and invasive species (Crall et al. 2011).
Other researchers have identified and analyzed data quality practices and fitness-for-use assessments across citizen science initiatives (see, for example, Specht and Lewandowski 2018; Kelling 2018; Aceves-Bueno et al. 2017; Kosmala et al. 2016; Lukyanenko et al. 2016; Sheppard, Wiggins, and Terveen 2014; Wiggins et al. 2011). Still others delved into issues related to standardized data collection (Higgins et al. 2018; Sturm et al. 2017), data management (Bastin et al. 2017), or concepts like fitness to purpose or fitness for use. But with the exception of Schade et al. (2017), who collected data focused on citizen science data access, standardization, and preservation via an online survey, little published work in the context of citizen science evaluates practices related to the full data lifecycle as defined in Box 1.

Box 1: Data Lifecycle and Data Management.
This box provides definitions of different aspects of the data lifecycle and data management. The purpose is to provide a high-level overview for citizen science researchers who may be less familiar with terminology and approaches taken by the research data community.
Data acquisition: Collection, processing, and curation of scientific information. Acquisition can occur through human observation or automated sensors.
Data quality: Quality assurance/quality control (QA/QC) checks taken across the data lifecycle, from acquisition to archiving to dissemination. These include validation, cleaning, and checks for data integrity.
Data infrastructure: Tools and technologies including hardware and software that support data collection, management, and access.
Data security: Methods of protecting data from unauthorized access, modification, or destruction through proper system security and staff training.
Data governance: Rules for the control of data including provisions for stewardship, privacy, and ethical use, including ensuring the protection of personally identifiable information (PII).
Data documentation: Discovery metadata (structured descriptive information about data sets used by catalog search tools) and documents describing data inputs and methods used to develop data sets.
Data access: The conditions required for users to find and use data, including metadata and licensing. The research community has variously adopted standards of open access or FAIR (Findable, Accessible, Interoperable, and Reusable) data. This includes long-term preservation.
Data services: Tools and web-based applications built with data sets and computer code.
Data integration: The process of combining data from different sources, which requires interoperability enabled through the use of data and service standards.
For additional information on any of these aspects, visit the World Data System training resources page (https://www.icsu-wds.org/services/training-resources-guide) or the ESIP Federation data management training clearinghouse (http://dmtclearinghouse.esipfed.org/).

We sought to advance conversations about the state of the data in citizen science through structured interviews with 36 citizen science projects around the world, representing many scientific domains, and to provide recommendations for improved practice. This research was conducted by citizen science and data experts working under the auspices of the International Science Council Committee on Data (CODATA) and World Data System (WDS). Together, CODATA and WDS formed a task group, Citizen Science and the Validation, Curation, and Management of Crowdsourced Data, in 2016. The objectives of the task group were to better understand the ecosystem of data-generating citizen science, scientific crowdsourcing, and volunteered geographic information (VGI) to characterize the potential and challenges of these developments for science as a whole, and data science in particular.
Following this introduction, we review current trends in science and scientific data relevant to citizen science, and then examine current issues around data quality and fitness for use in citizen science. The first contribution of this paper is an exploratory empirical investigation into the state of the data in citizen science. We present our methods and results of the survey of practices before discussing the results. This paper also contributes practical and research-oriented recommendations. As an initial step toward offering concrete guidelines, we identify a list of good data management practices that may be helpful for citizen science projects to consider, particularly if they wish to elevate the value of their data for reuse. We also suggest areas where more research is needed to understand more about our findings, and maximize the impact of this steadily growing field.

Shifting norms around open and FAIR
Norms and practices governing data management are still emerging in conventional science, and are not yet firmly established across disciplines. One important development in scientific research is the emergence of open and FAIR (Findable, Accessible, Interoperable, and Reusable) principles. Broadly, open science is research conducted in a way that allows others to collaborate and contribute (OECD 2020). As a movement or paradigm, open science can be traced to the Scientific Revolution of the late 16th and early 17th centuries when rapid dissemination of knowledge became a guiding principle for scientific research (David 2008). Contemporary advocates argue that open science strengthens research by facilitating reproducibility through transparency (Munafò et al. 2017), and makes science more accessible to stakeholders including the general public, though important power differences often remain (Levin and Leonelli 2017). Recently, open science has been accelerated by policy initiatives in Australia, the European Union, the United Kingdom, and the US (Tenopir et al. 2015).
While open science can be described as a paradigm, or an approach to scientific research, FAIR is more prescriptive, offering concrete guidelines and even checklists for researchers to follow (Wilkinson et al. 2016). Practices around cataloguing and metadata documentation help make data FAIR.

The state of the data in scientific research
Understanding the current state of data management is critical for understanding and charting progress moving forward. Notably, the larger scientific community has only recently begun to adopt practices related to open and FAIR data. One benchmark study of 1,329 researchers across scientific domains explored practices and perceptions of data sharing (Tenopir et al. 2011). At the time of publication in 2011, 29% of respondents had data management plans, while 55% did not and 16% were uncertain. Regarding data access, 38.5% of respondents stored their data in an organization-specific system. A follow-up study conducted shortly after National Science Foundation (NSF) policies went into effect reported mixed progress. Perceptions of the value of data sharing increased, but so did perception of threats, and progress on self-reported practices was mixed (Tenopir et al. 2015).
A number of factors contribute to suboptimal data management in scientific research. While researchers are generally satisfied with tools for short-term storage and documentation of their data, access to longer-term repositories may be lacking (Tenopir et al. 2011), and citizen science practitioners may not be familiar with the many domain-specific repositories, though in recent years open repositories such as Dryad and FigShare have grown in popularity. Beyond the provision of technical tools, "Barriers to effective data sharing and preservation are deeply rooted in the practices and culture of the research process as well as the researchers themselves" (Tenopir et al. 2011, p.1). Incentives are often missing for researchers to invest the time and effort required to make their data open or FAIR, since data cleaning and documentation are time-consuming activities that lack the same incentives as, for example, publication. Further, an academic culture that tethers scholarly publication to professional milestones like the tenure process may actively disincentivize openness and sharing if researchers fear getting scooped. And volunteer citizen scientists are not necessarily motivated by the same incentives as researchers, but rather by factors such as personal interest, learning, creativity, socialization, and the desire to contribute to scientific research (Jennett et al. 2016; Rotman et al. 2012).
Researchers have also started to study data reuse, defined as the use of data by the original data collector or third-party users, sometimes by combining the data with other data, for the same or different purposes for which they were originally collected. One study found that perceived utility of a data set was the single strongest factor leading to reuse, and concluded that the value of reuse should be more widely demonstrated to the academic community (Curty et al. 2017). Efforts to make data discoverable, promote the use of strong metadata, and improve norms and practices around data attribution and citation could all lead to more data reuse. Regarding citation, the use of persistent identifiers (e.g., Digital Object Identifiers [DOIs]) can ensure that researchers are able to refer to a unique data set produced at a given point in time by providing persistent URLs to data that are retained even if, for example, data moves from a project website to a longer-term repository. This is important for traceability in scientific findings as well as for appropriate attribution.

The state of the data in citizen science
The White House memorandum Addressing Societal and Scientific Challenges through Citizen Science and Crowdsourcing (Holdren 2015) offers three core principles for citizen science: Contributions of volunteers should be 1) fully voluntary, 2) meaningful, and 3) acknowledged. Similarly, the European Citizen Science Association (ECSA)'s 10 Principles of Citizen Science (ECSA 2016) include "citizen science project data and metadata are made publicly available and where possible, results are published in an open access format." These codes suggest that data sharing, including through publication, may be necessary to fulfill a core best practice of citizen science. Some researchers document the importance of report-backs, or the process of sharing individual and collective results with volunteers in ways that are meaningful and useful to them (Bonney et al. 2009; Morello-Frosch et al. 2009; Gallo and Waitt 2011). Relatedly, there is often a noticeable commitment within citizen science projects to publish academic papers in open access journals (although fees can be a barrier to follow-through). However, the realities of data sharing may suggest differently: One study of open biodiversity data available through the Global Biodiversity Information Facility (GBIF) found that citizen science datasets were among the least open (Groom et al. 2016).
Beyond open data, a significant portion of research on data practices addresses data quality. The topic of data quality is a key concern in the scientific enterprise because perceptions of poor data quality can influence the willingness of scientists or policy makers to trust the results of citizen science. In the context of a research project, the construct of data quality means that data are high enough quality to serve a project's goals: there are no universal criteria to establish quality in scientific data because it is inherently contextual. In acknowledgement of this reality, the concept of fitness for use is frequently applied in citizen science (Kosmala et al. 2016), with the focus on designing project processes with the end in mind. For example, in air-quality monitoring, low-cost sensors cannot currently compete with professional instruments for achieving precision and accuracy at the levels necessary for regulation (Castell et al. 2017). Therefore, one goal of citizen science air-quality projects may be to get regulators to take notice when systematically collected data indicates a potential problem meriting further investigation. Low-cost (including commercial or open-source/do-it-yourself) sensors are of suitable quality to be fit for this, and often other, purposes.
When used to describe an individual data record, data quality typically refers to the accuracy and precision with which a data value represents a measurable parameter of an entity or phenomenon. At a whole dataset level, data quality refers to all attributes being accurately measured using a standard/common protocol and accurate instrumentation. Higher-quality data accurately and precisely represent reality, whereas low-quality data are a poor or inconsistent representation. Errors in measurement can be random (scattered) or systematic (always wrong or biased in the same direction), and they can arise owing to poor instrumentation (imprecise, poorly calibrated, or old) and operator errors, which usually introduce systematic biases in data. Therefore, measurement accuracy may be affected by several factors, including the training and competence of volunteers; sensitivity, calibration and construction quality of measuring instruments; establishment of a consistent sampling frame; the methods used in taking/determining measurements and their consistency over time and space and across volunteers; and delays between sample collection and measurement (in lab settings). With respect to field-based observational facts such as species occurrence recording, competency and attention to detail by citizen scientists can affect factors such as correct identification, spatial accuracy, precision and uncertainty, and date/time precision. In addition, third-party perceptions of data quality can be affected by whether records have been verified or validated by experts or if there are methods or additional data sets available to cross-validate or even triangulate results.
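The distinction between random and systematic error drawn above can be illustrated with a small simulation. This is a minimal sketch and not part of the study: the true value, bias, and noise levels are invented for illustration, and the function names `simulate` and `summarize` are hypothetical.

```python
import random
import statistics

def simulate(true_value, bias, noise_sd, n, seed=0):
    """Return n simulated measurements of true_value.

    bias models a systematic error (every reading shifted the same way);
    noise_sd models random error (readings scattered around the shifted value).
    All parameters are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0, noise_sd) for _ in range(n)]

def summarize(measurements, true_value):
    """Estimate the systematic offset (mean error) and random scatter (std dev)."""
    mean = statistics.fmean(measurements)
    return {
        "bias": mean - true_value,
        "spread": statistics.stdev(measurements),
    }

TRUE_VALUE = 20.0  # hypothetical quantity, e.g. a temperature in degrees

# Unbiased but imprecise instrument: large scatter, mean close to the truth.
random_only = simulate(TRUE_VALUE, bias=0.0, noise_sd=1.0, n=2000, seed=1)

# Precise but poorly calibrated instrument: small scatter, mean shifted.
systematic = simulate(TRUE_VALUE, bias=1.5, noise_sd=0.1, n=2000, seed=2)

print(summarize(random_only, TRUE_VALUE))
print(summarize(systematic, TRUE_VALUE))
```

In this sketch, averaging many readings reduces the effect of random error but leaves the systematic offset untouched, which is one reason calibration and consistent protocols matter even when sample sizes are large.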
However, the actual quality of data has significance only in the context of usage. This is a relative concept that relates to fitness for use (Chapman 2005), i.e., for some applications, low-quality data may be acceptable. One of the underlying premises of citizen science in the field of biology, for example, is that scores of amateur scientists can collect data over much larger areas and longer periods than would ever be possible by highly trained biologists alone. Thus, in some studies, the lower quality is balanced by a far wider scope, demonstrating that almost all data has value depending on the purpose for which it is to be used. In addition, citizen science data may be analyzed along with other scientific or instrumental observations as a method of either validating or cross-validating the data, or complementing data of known quality with a larger sample size.
Researchers typically consider data quality and fitness for use in individual project design, explaining the factors affecting data quality within the text of research papers. However, such explanations are not always documented in metadata accompanying the primary raw and processed datasets used in the research, and if they are documented, it is rarely in structured, standardized formats. These cases both create a number of significant problems and constraints for secondary users of the primary data.
For a variety of reasons, researchers are increasingly turning to individual and aggregated datasets collected by other projects as primary or secondary data (e.g., to augment their own original datasets) for their research. These secondary applications of data are highly dependent on researchers having a clear understanding of the provenance, methods, data-quality constraints, and prior treatments of datasets in order to support decisions about fitness for use in their particular application of the data. For secondary users of data to be able to assess fitness for use, they must be able to efficiently filter, sort, and select particular datasets that satisfy the quality criteria for their purpose. To accomplish this, it is critical for dataset metadata to describe the quality aspects of the data as comprehensively as possible, including its provenance, treatment, constraints, and biases, in a structured, standardized way. Our results indicate that currently, well-documented data are not always the norm in citizen science.
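As a rough illustration of what structured, standardized quality metadata could enable, the sketch below filters a small catalog by fitness-for-use criteria. It is hypothetical throughout: the `DatasetQualityMetadata` fields, the `select_fit_for_use` function, and the two example records are invented for this example and do not correspond to any particular metadata standard or to projects in the study.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetQualityMetadata:
    """Hypothetical quality descriptors a dataset record might carry."""
    title: str
    provenance: str               # who collected the data and how
    protocol: str                 # name of the collection protocol, if any
    expert_verified: bool         # whether records were reviewed by experts
    spatial_uncertainty_m: float  # typical positional uncertainty in meters
    known_biases: List[str] = field(default_factory=list)

def select_fit_for_use(datasets, max_uncertainty_m, require_verified):
    """Keep only datasets that meet a secondary user's quality criteria."""
    return [
        d for d in datasets
        if d.spatial_uncertainty_m <= max_uncertainty_m
        and (d.expert_verified or not require_verified)
    ]

# Invented example records, for illustration only.
catalog = [
    DatasetQualityMetadata(
        title="Example bird counts",
        provenance="volunteer checklists",
        protocol="timed point count",
        expert_verified=True,
        spatial_uncertainty_m=30.0,
        known_biases=["more observations near roads"],
    ),
    DatasetQualityMetadata(
        title="Example opportunistic sightings",
        provenance="mobile app submissions",
        protocol="none",
        expert_verified=False,
        spatial_uncertainty_m=500.0,
    ),
]

# A fine-scale habitat study might need verified, precisely located records.
strict = select_fit_for_use(catalog, max_uncertainty_m=100.0, require_verified=True)

# A coarse range map might accept everything within one kilometer.
lenient = select_fit_for_use(catalog, max_uncertainty_m=1000.0, require_verified=False)

print([d.title for d in strict])
print([d.title for d in lenient])
```

The same two datasets pass or fail depending on the application, which mirrors the point that quality only has meaning relative to an intended use. Filtering of this kind is only possible when the descriptors are recorded in machine-readable form rather than in the prose of a paper.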
In summary, while the citizen science community may lag slightly behind ideals, this is probably in part because of the rapid evolution of scientific norms of open data, data publication, metadata, data documentation, and data reuse over the past decade, which in fact means that many corners of the global scientific enterprise are rushing to catch up. We turn now to our methods and results, before turning to a discussion of what the evolution of norms means for the citizen science community and how the community can improve its data practices.

Methods
The level of detail we sought about data management practices was rarely conveyed on project websites. Therefore, to better understand the state of the data in citizen science, we conducted structured interviews with project managers or key personnel working on the data management aspects of 36 citizen science projects (see Appendix A for a full list).

Sampling framework
Members of the Task Group began by reviewing a range of literature on citizen science data practices across the data lifecycle to inform development of study methods.
We reviewed citizen science typologies and other classification schemes to create the sampling framework. Typologies were largely drawn from academic research, and covered aspects of citizen science including governance model (Haklay 2013; Shirk et al. 2012) and scientific research discipline (Kullenberg and Kasperowski 2016; Follett and Strezov 2015). Other classification schemes included UN regions for capturing geographic distribution, and controlled vocabularies used to document variables including type of hosting organization (e.g., university, community-based group, etc.).
The sampling framework allowed us to search for projects representing different types of diversity (e.g., in governance, in scientific research discipline, and in geographic distribution). Using this framework, we recruited participants through a three-step process. Our participants were recruited following a purposive sampling scheme designed to capture data management practices from a wide range of initiatives through a landscape sampling methodology (Bos et al. 2007). First, we pulled a random sample of citizen science projects from the SciStarter database, requesting the listed contact for each project to participate in our study. In this initial sample, we found that projects in environmental citizen science, particularly biodiversity, and projects based in the US were over-represented.
We then used our sampling framework to identify gaps in the sample and sought out projects not necessarily listed on SciStarter to fill the gaps. As gaps were filled, the research team met numerous times to discuss our evolving sample and early findings. The research team then conducted additional purposive sampling until theoretical saturation was reached (Weed 2006) at 36 interviews with citizen science projects and platforms. Note that while this sampling strategy appears successful in covering a wide range of citizen science projects, it is not intended to be statistically representative of the field as a whole, and only English-speaking projects were represented.

Data collection
We began our structured interview protocol with questions from our sampling framework. Interview questions addressed various practices related to data quality and data management (see Appendix B for the interview protocol). In addition to supporting our sampling methodology, these questions enabled us to collect valuable information to help characterize our sample. The second part of our interview protocol addressed practices related to data quality and data management. We focused on these practices because our review of the literature suggested that practices related to data quality and data management (as opposed to, for example, data security) may be unique to citizen science compared with other forms of scientific research. Grounding our protocol in the existing literature allowed us to create a structured protocol with multiple choice rather than open-ended questions. For example, rather than asking participants "Where can your data be accessed?" we asked, "Can your data be accessed from: a) Project website; b) Institutional repository; c) Topical or field-based repository; and/or, d) Public sector data repository?" Regarding data acquisition, we asked our participants to describe the full range of data collection or processing tasks that were used in their citizen science research. For data management (including data quality), we asked about quality assurance/quality control (QA/QC) processes, including those related to data collection but also human aspects such as targeted recruitment or training; instrument control such as the use of a standardized instrument; and data verification or validation strategies, such as voucher collection (e.g., through a photo or specimen) or expert review. We asked questions on data access, including whether access to analyzed, aggregated, and/or raw data were provided, and how data discovery and dissemination retrieval were supported (if at all).
Because they relied on known practices identified through existing literature, the vast majority of our questions were multiple choice, though participants were encouraged to elaborate on their answers or provide additional information.
Members of the research team conducted interviews or surveys, either in person, by phone, by Skype, or by email. Each team member followed the same structured protocol during the interview process, although open-ended questions allowed for the collection of richer detail on selected cases.

Data analysis
Analysis was conducted through tallying responses, comparing responses with previous research, and augmenting structured responses with unstructured comments. We also compared results with prior quantitative assessments of citizen science data practices, including Schade et al. (2017) and Wiggins et al. (2011).

Results
Although we did not structure interviews directly following the data lifecycle (Box 1), we solicited responses relevant to each step in the data lifecycle, except for data integration. Note that counts often exceed the total sample size because response categories are not mutually exclusive and many citizen science projects selected multiple response options for each item.
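The tallying approach, and the reason counts can exceed the sample size, can be sketched in a few lines. This is an illustrative example only: the four responses below are invented and are not data from the study, and the function name `tally_multiselect` is hypothetical.

```python
from collections import Counter

def tally_multiselect(responses):
    """Tally multi-select answers.

    responses is a list with one collection of selected options per project.
    Returns {option: (count, percent of projects)}, with percent rounded to
    the nearest whole number. Each project counts at most once per option.
    """
    n = len(responses)
    counts = Counter(
        option for selected in responses for option in set(selected)
    )
    return {
        option: (count, round(100 * count / n))
        for option, count in counts.items()
    }

# Invented answers from four hypothetical projects to a QA/QC question.
example_responses = [
    {"training", "standardized protocol"},
    {"training"},
    {"standardized protocol", "expert review"},
    {"training", "standardized protocol"},
]

result = tally_multiselect(example_responses)
for option, (count, percent) in sorted(result.items()):
    print(f"{option}: n = {count}, {percent}%")

# Because categories are not mutually exclusive, the counts sum to more
# than the number of projects.
total_selections = sum(count for count, _ in result.values())
print(total_selections, "selections from", len(example_responses), "projects")
```

The percentages reported in the following sections use the same convention, with the number of projects (36) as the denominator; for instance, 25 of 36 projects rounds to 69%.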
Early in our analysis, we found a number of discrepancies between self-reported information and actual practices. For example, our protocol asked project personnel to tell us, "Does the data set or access point include the name of a person to contact with questions?" A number of people we interviewed responded in the affirmative, and even suggested a specific name of their designated data point of contact, but a quick review of that project's digital presence in data catalogues, websites, and/or data repositories suggested that either no contact was given or the email listed was a generic one (e.g., info@projectname.org). In addition, the participants we interviewed, typically the scientific research leads, were not always familiar with the details of how their research was being supported by technological platforms or how their data were being managed. In some cases, an interviewee reached out to a colleague to provide follow-up information on data archiving. But in others, an interviewee offered information that was factually incorrect, for example, suggesting that a project launched with support from iNaturalist did not have the option to apply standardized data licenses when, in actuality, iNaturalist does offer this functionality. Because many of the details we asked about were not directly observable in projects' online presence, we were not able to systematically verify all of the data collected.
This finding informed our analysis and presentation of our results. For example, while our sample was large enough to support descriptive statistics such as tabulations, we believe that the format of statistical analysis implies a certainty and confidence in the findings that is not fully appropriate. A narrative reporting structure more closely aligns with the relatively exploratory nature of this study, and emphasizes the reliance of our methodology on self-report.

Characteristics of sample
The average start year of the projects in our sample was 2011, with the earliest year being 1992 and the most recent being 2017. Our sample was heavily weighted toward the environmental and biological sciences (n = 29, 81%), reflecting the early genesis of citizen science in these communities, but also included several health-related projects (n = 7, 19%), two VGI initiatives, a general-purpose crowdsourcing initiative, and a technology development project. Most of the projects were hosted in North America (n = 19, 53%). The remaining sample was from Europe (n = 7, 19%), Oceania (n = 7, 19%), Asia (n = 6, 17%), South America (n = 2, 6%), and Africa (n = 1, 3%). Host organizations included nonprofit organizations (n = 14, 39%); academic institutions (n = 12, 33%); government agencies, including federal, state, and tribal (n = 7, 19%); and for-profit companies (n = 3, 8%). Partnerships were plentiful, with ten projects (28%) designating one or more types of organization as host. Our sample included all participation models according to the Haklay (2013) typology, though not evenly. The sample included a majority of participatory science projects (n = 21, 58%), followed by crowdsourcing (n = 13, 36%), distributed intelligence (n = 2, 6%), extreme citizen science (n = 2, 6%), and volunteered computing (n = 1, 3%). Several projects reported multiple participation models, for example offering options that included participatory science contributions as well as crowdsourcing tasks. In terms of geographic scope, 11 projects (31%) were global in reach, 11 (31%) were national, six (17%) were tied to a locality such as a city or specific site, five (14%) were regional, and three (8%) involved online-only participation with no geographic component. Most of the projects involved data collection at sites chosen by the contributors, but several involved assignments to work in specific locations.

Data quality
Interview participants reported a high number of QA/QC mechanisms (Figure 1). All projects used at least one QA/QC method, while 34 (94%) used more than one method, and 22 (61%) utilized five methods or more.
Second, projects focused on the human aspects of data quality through training before data collection (n = 25, 69%) and/or on an ongoing basis (n = 11, 31%). Seven projects (19%) used targeted recruiting to find highly qualified volunteers. Four (11%) conducted volunteer testing or skill assessment.
Third, many projects approached data quality through standardizing data collection or analysis processes. Twenty-two (61%) used a standardized protocol. In addition, five (14%) used disciplinary data standards (e.g., Darwin Core for biodiversity data), and five (14%) used cross-domain standards (e.g., of the Open Geospatial Consortium [OGC]).
Fourth, many projects enabled data quality through instrument control. Fourteen (39%) used a standardized instrument for data collection or measurement. Five (14%) reported processes for instrument calibration.
Finally, a handful of projects documented their data quality practices. Seven projects (19%) shared what was classified as other documentation on a project website, while one project in our sample (3%) offered a formal QA/QC plan.
In addition to reporting on practices, many participants spoke at length about their data quality practices. Some indicated that data quality was secured through "very simple protocols and instructions." Upon reflection, one noted that the use of simple protocols led to data collection practices that were "standardized, but not deliberately standardized." Participants also taught us that data quality practices are often rich and contextual. One explained how data were vetted according to a six-pronged approach, where all published observations must be specific, complete, and appropriate ("the content is professional and for the purpose of education, not political or to further personal agendas"). A second described how traditional data quality metrics, such as temporal accuracy, were less relevant to their work than the ability to offer a detailed reporting of a phenomenon of interest.

Data infrastructure
Our survey did not delve heavily into infrastructure for two primary reasons. First, the topic of infrastructure did not emerge as significantly as other topics in our initial literature review and scoping process. Second, during the interview process, many of the project principals we interviewed were not very familiar with the back-end infrastructure supporting their projects. With this in mind, we noted that many projects adopted existing data collection applications and online communities, such as iNaturalist, BioCollect, CitSci.org, and Spotteron; leveraged existing crowdsourcing platforms, such as Zooniverse or OpenStreetMap; or developed their own fit-for-purpose platform with robust infrastructure, often including backups and redundancies. Smaller projects may rely on a volunteer technician to manage the data infrastructure, and here the lack of familiarity with the details of the IT backend of projects on the part of project managers may suggest some underlying fragilities.

Figure: Quality Assurance/Quality Control Methods (axis: Number of QA/QC methods).

Data security
For reasons similar to those offered above, the team did not pose specific questions related to the security of the systems used to store data (e.g., passwords, encryption, or two-factor authentication), nor did we examine provisions for long-term data stewardship (e.g., archiving in trusted digital repositories). As with issues around data infrastructure, it is likely that citizen science projects vary in maturity levels with regard to adherence to standard security protocols, from relatively weak to very robust. And as with larger findings on data infrastructure, we noted that many citizen science project leaders struggled to articulate specifics around data security approaches when the topic arose organically through the interview process. Finally, while no projects reported data losses or breaches, it is conceivable that these may have occurred as a result of piecemeal approaches to infrastructure, which itself may reveal the often limited funding available for more robust approaches.

Data governance
In citizen science and other scientific research, sensitive data are often obscured. For the purpose of this study, data sensitivity addressed both data or information about citizen scientists or crowdsourcing volunteers who are contributing to research, and sensitive data that are collected by a citizen science community (Bowser et al. 2014; Bowser and Wiggins 2015). Twelve projects (33%) specified that they removed or anonymized personally identifiable information (PII), five projects (14%) obscured location information (typically for sensitive species), four projects (11%) reported obscuring other confidential information, and one specifically did not record individual-level information in the first place. One project had a social networking model, whereby only members could view identifying information about other members and the observations they had made: in essence members opted in and volunteered information about themselves. Six projects (17%) that made their data openly available deliberately avoided any obscuring, with one noting that an informed consent process was used to make sure participants understood and were comfortable with what was shared.
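To make the two most commonly reported measures concrete, the following is a minimal Python sketch of removing PII fields and coarsening coordinates before publication. It is an illustration only, not the method of any project in the sample: the field names (`observer_name`, `observer_email`, `latitude`, `longitude`), the output field `coordinate_uncertainty_deg`, and the 0.1-degree grid size are all assumptions chosen for the example.

```python
import math

# Hypothetical PII field names; a real project would list its own schema here.
PII_FIELDS = {"observer_name", "observer_email"}


def obscure_record(record, cell_deg=0.1):
    """Return a public copy of `record` without PII and with coarsened coordinates.

    Coordinates are snapped to the south-west corner of a grid cell that is
    `cell_deg` degrees wide (an assumed value, not a recommended standard).
    """
    # Drop PII fields; the input dictionary itself is left untouched.
    public = {k: v for k, v in record.items() if k not in PII_FIELDS}
    for key in ("latitude", "longitude"):
        if key in public:
            snapped = math.floor(public[key] / cell_deg) * cell_deg
            public[key] = round(snapped, 6)  # tidy floating-point noise
    # Record how coarse the published location is, so users can judge fitness for use.
    public["coordinate_uncertainty_deg"] = cell_deg
    return public


raw = {
    "species": "Example species",
    "latitude": 38.8977,
    "longitude": -77.0365,
    "observer_name": "A. Volunteer",
    "observer_email": "volunteer@example.org",
}
public = obscure_record(raw)
print(public)
```

Publishing the grid size alongside the coarsened coordinates is one way to keep the obscuring step itself documented, which matters for the fitness-for-use questions discussed under data documentation.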

Data documentation
Regarding accompanying documentation about data collection activities, 13 (36%) included information on environmental conditions (e.g., weather, location details), 11 (31%) identified the methodology or protocol for data collection, three (8%) provided information about volunteers including characteristics or training levels, and two (6%) included equipment details or device settings. In addition, eight projects (22%) included multiple pieces of information from the foregoing categories, most commonly environmental conditions and protocol details. Only twelve (33%) provided no additional information whatsoever. Participants were also asked a series of questions about documentation of the research study. Thirteen (36%) mentioned publishing information about the methodology or protocol, while eight (22%) documented limitations. Five projects (14%) offered fitness-for-use statements or use cases. Sometimes these were simply disclaimers, such as "data is provided as is." Participants also identified information on different types of documentation that might be helpful in fitness-for-use assessments, including whether a designated contact was available to answer additional questions.

Data access
Questions on data discovery were designed to probe whether potential users could find information on the project or data. Questions on access covered raw data, analyzed or aggregated data, and digital data services. Eighteen projects made their data discoverable through the project website. Ten projects (28%) made data available through a topical or field-based repository (such as GBIF). Further, eight projects (22%) shared their data through an institutional repository, four (11%) through a public sector data repository, and two (6%) through a publication-based repository. Only nine projects (25%) did not easily enable secondary users to find their data. Notably, some projects' data were known to be redistributed by third parties, but interviewees were unable to specify the full range of discovery and access points (at least three projects).
Access to cleaned, aggregated data was mixed. Fourteen projects (39%) published open data, defined as "available for human or machine download without restriction." Thirteen (36%) offered data upon request, including by emailing the principal investigator (PI). Interestingly, one of the projects that made data available on request had actually developed a sophisticated data dashboard and gave permissions to 15 local government agencies, not advertising this because they lacked the capacity to handle more subscribers. Six projects (17%) published open data, but required processes like creating user accounts that effectively prohibited automated access. Seven projects (19%) stated that their data were never available, though one respondent commented that access "varies," and another indicated that data were available "only to project partners." An additional interviewee noted that "my priority is to publish first the results, and then I want to look for the ways that are in place to open those data as well." Participants were asked about their use of a persistent and unique identifier, such as a GUID (globally unique identifier) or DOI (digital object identifier), and their use of a standardized data license. Eleven projects (31%) offered a persistent and unique identifier to support reuse and citation; the other 26 (72%) either did not offer one, or participants did not know. Only 16 projects (44%) had a standardized license to support data reuse. For those projects licensing their data, Creative Commons licenses were the most common. CC-BY and CC-BY-SA licenses, which require attribution, were most frequently adopted (n = 8, 22%), with five projects embracing CC0 public domain dedication (14%) and three projects (8%) using another license, such as CC-BY-NC or CC-BY-NC-SA, that prohibited commercial use.
Beyond CC licenses, three projects (8%) reported holding or co-owning copyright, one project (3%) reported using an Open Database License (ODbL), and one project (3%) reported another unnamed license. However, 18 participants (50%) did not identify any standardized license for their data, and two participants (6%) didn't know whether their project had a license or not. Numerous participants provided commentary. Some suggested that licensing was the responsibility of another team member. Others indicated a general desire to "keep it open access" or believed that even if a standardized license was not used, "the site has a FAQ that somehow addresses these questions." Notably, data provided without a license or explicit terms of use cannot really be considered open data, an important detail discussed in greater depth later on.
Projects were typically open to inquiries about their data: Twenty-six projects (72%) provided some form of contact information for data inquiries, although seven (19%) had a general project contact but no data-specific contact person, and eight (22%) provided no contact details at all.

Data services
Access to analyzed (cleaned, aggregated, summarized, visualized) data was provided in a variety of forms. Nineteen projects (53%) shared findings through project publications or whitepapers, while 16 (44%) shared findings through peer-reviewed publications. Many projects noted that scholarly publication was "a longer-term goal." Only six projects (17%) provided no access to analyzed data. Many projects used other mechanisms for sharing, some of which were specific to the audiences they served. For example, one project offered a dashboard for State government agencies with explicit partnership agreements to access data, but did not make this service available to others.
Twenty-three projects (64%) offered digital data services. Of these, 16 (44%) provided tools for user-specified queries or downloads (with several also providing application programming interfaces [APIs] for machine queries), 14 (39%) made data available through web services or data visualizations, including maps, 10 (28%) offered bulk download options, and 5 (14%) provided custom analyses or services. In addition, 1 project was willing to provide data "on request." However, 14 projects (39%) provided no specific tools for accessing data resources.

Adoption of Best Practices
We found projects were generally implementing best practices with regard to data quality (as described by Wiggins et al. 2011), but were not implementing, and generally not aware of, best practices with regard to aspects of data management such as data documentation, discovery, and access.
In regard to data quality, we were encouraged to see the wide range of practices that projects employed. The majority of our sample (34 projects, 94%) used more than one method to ensure data quality, and 20 projects (56%) used five methods or more. That said, many could only articulate data quality methods when prompted, and only one had a systematic documentation of QA/QC through a formal plan. This suggests that, contrary to some external skepticism (e.g., Nature 2015), the issue with citizen science and data quality is not in actual practices, but with the documentation (or lack thereof) to describe the care and consideration taken with QA/QC.
Many projects demonstrated willingness to make their data available, for example by suggesting that data would be shared upon request. But we found that such a de facto attitude toward open access was not always backed by the appropriate licensing required to establish the legal (and ethical) conditions required for reuse, nor was provision of access in formats accessible to human and machine users alike a dominant practice. This finding supports prior research conducted within the field of biodiversity, which found that out of different types of data hosted in GBIF, citizen science data were among the worst documented and most restrictive (e.g., especially by prohibiting commercial reuse; Groom, Weatherdon, and Geijzendorffer 2016). While seemingly egalitarian, progressive, and in keeping with the community ethos of some citizen science initiatives, the restriction on commercial uses or the inappropriate application of share-alike licenses can prevent third parties from providing value-added data and services based on raw data, and may stymie private sector research and innovation that could be in keeping with project and participant values. It may also hinder a project's goals; for example, a primary customer of citizen science data for mosquito-vector monitoring could be commercial mosquito control groups. In addition, if citizen science data are enhanced owing to significant investments by companies, they may represent a real value proposition for all data consumers, including citizen scientists themselves. In such cases, CC-BY-NC and CC-BY-SA licenses can be viewed as regressive and not in keeping with open science principles, though the debate is nuanced and open. For example, biodiversity observations shared under an NC license cannot be used on Wikipedia (which supports a broader open data policy) to illustrate articles about species for which citizen science data may be the primary or best available records.
Some citizen science projects implemented best practices in regard to data governance, including access and control. Many solutions, such as location obscuration or masking PII, were designed to protect the privacy of humans and/or sensitive species. Further, at least a handful of the projects that did not leverage these solutions had thought about implications like privacy and made a deliberate decision to prioritize, for example, principles like notice and informed consent (see also Bowser et al. 2017).
In respect to data provenance and traceability, the use of DOIs and appropriately explicit licensing statements is an issue for establishing scientific merits. One respondent indicated that "The data could have been referenced in publications, but we don't know about it," a situation that could be remedied by the use of DOIs. The global biodiversity informatics community has long recognized this issue too, and has made some progress on data archiving (Higgins et al. 2014). As one example, GBIF, together with its partners and members, implemented DOI minting and tracking mechanisms to link publications citing data sources with the original source data, which include datasets sourced from citizen science projects. While users are not required to use DOIs or even to attribute to the referenced data (CC-BY is a "requirement" that may not be enforced), it is becoming an increasingly prevalent practice in the science community. This is a persistent issue related to data citation practices: It's harder to establish impacts for fully open access data. And some practices, such as requiring registration for access to data, can help to track usage, but may serve as an impediment for some users.
Finally, when datasets are not adequately described with relevant metadata, their potential for secondary uses is significantly compromised, frequently resulting in whole datasets being discounted as untrustworthy and reinforcing the perceptions of poor rigor in citizen science. Addressing this perception is critically important for citizen science-generated data to gain more trust within the research sector.
Across all aspects of data management, we found a few projects following best practices in every category, but most projects had a mishmash of practices and a clear work-in-progress narrative with respect to evolving practices as project activities progressed. As one respondent commented, "We really want scientists to use the data but we're not at a point where we would recommend that they use the data," and multiple projects reported plans to achieve higher levels of data management for several items we asked about. Further, many respondents, including project managers who had dedicated IT support or leveraged an external platform, often did not know details of their data management practices, as these duties were delegated to others (consistent with Wiggins et al. 2011). In a similar vein, several respondents noted that they had not written their project's data management plans nor designed the technological workflows themselves; these tasks had been outsourced, leaving our respondents unable to fully answer the questions asked.
This was particularly notable for projects whose data access, services, and persistent identifiers were provided by a platform that offered data hosting. While this may be a reasonable option, particularly for smaller or startup citizen science projects, and whereas taking advantage of the expertise of an interdisciplinary team is often advocated, it can clearly lead to a lack of awareness about data practices, with potential consequences for data strategies. Outsourcing may lead to, or be a sign of, inattention to the importance of decisions made by the data host. This inattentiveness could lead to issues down the road if infrastructure should fail or security be lax. As one respondent explained, each project that was affiliated with the larger program made their own data sharing decisions, but deciding to make data openly available did not mean that the lead researcher assumed responsibility for depositing the data into an open access database with a persistent identifier. Project managers were not always sure who was responsible for carrying out policy-oriented dictates for data management and preservation. While not all data need to be archived, at present probably too little are being proactively preserved for the long term.
In some cases, the adoption of best practices in citizen science data management may be similar to or lagging only slightly behind those of conventional science. For example, we found that in regard to data discovery and access, ten projects (28%) made data available through a topical or field-based repository (such as GBIF), eight (22%) through an institutional repository, four (11%) through a public sector data repository, and two (6%) through a publication-based repository. In comparison, Tenopir et al. (2015) found that 27.5% of the researchers in their survey made their data available through a discipline-based repository, 32.8% through an institutional repository, and 18.4% through a publication-based repository. Comparing these studies suggests that both citizen and conventional science lag far behind the ideal. But the consequences are more significant for citizen science. Widespread adoption of best practices in data management in citizen science would provide much needed transparency about data collection and cleaning practices and could go a long way in advancing the reputation of the field. It could also help satisfy citizen science's commitment to ethical principles, as outlined in the Holdren Memorandum and ECSA's 10 principles of citizen science (Holdren 2015).
While the questions on data management and discovery practices often focused on a scientific user audience, it is important to recall that the scientific research community isn't always the primary audience for a citizen science project: Local communities, students, or other parties may be a target audience, for whom access through a project website is preferable and analyzed products may be preferred over raw data access. However, data access is also reflective of current archival practices and long-term stewardship choices. From this perspective, most of the projects in this study were not positioned to ensure long-term access to data, and in the majority of cases, data sustainability appears tenuous at best.

Infrastructure and technology impacts
Databases, software applications, mobile apps, and other e-infrastructures supporting citizen science have a significant role to play in facilitating improvements in data quality. Such infrastructures can, if they conform to appropriate standards and use good design principles, make the data more discoverable, more accessible, more reusable, more trusted, more interoperable with other systems, more accurate, and less prone to human-induced errors (Brenton et al. 2018). Good design and open infrastructures enable efficient and simple data recording and management by using workflows, processes, and user-centered design to minimize the risk of user errors and ensure that consistent data formats and mandatory attributes are recorded correctly, along with consistent use of vocabularies, spatial referencing, and dates. At the same time, providing project managers with adequate and easily understood reference information about the default policies that apply to hosted data seemed to be a clear gap for our respondents.
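As an illustration of how a data-entry tool might enforce mandatory attributes, consistent vocabularies, spatial referencing, and dates at the point of entry, the following minimal Python sketch validates a single observation record. It is a sketch under stated assumptions rather than a description of any platform named in this paper: the field names, the list of required fields, and the small `LIFE_STAGES` vocabulary are hypothetical.

```python
from datetime import date

# Hypothetical schema: required fields and a small controlled vocabulary.
REQUIRED = ("species", "event_date", "latitude", "longitude")
LIFE_STAGES = {"egg", "larva", "juvenile", "adult"}


def validate_observation(obs):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED
        if obs.get(field) in (None, "")
    ]

    # Dates: accept only values that parse as ISO 8601 calendar dates.
    if obs.get("event_date"):
        try:
            date.fromisoformat(obs["event_date"])
        except (TypeError, ValueError):
            problems.append("event_date is not an ISO 8601 date (YYYY-MM-DD)")

    # Spatial referencing: decimal degrees within valid ranges.
    lat, lon = obs.get("latitude"), obs.get("longitude")
    if isinstance(lat, (int, float)) and not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if isinstance(lon, (int, float)) and not -180 <= lon <= 180:
        problems.append("longitude out of range")

    # Vocabulary: optional field, but only listed terms are accepted when present.
    stage = obs.get("life_stage")
    if stage is not None and stage not in LIFE_STAGES:
        problems.append(f"life_stage '{stage}' is not in the controlled vocabulary")

    return problems


good = {"species": "Example species", "event_date": "2018-05-14",
        "latitude": 45.5, "longitude": -122.6, "life_stage": "adult"}
bad = {"species": "", "event_date": "14/05/2018",
       "latitude": 145.5, "longitude": -122.6, "life_stage": "grown-up"}

print(validate_observation(good))
for problem in validate_observation(bad):
    print(problem)
```

Returning a list of problems, rather than rejecting a record at the first error, lets an application show contributors everything that needs correcting in one pass.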
At the global scale, and indeed in many countries, it would be fair to say that the e-infrastructures currently supporting the majority of citizen science projects are largely functioning independently of each other and are not often adequately ascribing metadata to describe the datasets and methods. In addition, very few e-infrastructures are currently implementing any commonly used data standards. This effectively isolates these systems from each other and from being able to share data in ways that can open doors to important new scientific insights through, for example, larger aggregated views and analyses based on spatially and temporally dense datasets.
However, there are examples in some countries where efforts are being made to bridge the e-infrastructure divide. Firstly, the Public Participation in Scientific Research-Core (PPSR-Core) project is an initiative of the citizen science associations (US, European, and Australian Citizen Science Associations) in partnership with the OGC and World Wide Web Consortium (W3C) to develop a set of standards for citizen science data, metadata, and data exchange protocols. Within each of the association regions there are separate third-party platform-based initiatives to support individual citizen science projects (e.g., CitSci.org, Zooniverse, iNaturalist and SciStarter [US]; BioCollect [Australia]; and Spotteron [Europe]). Some of these multi-project platforms are already implementing the PPSR-Core standards as they evolve and are already sharing project-level metadata amongst each other to improve the discoverability of citizen science projects. As a next step, researchers working with Earth Challenge 2020 and the Frontiers open access publication series are creating a metadata repository to facilitate the discovery and access of citizen science data.
Assuming that standards and best practices already exist in a form that is accessible and usable enough to apply in e-infrastructure and data management solutions (which was not universally the case at the time of writing), providers should codify them into their software to ensure consistency and offer guidance for users, particularly those inexperienced with such matters. However, one interviewee noted that adopting a third-party platform to manage their data did not allow them to direct data management practices because they didn't have control of the technical infrastructure to impose their own field-specific or project-specific preferences. This presents a significant challenge for infrastructure providers, as it suggests that software is expected to be both highly configurable around individual user needs while applying standards, rules, and workflows that assist users to apply best practices in data collection and management. At the extremes, these are diametrically opposed concepts, but it is possible to provide flexible solutions within a standards-constrained environment. Achieving the right balance between flexibility and appropriately structured constraints will require both project owners and infrastructure providers to be aware of standards and best practices, as well as for providers to be transparent as to if or how they are applied in their platforms.

The human dimension
A fundamental rationale for improving data management practices in citizen science is to ensure the ability of citizens, scientists, and policy makers to reuse the data for scientific research or policy purposes. Mayernik (2017) explores how hard and soft incentives can help support open data initiatives. Hard incentives include requirements by funders like the National Science Foundation (NSF) in the USA for researchers to supply data management plans or requirements from publishers that mandate publishing data in conjunction with a research article. Mayernik also uses the concepts of accountability and transparency to explore additional factors that may limit reuse. Transparency includes requirements for making data discoverable and can be charted on a spectrum. For example, providing a link to data online with brief textual descriptions is less transparent than registering data in a catalogue (metadata repository) with standardized descriptions and/or tags.
Culture also has a significant role to play. In line with broader discussions of open science (David 2008; Levin and Leonelli 2017; Munafò et al. 2017), traditional academic cultures often fail to incentivize researchers for good data management to enable reuse. Here, the use of DOIs can be a technical solution that also enables cultural change if researchers can get credit when other researchers are able to find, use, and ultimately cite their data. There is also an opportunity for cultural change specifically within the citizen science community. By evoking aspirational guidelines such as those outlined in the Holdren Memo and ECSA's 10 principles (Holdren 2015), linking good data management practices to already-articulated community values like transparency can create pressure for researchers to make their data more discoverable and accessible as an ethical imperative.

Conclusions and Recommendations
While citizen science has emerged as a promising means to collect data on a massive scale and is maturing in regard to data practices, there is still much progress to be made in approaches to the data lifecycle from acquisition to management to dissemination. This reflects the speed of development of scientific data management norms and the fact that the scientific community as a whole has difficulty keeping up. However, it may also reflect lack of resources, particularly for smaller or startup citizen science efforts that struggle to maintain staff and funding and perhaps find that data management falls to the bottom of the to-do list. Finally, the fact that many of those who start citizen science projects are motivated primarily by intellectual curiosity, educational goals, environmental justice, or the desire to inform society about significant challenges, may be reflected in project founders who may lack the background in data practices that could carry their work to the next level. The characterization of data practices in this paper is not intended as a criticism of the field, but rather an effort to identify areas where improvements are needed and to provide a call to action and greater maturation. We will have succeeded to the degree that we have educated the citizen science community about emerging practices that can help to improve the usability of their data for not only scientific research but also to solve important societal and environmental problems.
For projects that seek to elevate the value of their data for reuse, we propose a number of steps that could help to increase conformity to data management best practices (Box 2).
There are a number of limitations to this research, including the small sample size and the reliance on self-reported information by respondents. Reliance on self-reported information is particularly challenging given the discrepancy between self-reported information and actual practices, as described above.
These discrepancies offer significant opportunities for research and practical work. While the finding that project leaders do not necessarily understand their data management practices offers an important insight, there is a need for clarity regarding what actual practices are most and least common. A follow-up study could compare self-reported with actual practices by, for example, complementing self-report methodologies with desk research, perhaps developing profiles of projects with certain data management practices, or even quantifying the strength of data management approaches. There is a related opportunity to conduct studies of research role differentiation within citizen science projects, and map the different types of expertise, such as scientific, technological, or educational knowledge, represented on a project support team, which may be distributed across a number of departments or institutions.
Our landscape sampling framework sought to identify and characterize a wide range of practices across different types of citizen science projects. Others, including Schade and colleagues (2017), have leveraged different methodologies, such as large-scale surveys, that attempt to gain a more representative view. Future research could leverage random or purposive sampling to build

Box 2: Recommendations.
This box provides key recommendations for improving data management practices that can be applied across a wide range of citizen science initiatives. Recommendations are offered for individual researchers, and for the field writ large. Additional helpful information may be found in a primer published by DataONE (Wiggins et al. 2013), though more work may be needed to identify an updated set of best practices for broad citizen science communities to use.
Data quality: While significant quality assurance/quality control (QA/QC) checks are taken across the data lifecycle, these are not always documented in a standardized way. Citizen science practitioners should document their QA/QC practices on project websites and/or through formal QA/QC plans. Researchers seeking to advance the field could help develop controlled vocabularies for articulating common data-quality practices that can be included in metadata for data sets and/or observations.
Data infrastructure: Citizen science practitioners should consider leveraging existing infrastructures across the data lifecycle, such as for data collection and data archiving, e.g., in large and stable data aggregation repositories. Researchers seeking to advance the field should fully document supporting infrastructures to make their strengths and limitations transparent and increase their utility, as well as develop additional supporting infrastructures as needed.
Data governance: Relevant considerations include privacy and ethical data use, such as ensuring the protection of sensitive location-based information, personally identifiable information (PII), and proper use of licensing. Citizen science practitioners should carefully consider tradeoffs between openness and privacy. Researchers seeking to advance the field could develop standard data policies, including privacy policies and terms of use, that clearly outline data governance practices.
Data documentation: Citizen science practitioners should make discovery metadata (structured descriptive information about data sets) available through data catalogues, and should share information on methods used to develop data sets on project websites. Researchers seeking to advance the field could develop controlled vocabularies for metadata documentation, particularly to enable fitness for purpose assessments.
Data access: In addition to discovery metadata, citizen science practitioners should select and use one or more open, machine-readable licenses like the Creative Commons licenses. Researchers seeking to advance the field should identify, share information about, and if necessary develop long-term infrastructures for data discovery and preservation.
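Several of the recommendations in Box 2 (documented QA/QC practices, discovery metadata, a machine-readable license, and a persistent identifier) can be brought together in a single structured record. The Python sketch below shows one hypothetical way to do so and to flag which recommended elements are still missing. The field names and the `RECOMMENDED` list are assumptions made for illustration and do not correspond to PPSR-Core or any other published standard; the license value uses the SPDX short identifier for CC BY 4.0, and the URLs are placeholders.

```python
import json

# Hypothetical list of recommended discovery-metadata elements.
RECOMMENDED = (
    "title",
    "description",
    "license",
    "identifier",
    "methods_url",
    "qa_qc_summary",
    "contact",
)


def missing_fields(metadata):
    """Return the recommended fields that are absent or empty in `metadata`."""
    return [field for field in RECOMMENDED if not metadata.get(field)]


metadata = {
    "title": "Example stream monitoring observations",
    "description": "Volunteer-collected water quality readings (illustrative record).",
    "license": "CC-BY-4.0",   # SPDX short identifier for CC BY 4.0
    "identifier": None,        # no DOI or other persistent identifier assigned yet
    "methods_url": "https://example.org/protocol",
    "qa_qc_summary": "Volunteer training; standardized protocol; expert review of outliers.",
    "contact": "data@example.org",
}

# Serializing to JSON gives a machine-readable record that a catalogue could ingest.
print(json.dumps(metadata, indent=2))
print("Missing:", missing_fields(metadata))
```

A check of this kind could be run before a dataset is deposited, so that gaps such as a missing identifier or license are caught by the project team rather than by secondary users.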