Phase 1: Demand | Phase 2: Supply | Phase 3: Collaboration | Phase 4: Design | Phase 5: Implementation | Phase 6: Communication | Phase 7: Learning | Phase 8: Iteration

Establishing an effective data collaborative requires a detailed, upfront understanding of the demand side of the initiative: i.e., an in-depth understanding of the problem to be addressed and the opportunity provided by cross-sector data sharing.

Step 1: Define the problem to be solved

Discussion

Gaining a clear sense of the problem you are seeking to solve is an essential first step in establishing a successful data collaborative. While defining the problem might seem straightforward, achieving the level of precision needed for a well-targeted data collaborative requires an in-depth assessment of the problem space. Zeroing in on the problem can be difficult. Pinpointing an actionable problem with a specific data collaboration solution, rather than a vague issue, requires writing and rewriting a one-page Problem Statement. The Problem Statement articulates the problem with precision and makes assumptions explicit. It asks and answers why the problem has not been solved yet. It might also address who is harmed by the problem, why they are harmed, and what the root causes of the problem are. It might take several drafts to strip the statement down to an actionable problem and its causes. But undertaking this exercise, especially collaboratively with the key stakeholders involved in the issue, will help to build consensus behind the implementation of the project.

Questions & Considerations

  • Describe the issue to be addressed
  • Identify the beneficiaries
  • Explain why now
  • Articulate why it matters
  • Note any assumptions
  • Examine any counterarguments or related controversies
  • Explore current work to address the problem (internally and externally)

Step 2: Define the value proposition of the data collaborative

Discussion

Research and practice in data collaboratives point to a number of societal benefits arising from the cross-sector sharing of data. After clearly defining the problem that a data collaborative will address, organizers should pin down the specific benefit a data collaborative could offer. Without assessing the value of the data collaborative, tradeoffs cannot later be measured. To understand whether the use of corporate data is worthwhile despite the risks involved, and to find the proper steps to mitigate those risks, it is important to evaluate the context for the use of the data. This might involve an assessment of the urgency of having access to the data. Is there a disaster relief component or other time sensitivity? If it is hard to define the problem or the value, it will be impossible to evaluate the success of any data project. Justifying the risks and the potential liability that arise from using and analyzing corporate data requires a clearly articulated benefit that can be measured.

Questions & Considerations

What is the intended societal benefit of the data collaborative?

  • Situational awareness and response?
  • Public service design and delivery?
  • Knowledge creation and transfer?
  • Prediction and forecasting?
  • Impact assessment and evaluation?

Examples

  • The NCI’s Genomic Data Commons (GDC) contains NCI-generated data from a number of cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). The primary goal of the GDC is to provide the cancer research community with a unified data repository supporting cancer genomic studies.
  • Global Fishing Watch is the result of a collaboration among Google, Oceana and SkyTruth to map and measure fishing activity worldwide by mapping data from the Automatic Identification System (AIS), used by more than 100,000 vessels worldwide. This information from Global Fishing Watch can be used by governments to ensure that fishing regulations are adequately monitored and tracked, allowing them to respond to illegal fishing rapidly and efficiently.
  • Grameenphone, the leading telecommunications provider in Bangladesh, shares its mobile call data records to understand climate impacts by mapping population flows before and after extreme weather events.

Once the demand side of the equation is well understood and articulated, the supply of data and expertise (i.e., human capital) should be explored in order to determine how the supply can address the demand (or whether supply and demand do not match and a data collaborative is therefore not suited to the issue at hand).

Step 3: Data Science Expertise and Organizational Competency

Discussion

Cross-sector data sharing requires not only the bringing together of diverse datasets, but also collaboration between people and organizations with different skills and institutional norms. An upfront understanding of how human and institutional capacity and cultures either do or do not mesh can help to define optimal roles and responsibilities and identify capacity gaps that need to be filled (through additional partnerships or other mechanisms).

Questions & Considerations

Internal
  • Evaluate and review internal capacity
  • Build high-level internal support
  • Identify internal data literacy and human capital
  • Collect and share examples of how other organizations have leveraged data
External
  • Which entities could provide additional competencies (data science and otherwise)?

Step 4: Data Supply

Discussion

In many ways, data collaboratives are means for filling gaps in existing institutional data supplies. In order to fill such gaps, organizers should conduct an upfront audit of both the existing internal data supply and the potential supply of data existing in other sectors. Such an audit can provide insight regarding the data relevant to your project, who has it, who needs it, and how it can be used to tackle the problem. Data audits can also help inform and prioritize outreach to external data holders and ensure that newly accessible datasets are well-positioned for filling the most important data gaps. Too often the assumption is that all data must be made available when only a few data points are needed. As part of the data inventory, other considerations may include assessing the reputation and reliability of those who have the data and those who want it in terms of security and data responsibility. In and of itself, undertaking the data inventory helps to mitigate risk by helping to develop strategies for how to use the data. A more detailed analysis by interdisciplinary experts may help to identify technical or procedural workarounds for seemingly difficult or expensive tasks.

Questions & Considerations

Internal: Conduct due diligence research and a data audit
  • Develop a data inventory: what data do you already have?
  • What data currently exists that will be useful in addressing the problem?
  • What are the data gaps that slow progress toward addressing the problem?
External: Map the potential supply side based on identified gaps
  • Which entities (whether private, governmental or nongovernmental) possess and could provide the needed data?
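
The audit questions above can be made concrete with a small inventory. The following Python sketch is illustrative only: the `Dataset` record, the `find_gaps` helper, and all dataset and holder names are hypothetical and not part of the canvas. It sorts the data needs of a project into those already covered internally, those an external holder could cover, and those with no known source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str          # hypothetical dataset label
    holder: str        # organization that holds the data
    internal: bool     # True if your own organization holds it
    covers: frozenset  # data needs this dataset can satisfy


def find_gaps(needs, inventory):
    """Split needs into: met internally, met only externally, and unmet."""
    met_internally = set()
    external_holders = {}
    for ds in inventory:
        for need in ds.covers & needs:
            if ds.internal:
                met_internally.add(need)
            else:
                external_holders.setdefault(need, []).append(ds.holder)
    # A need already met internally does not require outreach.
    external_only = {
        need: sorted(holders)
        for need, holders in external_holders.items()
        if need not in met_internally
    }
    unmet = needs - met_internally - set(external_only)
    return met_internally, external_only, unmet


# Hypothetical example values, for illustration only.
needs = frozenset({"business_registrations", "foot_traffic", "commercial_rents"})
inventory = [
    Dataset("open_business_data", "own agency", True,
            frozenset({"business_registrations"})),
    Dataset("pedestrian_counts", "Startup B", False,
            frozenset({"foot_traffic"})),
]

met_internally, external_only, unmet = find_gaps(needs, inventory)
print("Held internally:", sorted(met_internally))
print("Outreach targets:", external_only)
print("No known source:", sorted(unmet))
```

Keeping the inventory at the level of specific data needs, rather than whole datasets, reflects the point made in the discussion that often only a few data points are required.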

Enablers

  • Engaging domain experts and outside expertise to better understand what types of data would be useful and which entities might have access
  • Exploring data intermediaries that help to make available data useful to target audiences

Examples

  • The New York City Business Atlas was developed by the Mayor’s Office of Data Analytics (MODA) by bringing together open government data relevant to small businesspeople interested in opening physical locations in the city to inform their decision-making. Official datasets, however, did not provide a sense of the foot traffic in different areas of the city. To fill this gap, MODA partnered with Placemeter, a local startup that analyzes video imagery to understand pedestrian flows throughout the city.

Data collaboratives are, by definition, collaborations. The work done to this point sets the stage for strategically taking the first steps toward establishing a cross-sector data-sharing arrangement. Choosing the right supply-side partner and adequately incentivizing them to collaborate are key steps for making the data collaborative a reality.

Step 5: Select the most promising potential supply-side data providers and identify specific incentives for them to participate

Discussion

When and why corporations contribute their data differs according to the context in which the data is being requested or shared, the question that access to their data may answer, and the corporate and legal culture of the firm. Different corporations also have different views regarding the expected benefits and risks of sharing their data. As such, when firms extend themselves and share their data, they seek to satisfy a variety of motivations.

Questions & Considerations

What entities are best suited for filling identified gaps in data and/or expertise?
Why would different actors be motivated to collaborate?
  • Reciprocity: corporations may share their data with others for mutual benefit, especially to gain access to other data sources that may be important to their own business decisions. Some corporations may also reciprocate out of a sense of “giving back what was taken” from individuals and society-at-large.
  • Research & insights: opening up their data may generate new answers to particular questions providing companies insights that may not have been extracted otherwise. Just as with open source, sharing data (and in some cases algorithms) can enable corporations to tap into data analytical skills (often free labor) distributed beyond the boundaries of their own company. External users may interrogate the data in new ways and use the skills and methodologies not readily available in the company. It may also create the potential to identify and hire valuable talent that can emerge from data. In addition, these insights may enable companies to identify new niches for activity and to develop new business models.
  • Reputation and public relations: sharing data for public good may enhance a firm’s corporate image and reputation, potentially attracting new users and customers. It may also offer an opportunity to gain (free) media attention and increase visibility among certain decision makers and other audiences.
  • Revenue generation: opening up corporate data does not always have to be for free. Under some conditions, corporate data may be offered for sale, generating extra revenue for firms.
  • Regulatory compliance: sharing data can also help corporations comply with sectoral regulations and become more transparent and trusted. In addition, many corporations generate data solely for the purpose of regulatory compliance. Sharing and using that data responsibly (see below regarding some of the risks of opening up data), in ways that benefit both the public and the private sector, can leverage more broadly the investment made to collect the data for a narrow purpose.
  • Responsibility and corporate philanthropy: finally, sharing corporate data achieves many of the goals sought by traditional corporate social responsibility or philanthropy, where a company derives value from socially responsible behavior not just because of the positive image such an activity produces, but because opening up data can also improve the competitive business environment within which the business operates.

Enablers

  • Understanding the potential data-provider’s biases and priorities
  • Establishing a win-win articulation rather than a “data philanthropy” framing

Examples

  • Accelerating Medicines Partnership (AMP): The ten pharmaceutical companies participating in the AMP pool their data to help the National Institutes of Health and Food and Drug Administration accelerate research into disease treatment. For each company, sharing its data opens up access to data from its competitors, helping, for instance, to avoid redundant R&D efforts.

The design stage involves determining the specifics of how the data will be shared and used, an assessment of the most salient risks and likely harms, the creation of a targeted strategy for mitigating those risks, and work to determine the ongoing governance framework for the data collaborative.

Step 6: Define ideal type of data collaborative based on supply/demand

Discussion

Data collaboratives exist in a number of forms, each of which is better suited to a certain problem or data type. The current field of practice shows six main types of data collaboratives. The preceding steps in this canvas should help to make clear which of the following collaboration mechanisms is best suited to the opportunity at hand.

Questions & Considerations

Given the insights gained during previous steps, which data collaborative type is best suited to the problem at hand?

  • Trusted Intermediary, where companies share data with a limited number of known partners. Companies generally share data with these entities for data analysis and modeling, as well as other value chain activities.
  • Prizes or Challenges, in which companies make data available to qualified applicants who compete to develop new apps or discover innovative uses for the data. Companies typically host these contests in an effort to incentivize a wide range of civic hackers, pro-bono data scientists and other expert users to find innovative solutions with the available data.
  • Research Partnership, in which corporations share data with universities and other research organizations. Through partnerships with corporate data providers, several research organizations are conducting experiments using anonymized and aggregated samples of consumer datasets and other sources of data to analyze social trends.
  • Intelligence Products, where companies share (often aggregated) data that provides general insight into market conditions, customer demographic information, or other broad trends.
  • Application Programming Interfaces (APIs), which allow developers and others to access data for testing, product development, and data analytics. Developers typically sign a terms of service agreement, and companies in turn give them access to streams of their data in order to build applications.
  • Corporate Data Cooperatives or Pooling, in which corporations — and other important dataholders such as government agencies — group together to create “collaborative databases” with shared data resources. These collaborations typically require an organizing partner as well as technical and legal frameworks surrounding the use and distribution of the data.

Enablers

  • Designing the data collaborative based on the intended audience or beneficiary

Examples

  • Trusted Intermediary: South Africa-based telecom MTN makes anonymized call records available to researchers through a trusted intermediary, Real Impact Analytics, a data analytics firm that provides guided and predictive analytics solutions through its Data for Good Program.
  • Prizes or Challenges: In Ivory Coast and Senegal, Orange Telecom hosted a global challenge – the Orange Telecom Data for Development Challenge – that allowed researchers to use anonymized, aggregated data to help solve various development problems, including those related to transportation, health, and agriculture.
  • Research Partnership: Yelp shares its data on neighborhood businesses with 30 universities for researchers to build tools and discover meaningful value in the data. Using shared data on Yelp businesses in the San Francisco Bay Area, an academic research team from U.C. Berkeley used a probabilistic model for natural language processing to detect subtopics across a dataset of over 200,000 Yelp business reviews. Their research uncovered correlations between positive ratings and service quality, giving business owners evidence for improving their services.
  • APIs: Facebook Open Graph Search allows consumers and companies to mine social graphs for search query-based data, such as demographic and location data, “likes,” and multimedia. Companies such as Slate and Upworthy have used available data from Open Graph Search to optimize their headlines and increase readership.
  • Corporate Data Cooperatives or Pooling: Through its Accelerating Medicines Partnership, the US National Institutes of Health (NIH) is helping organize data pooling among the world’s largest biopharmaceutical companies in order to identify promising drug and diagnostic targets for Alzheimer’s disease.

Step 7: Assess major risks, ethical concerns and potential challenges

Discussion

The collection, processing, sharing, analysis and use of data introduce a number of risks and challenges for stakeholders involved in data collaboratives. Rather than seeking to mitigate the realized harms arising from those risks after the fact, stakeholders should seek to understand the risks at every stage of the data lifecycle in order to develop well-targeted strategies for mitigating them.

Questions & Considerations

Understand risks across the data lifecycle (with particular focus on the sharing and use stages)
  • Collection
    • Collection of inaccurate or “dirty” data
    • Unauthorized data collection
    • Incomplete, non-representative sampling
  • Analysis & Processing
    • Insufficient, outdated, or inflexible security provisions
    • Aggregation/correlation of incomparable datasets
    • Lack of academic rigor
    • Each of the above can heavily influence the outcome of a study or misrepresent the data
  • Sharing
    • Incompatible cultural or institutional norms or expectations
    • Lack of stewardship on both ends to ensure responsible sharing of personally identifiable information as it travels across cases and sectors
    • Improper or unauthorized access to shared data
    • Conflicting legal jurisdictions and different levels of security
  • Use
    • Controversial or incongruous data usage
    • Misinterpretation of data
    • Possible re-identification of individuals
    • Decisional interference
  • Ensuring proprietary data isn’t subject to Freedom of Information laws
  • Does the data collaborative (or the partner) raise any sensitive political concerns?
  • Does the collaboration comport with the cultural (and societal) expectations of all parties?
List potential unintended consequences
  • Profiling and Discrimination
  • Entrenching Existing Biases and Power Dynamics
Anticipate potential harms to corporations
  • Criminal or civil legal investigations and/or regulatory fines
  • Loss of regulatory licenses, standards, certifications
  • Reputational and industrial damages, impacting competitive positioning and advantage:
    • share price and/or cost of capital
    • customer attrition rates
    • employee recruitment, productivity and retention
  • Overall increase in operating expenses
Anticipate potential operational challenges

Enablers

  • Understanding that risks can be cumulative, i.e., that risks at the collection stage can grow and compound at later stages of the data lifecycle.

Examples

  • InBloom aimed to store and aggregate student data for states and districts but was met with privacy concerns regarding the use and storage of personally identifiable information. The firm shut down in 2014.

Step 8: Develop a multi-faceted risk mitigation strategy

Discussion

Armed with a better understanding of the risks present in a data collaborative, organizers can develop strategies and responsibility frameworks to help mitigate those risks before they have real-world consequences.

Questions & Considerations

  • Prioritize risks according to their likelihood of becoming reality and the severity of the harms they would create.
  • Consider alternative mechanisms or datasets that would not introduce the same level of risk.
  • Study how similar uses of data by other entities were either successful or unsuccessful in mitigating risks (e.g., through anonymization, data security techniques).
  • Ensure all stakeholders are in agreement regarding the ways in which risks will be mitigated, and the priority placed on mitigating especially salient risks.
  • Continue to monitor risks and the effectiveness of mitigation strategies throughout the lifespan of the data collaborative.
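
The first consideration above, prioritizing by likelihood and severity, can be sketched as a simple risk register. This is a minimal Python illustration under stated assumptions: the 1-to-5 scales, the likelihood-times-severity score, and the ratings themselves are hypothetical choices, not something the canvas prescribes. The risk descriptions are taken from the lifecycle list in Step 7.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    stage: str       # data lifecycle stage, e.g. "collection", "sharing", "use"
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    severity: int    # assumed scale: 1 (minor harm) to 5 (severe harm)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by severity.
        return self.likelihood * self.severity


def prioritize(risks):
    """Order risks from highest to lowest score, breaking ties by severity."""
    return sorted(risks, key=lambda r: (r.score, r.severity), reverse=True)


# Hypothetical ratings, for illustration only.
risks = [
    Risk("Possible re-identification of individuals", "use", 2, 5),
    Risk("Collection of inaccurate or dirty data", "collection", 4, 2),
    Risk("Improper or unauthorized access to shared data", "sharing", 3, 4),
]

for risk in prioritize(risks):
    print(f"{risk.score:>2}  {risk.stage:<10}  {risk.description}")
```

Stakeholders would need to agree on the scales and ratings together. The resulting ordering is a starting point for discussion rather than a substitute for it.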

Enablers

  • An upfront understanding of the most salient risks

Examples

  • UN OCHA Data Responsibility Framework: Working with the GovLab, Harvard Humanitarian Initiative, and Leiden University Centre for Innovation, the U.N. Office for the Coordination of Humanitarian Affairs developed a Data Responsibility Framework to inform the use of shared data during humanitarian crises.

Step 9: Agree upon terms and conditions for arrangement

Discussion

Beyond risks related to the data lifecycle, data collaboratives introduce questions and uncertainties around roles and responsibilities, ownership, intellectual property and other concerns. The creation of a list of agreed-upon terms and conditions can ensure clarity regarding such questions and help to avoid misunderstandings.

Questions & Considerations

  • Liability
  • Intellectual property provisions
  • Data ownership and handling
  • Cost
  • Public release and transparency

Step 10: Establish a Governance Structure

Discussion

Establishing a data collaborative requires a number of upfront efforts across stakeholder groups. The many decisions to be made and responsibilities present in such an arrangement, however, do not end at the implementation stage. An agreed-upon governance structure for the lifespan of the data collaborative can help to ensure that the processes for making important decisions (whether, for example, related to new uses for datasets or unanticipated risks coming into view) are clearly defined and understood by all parties. In addition, for the effort to be seen as legitimate, the process of developing data collaborative policies needs to be collaborative, engaging and consulting with a variety of groups, including both the private sector and impacted citizens. Such consultation is also part of identifying potential benefits, and these steps can be brought together even though we distinguish between them here.

Questions & Considerations

  • What will the data collaborative’s decision-making process and hierarchy be?
  • How can stakeholders create feedback loops to ensure that progress isn’t made in isolation?
  • How will concerns be acted upon?
  • What will the process be for taking the data collaborative in a new direction or abandoning it should it prove ineffective?

By phase five, all parties should have a clear understanding of how the arrangement will work, the key risks and strategies for addressing them, and the processes that will ensure the data collaborative runs as expected. At the implementation stage, the data collaborative is launched, but only after expectations, roles, timeline and cost questions are answered. As with any product launch, the creation of a data collaborative requires thinking about what will happen, when, by whom and in what order. To cross the chasm from idea to implementation and, for example, to persuade others to take the steps necessary to embark on a new plan of action, you must be able to draft an implementation memo that lays out the steps of the project from data sharing to use to analysis to evaluation. The memo also needs to address the resources that need to be secured for the data collaborative to become sustainable. This goes beyond mere financial budgeting to include the physical, human and data assets and the cultural conditions needed to be successful.

Step 11: Agree upon expectations, roles, responsibilities, timeline and operational specifics of data-sharing process

Discussion

Upon completing the preceding steps, stakeholders in the data collaborative should possess a clear understanding of how the arrangement can be put into practice. With this upfront knowledge, the operational aspects of an effective data collaborative can be defined – with the understanding that specifics can and should be iterated upon as needed going forward.

Questions & Considerations

  • What are the expectations of each of the entities, and how does the initiative align with the business (or, if relevant, philanthropic) mission of the participating private sector entity?
  • What is the management structure for the data collaborative? Who is responsible for stewarding the process?
  • Who are the champions in each participating entity?
  • What skills are required within the governing institution?
  • What information or data will be made open to the public?
  • Are there other partners that need to be engaged?

Step 12: Determine Resources: Cost and Funding Models

Discussion

Establishing a data collaborative is often less expensive than creating the mechanisms to generate and collect data that is already held elsewhere, but there are often costs involved – including human capital costs for data scientists and stewards. An upfront and realistic assessment of the likely costs of the arrangement can inform strategic funding decisions – whether a tiered pricing model for a B2C, B2B or B2G data collaborative or seeking support from philanthropic and/or governmental grantmakers.

Questions & Considerations

  • Determine cost implications
  • Identify funding partners/models

Enablers

  • Explore mixed funding approaches to provide more opportunities and avenues for ensuring longer-term sustainability.

In order to engage the intended beneficiaries for the data collaborative (where relevant) and/or the communities who could help to promote the effort (e.g., media), participants need to create a multi-faceted communications strategy and approach for disseminating information on an ongoing basis.

Step 13: Develop a communications strategy

Discussion

Especially for data collaboratives where corporate data providers were incentivized to participate based on reputational benefits, communicating the objectives and (intended) impacts of the arrangement to the public can be important. In many cases, a high level of specificity in public communications may not be desirable, but promoting the existence of the data collaborative can spur interest and engagement among target audiences (including funders).

Questions & Considerations

  • What are the key messages that should be communicated to the public regarding the data collaborative?
  • Should communication be ongoing throughout the initiative’s lifespan, or primarily at launch and at the impact assessment stage?

Step 14: Determine audience and information sharing approach

Discussion

Some data collaboratives are primarily or exclusively focused on improving the data capacity of participating institutions. Others, however, have additional, external audiences or user groups. Clearly defining the audience(s) and their needs can enable stakeholders to craft an information sharing approach that is well-suited to maximizing the usefulness of newly created data-driven offerings.

Questions & Considerations

  • Who are the intended beneficiaries or user groups for the data collaborative?
  • Do the different user groups exist in different sectors?
  • Do the different user groups have divergent needs or introduce divergent challenges?
  • Is there a particular user group that should be prioritized?
  • What are the communication channels through which the intended audiences can best be reached?

Meaningfully measuring success and impact will be key for maintaining the data collaborative over a longer timeframe, accessing new or maintaining existing funding, and ensuring that the current mix of variables and decisions yields the optimal approach for the data collaborative.

Step 15: Define a common baseline against which to measure progress

Discussion

Building on the work done at the problem definition and data audit stage, defining the baseline of current practice will ensure that the impact of a data collaborative (or the lack thereof) can be meaningfully assessed. Without an understanding of the effectiveness of current efforts to address the problem, measuring success and iterating on new data practices will be challenging.

Questions & Considerations

  • What indicators are most representative of the issues the data collaborative is meant to assess?
  • Are any externally held datasets available (e.g., open data) to help understand the current problem baseline?

Enablers

  • The initial data audit conducted during phase 2 should have uncovered some useful baseline data

Step 16: Measure progress against defined, agreed-upon metrics of success

Discussion

In order to measure progress throughout the lifespan of the data collaborative, ensure that mechanisms are in place for the consistent generation of data enabling assessment against the baseline. While much of the work of impact assessment is done at the start or conclusion of such an initiative, upfront efforts to create or gain access to data about progress throughout can help to inform iteration and improve the likelihood of success.

Questions & Considerations

  • Does the data collaborative appear to be having an impact in intended areas?
  • Do the metrics uncover any unanticipated impacts to date?
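
Assessment against the baseline can be illustrated with a short calculation. The Python sketch below is one possible approach, not a method the canvas specifies: the `progress_against_baseline` helper, the indicator names, and all values are hypothetical. For each indicator it reports the share of the distance from the baseline to an agreed target that has been covered so far.

```python
def progress_against_baseline(baseline, current, target):
    """Return the fraction of the baseline-to-target distance covered so far.

    Works for indicators that should rise and for those that should fall.
    A value of 0.0 means no change from baseline and 1.0 means target reached.
    """
    if target == baseline:
        raise ValueError("target must differ from baseline")
    return (current - baseline) / (target - baseline)


# Hypothetical indicators: (baseline, current, target).
indicators = {
    "median_response_hours": (40.0, 32.0, 24.0),  # lower is better
    "households_reached": (1000, 1500, 3000),     # higher is better
}

progress = {
    name: progress_against_baseline(*values)
    for name, values in indicators.items()
}

for name, share in progress.items():
    print(f"{name}: {share:.0%} of the way from baseline to target")
```

A negative result would indicate movement away from the target, which is one way unanticipated impacts might surface in the metrics.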

Step 17: Impact assessment

Discussion

After an agreed-upon period of time, stakeholders should conduct a detailed impact assessment to determine the real-world impacts of the data collaborative.

Step 18: Iterate as needed