A Framework for Using AI Responsibly with Federal Datasets
January 2026
AI can enhance data quality and integration, helping federal agencies prepare complex datasets for analysis while keeping transparency and trust.
Over the past year, NORC has been exploring how artificial intelligence (AI) can help streamline the preparation of high-quality data for use across the federal statistical system. Our exploration is part of NORC’s broader portfolio of work for the National Secure Data Service Demonstration project managed by the National Center for Science and Engineering Statistics.
Many federal agencies rely on data, including administrative records and geospatial data, that offer valuable and timely information. Because these data are collected for reasons other than to produce statistical estimates, they require extensive processing before they can be used effectively.
To guide our work, we developed a framework plan that identifies promising opportunities for AI to support data standardization, integration, and quality improvement. Drawing from expert interviews and a comprehensive literature scan, the plan highlights areas where AI can add value and where new tools are needed to reduce the time and effort required to prepare data for analysis.
AI can unlock metadata and make data easier to find and use.
One of the most exciting opportunities we’ve identified so far is the potential for AI to support metadata creation and standardization, which are essential for making data easier to discover, interpret, and use. Incomplete or inconsistent metadata can make it difficult for analysts to assess data quality or understand variable definitions.
Large language models (LLMs) can add value to human expertise by identifying missing documentation, suggesting variable labels, and generating descriptive metadata. They can also enrich existing codebooks and standardize metadata across datasets by aligning formats and schemas, making it easier to integrate data from different sources. These capabilities are especially valuable when working with datasets that lack sufficient documentation.
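As one illustration of this human-in-the-loop pattern, the sketch below assembles a prompt asking an LLM to propose a label and definition for an undocumented variable. The column name, sample values, and the `generate_label_prompt` helper are hypothetical, and the actual model call is intentionally left out: in practice an analyst would send the prompt to an LLM of their choice and review the suggestion before it enters the codebook.

```python
def generate_label_prompt(column_name, sample_values):
    """Build a prompt asking an LLM to propose metadata for one variable.

    Illustrative helper only; the model call itself is omitted so that
    suggested labels always pass through analyst review.
    """
    samples = ", ".join(repr(v) for v in sample_values[:5])
    return (
        f"You are documenting a federal dataset. The column '{column_name}' "
        f"contains sample values: {samples}. Suggest a short descriptive "
        "label and a one-sentence definition, and note any coding scheme "
        "you can infer. Mark uncertain inferences for analyst review."
    )

# Hypothetical undocumented column from an administrative file
prompt = generate_label_prompt("emp_stat", ["1", "2", "1", "9", "2"])
print(prompt)
```

Keeping the prompt construction separate from the model call makes it easy to log exactly what was asked, which supports the transparency goals discussed below.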
Another promising area is data integration. Federal agencies often work with datasets collected across jurisdictions or over time, and these datasets may not align. Variables may be named differently, coded inconsistently, or formatted in incompatible ways.
For example, municipalities may publish crime statistics using different formats; some may use text fields for ZIP codes, while others use numeric formats. These inconsistencies make it difficult to compare or combine datasets. AI can assist with harmonization across sources by generating code to standardize fields and flag inconsistencies for review. These capabilities can help analysts spend less time cleaning data and more time generating insights.
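The ZIP code scenario above can be sketched in a few lines. This is a minimal, hypothetical example of the kind of standardization code an AI assistant might generate for review, not a production routine: the input records are invented, and a real pipeline would log each transformation for an analyst rather than applying it silently.

```python
def standardize_zip(value):
    """Return a 5-digit ZIP string, or None if the value is unusable."""
    if value is None:
        return None
    s = str(value).strip().split("-")[0]  # drop any ZIP+4 suffix
    if not s.isdigit():
        return None                       # non-numeric entry: flag for review
    return s.zfill(5)                     # restore leading zeros lost when
                                          # ZIPs were stored as numbers

# Hypothetical mixed-format values from different municipalities
mixed = [2138, "02138", "60637-1468", " 94103 ", "N/A"]
print([standardize_zip(z) for z in mixed])
# ['02138', '02138', '60637', '94103', None]
```

Note that the numeric entry `2138` is restored to `"02138"`: losing leading zeros is a common side effect of storing ZIP codes in numeric fields, and is exactly the kind of inconsistency that quietly breaks joins across sources.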
AI has the potential to reshape federal statistics through responsible design.
Through our research, we surfaced several high-impact opportunities for AI to support data preparation across the federal statistical system. Our interviews and literature scan identified recurring challenges such as missing metadata, inconsistent formatting, and inconsistently defined variables that make it difficult to reuse and integrate data.
These challenges are especially pronounced for nontraditional data sources like administrative records and geospatial data. Our team is now developing a suite of AI tools for a future National Secure Data Service to address these needs, with capabilities that include metadata extraction, data harmonization, and quality assessment. NORC is designing these tools to be flexible, transparent, and responsive to the real-world workflows of federal data users.
One of the key takeaways from our expert interviews was the importance of building AI systems that allow users to understand how outputs are generated and to intervene when needed. Rather than automating decisions, the new tools will assist analysts by flagging potential issues, suggesting transformations, and surfacing metadata while leaving room for review and adjustment.
For example, several interviewees described challenges working with tabular datasets that lacked variable-level documentation. In these cases, analysts had to manually infer variable meanings, formats, and coding schemes. Our team’s tools will use AI to extract metadata directly from the data by identifying variable types, suggesting labels, and flagging inconsistencies while allowing users to review and refine the output.
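The kind of inference described above can be sketched simply. The example below profiles a single column, guessing a coarse type and flagging issues for analyst review; the column data, missing-value codes, and heuristics are illustrative assumptions, not the methodology of the tools under development.

```python
def profile_column(values):
    """Infer a coarse type for one column and flag inconsistencies.

    Illustrative heuristics only; outputs are suggestions for an
    analyst to review, not automatic decisions.
    """
    # Hypothetical missing-value codes; real data dictionaries vary
    nonmissing = [v for v in values if v not in ("", None, "NA")]
    numeric = [
        v for v in nonmissing
        if str(v).replace(".", "", 1).lstrip("-").isdigit()
    ]
    flags = []
    if 0 < len(numeric) < len(nonmissing):
        flags.append("mixed numeric/text values")
    if len(nonmissing) < len(values):
        flags.append("missing values present")
    if nonmissing and len(numeric) == len(nonmissing):
        # Few distinct numeric values often indicate a coding scheme
        inferred = ("numeric (possibly coded categories)"
                    if len(set(nonmissing)) <= 10 else "numeric")
    else:
        inferred = "text"
    return {"type": inferred, "flags": flags}

# A hypothetical undocumented column mixing codes and free text
print(profile_column(["1", "2", "2", "NA", "refused"]))
```

Surfacing flags like these, rather than silently recoding, is what keeps the analyst in control of how each inconsistency is resolved.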
This approach improves metadata quality and ensures that analysts remain in control of the process. The human-in-the-loop design is central to our vision for responsible and trustworthy AI in the federal statistical system.
Transparency and user controls are essential as AI technology evolves.
We believe AI has the potential to reshape how data are documented, standardized, and prepared for use. By focusing on metadata extraction, data harmonization, and quality assessment, our team is addressing some of the most persistent challenges faced by data analysts. Realizing this potential requires transparency, explainability, and robust documentation. Users must be able to understand how AI-generated outputs were produced, what limitations exist, and how those outputs may evolve as underlying models change.
Our framework plan lays the foundation, and the tools our team is building are designed to grow with the technology while remaining grounded in the needs of federal data users. We’re excited to share more as development continues and to help shape a future where AI can enhance human expertise in the federal statistical system.
Main Takeaways
- NORC’s framework plan identifies high-impact opportunities for AI to improve data quality, standardization, and integration across the federal statistical system.
- AI can enhance metadata creation and documentation, making datasets easier to discover, interpret, and reuse, especially when working with incomplete or inconsistent metadata.
- AI supports data harmonization and integration, helping analysts align variables, standardize formats, and reduce time spent on cleaning and preprocessing.
- NORC is developing a suite of AI tools focused on metadata extraction, data harmonization, and quality assessment, designed to be transparent, flexible, and user-guided.
- Human oversight remains essential, and tools are being built with a human-in-the-loop approach to ensure transparency, explainability, and responsible use as AI technologies evolve.
Suggested Citation
Lafia, S. & Seeskin, Z. (2026, January 26). A Framework for Using AI Responsibly with Federal Datasets. [Web blog post]. NORC at the University of Chicago. Retrieved from www.norc.org.