Training Data Inventory

Data Inventory and Rights

Leena AI maintains an inventory of the data used to fine-tune its model and has implemented processes to ensure that appropriate rights and permissions are obtained for all data used in training and fine-tuning.

The approach draws on three distinct data sources, each with its own rights and permissions considerations:

Data Sources

1. Manually Curated Internal Data

  • Source: Internal team.
  • Time frame: Collected over the year.
  • Rights consideration: As this data is generated internally, Leena AI likely has full rights to its use.

2. Public Data with Permissible Licenses

  • Source: Public datasets.
  • Focus: Specific to use cases Leena AI addresses.
  • Rights consideration: Leena AI uses data released under licenses that permit its intended use, which supports legal compliance.

3. Synthetic Data

  • Source: Generated from larger open-source models.
  • Rights consideration: Because this data is generated by models released under the Apache License rather than collected from real-world sources, it avoids many of the copyright issues associated with real-world data.
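
As an illustration only, the three source categories above could be recorded in an inventory along the following lines. This is a minimal sketch and not a description of Leena AI's actual system: the class names, field names, dataset names, and the fourth "unreviewed" entry are all hypothetical and were chosen only to show how a rights basis might be tracked per data source.

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    """The three source categories described above."""
    INTERNAL = "manually curated internal data"
    PUBLIC = "public data with permissible licenses"
    SYNTHETIC = "synthetic data"


@dataclass(frozen=True)
class InventoryEntry:
    """One inventory row. All field names are hypothetical."""
    name: str                 # illustrative dataset identifier
    source_type: SourceType   # which of the three categories applies
    origin: str               # who or what produced the data
    rights_basis: str         # why use for fine-tuning is believed to be permitted


def entries_missing_rights(entries):
    """Return the entries whose rights basis has not been recorded."""
    return [e for e in entries if not e.rights_basis.strip()]


# Hypothetical entries, one per category, plus one incomplete entry
# included only to show what the check reports.
inventory = [
    InventoryEntry("internal-curated", SourceType.INTERNAL,
                   "internal team", "generated internally"),
    InventoryEntry("public-use-case-set", SourceType.PUBLIC,
                   "public dataset", "license permits the intended use"),
    InventoryEntry("synthetic-set", SourceType.SYNTHETIC,
                   "larger open-source model", "generating model is Apache-licensed"),
    InventoryEntry("unreviewed-example", SourceType.PUBLIC,
                   "public dataset", ""),
]

for entry in entries_missing_rights(inventory):
    print(f"{entry.name}: rights basis not recorded")
```

Keeping the rights basis as a required field on every entry means that a gap shows up as an empty value that can be queried, rather than as an omission that is easy to overlook.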

Implications

  • Comprehensive Data Management: By maintaining an inventory of all training and fine-tuning data, Leena AI demonstrates a commitment to responsible AI development and data governance.
  • Legal Compliance: The use of internally generated data, publicly licensed data, and synthetic data suggests a careful approach to avoiding potential copyright or licensing issues.
  • Diverse Data Sources: The combination of internal, public, and synthetic data likely provides a rich and varied dataset for training and fine-tuning, potentially leading to a more robust and versatile AI model.
  • Transparency: Having a complete inventory allows for greater transparency in the AI development process, which can be crucial for audits, regulatory compliance, or client inquiries.
  • Quality Control: An inventory system enables better tracking and management of data quality, allowing for easier identification and correction of any issues in the training data.
  • Ethical Considerations: By carefully sourcing and tracking its data, Leena AI is better positioned to address ethical concerns related to data privacy and responsible AI use.
  • Iterative Improvement: Maintaining a detailed inventory facilitates ongoing refinement of the model, as the team can easily identify which data sources contribute most effectively to model performance.