Training Data Collection Process

Data Rights and Permissions

Leena AI has processes in place to ensure that appropriate rights and permissions are obtained for the data used to train and fine-tune its model. The data comes from three distinct sources, each with its own rights and permissions considerations.


Data Sources

1. Manually Curated Internal Data

  • Source: Internal team.
  • Time frame: Collected over the year.
  • Rights consideration: As this data is generated internally, Leena AI likely has full rights to its use.

2. Public Data with Permissive Licenses

  • Source: Public datasets.
  • Focus: Specific to use cases Leena AI addresses.
  • Rights consideration: Leena AI uses only data whose licenses permit the intended use, which supports legal compliance.

3. Synthetic Data

  • Source: Generated from larger open-source models.
  • Rights consideration: Because this data is generated by models released under the Apache License, it avoids many of the copyright issues associated with real-world data.
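
The license check described for the public datasets (source 2 above) could look something like the following minimal Python sketch. The allowlist, the record fields, and the dataset names are all hypothetical; the source does not say which licenses Leena AI accepts or how the check is actually performed.

```python
# Hypothetical allowlist: the source does not name the accepted licenses.
ALLOWED_LICENSES = {"apache-2.0", "mit", "cc-by-4.0"}


def filter_by_license(datasets):
    """Keep only datasets whose declared license is on the allowlist."""
    kept = []
    for ds in datasets:
        # Treat a missing license as "not allowed" rather than guessing.
        license_id = (ds.get("license") or "").strip().lower()
        if license_id in ALLOWED_LICENSES:
            kept.append(ds)
    return kept


# Illustrative records only; these are not real Leena AI datasets.
candidates = [
    {"name": "hr-faq", "license": "Apache-2.0"},
    {"name": "scraped-forum", "license": None},
    {"name": "it-tickets", "license": "CC-BY-NC-4.0"},
]

approved = filter_by_license(candidates)
print([ds["name"] for ds in approved])
```

Defaulting to rejection when the license is missing or unrecognized keeps the check conservative: a dataset is admitted only when its license is explicitly known to allow the intended use.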

Data Processing

Leena AI has implemented a data pipeline that includes crucial steps to protect privacy and ensure data quality:

  • Cleaning: This step likely involves removing irrelevant or low-quality data points.
  • Anonymization: This process helps protect individual privacy by removing or obscuring personally identifiable information.
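
The two steps above can be illustrated with a minimal Python sketch. It is not Leena AI's actual pipeline: the length threshold, the deduplication rule, the regular expressions, and the placeholder tokens are all assumptions chosen for illustration, and real anonymization typically needs broader coverage (names, addresses, IDs) than simple patterns provide.

```python
import re

# Assumed patterns for two common kinds of personally identifiable information.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def clean(records, min_chars=10):
    """Normalize whitespace, drop very short records, and drop duplicates."""
    seen = set()
    cleaned = []
    for text in records:
        normalized = " ".join(text.split())
        if len(normalized) < min_chars:
            continue  # too short to be useful (hypothetical threshold)
        key = normalized.lower()
        if key in seen:
            continue  # case-insensitive duplicate
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


def anonymize(text):
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


# Illustrative inputs only.
raw = [
    "How do I reset my password?   Contact jane.doe@example.com",
    "how do i reset my password? contact jane.doe@example.com",
    "ok",
    "Call the help desk at +1 (555) 010-9999 for leave requests.",
]

processed = [anonymize(t) for t in clean(raw)]
for line in processed:
    print(line)
```

In this sketch cleaning runs before anonymization so that duplicates are detected on the original text; a pipeline could equally anonymize first if raw data should be handled as little as possible.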