'Training Data' Collection Process
Data Rights and Permissions
Leena AI has implemented processes to ensure appropriate rights and permissions are obtained for the data used in training and fine-tuning their model. Their approach involves using three distinct data sources, each with its own considerations for rights and permissions.
Data Sources
1. Manually Curated Internal Data
- Source: Internal team.
- Time frame: Collected over the year.
- Rights consideration: As this data is generated internally, Leena AI likely has full rights to its use.
2. Public Data Under Permissive Licenses
- Source: Public datasets.
- Focus: Specific to the use cases Leena AI addresses.
- Rights consideration: Leena AI uses only data released under permissive licenses, so that its intended use stays within the license terms.
3. Synthetic Data
- Source: Generated from larger open-source models.
- Rights consideration: Because this data is generated by models released under the Apache License, it avoids many of the copyright issues associated with real-world data.
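As an illustration of the license check described for public datasets, the sketch below keeps only datasets whose declared license appears on an allowlist. It is a minimal example under stated assumptions, not Leena AI's actual process: the allowlist contents, the `license` metadata field, and the `filter_by_license` function are all hypothetical.

```python
# Hypothetical allowlist of SPDX license identifiers treated as acceptable.
# Which licenses qualify is a policy decision; this set is only an example.
ALLOWED_LICENSES = {"apache-2.0", "mit", "bsd-3-clause", "cc0-1.0", "cc-by-4.0"}


def filter_by_license(datasets, allowed=ALLOWED_LICENSES):
    """Return only the datasets whose declared license is on the allowlist.

    Each dataset is assumed to be a dict with an optional "license" key.
    Datasets with a missing or unrecognized license are excluded.
    """
    kept = []
    for dataset in datasets:
        license_id = (dataset.get("license") or "").strip().lower()
        if license_id in allowed:
            kept.append(dataset)
    return kept


candidates = [
    {"name": "hr-faq", "license": "Apache-2.0"},
    {"name": "scraped-forum", "license": None},
    {"name": "policy-docs", "license": "CC-BY-NC-4.0"},
    {"name": "ticket-intents", "license": "MIT"},
]
approved = filter_by_license(candidates)
approved_names = [d["name"] for d in approved]
```

Excluding datasets with no declared license is the conservative choice here: the absence of a license does not grant permission to use the data.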
Data Processing
Leena AI has implemented a data pipeline that includes crucial steps to protect privacy and ensure data quality:
- Cleaning: This step likely involves removing irrelevant or low-quality data points.
- Anonymization: This process helps protect individual privacy by removing or obscuring personally identifiable information.
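The two steps above can be sketched as a small pipeline. This is a simplified illustration, not Leena AI's implementation: the function names, the minimum-length threshold, and the regular expressions are assumptions, and pattern matching of this kind catches only simple email addresses and phone numbers. A production anonymizer would need to cover names, addresses, IDs, and other identifiers.

```python
import re

# Illustrative patterns only; they do not cover every email or phone format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def clean(records, min_chars=20):
    """Normalize whitespace, then drop very short and duplicate records."""
    seen = set()
    kept = []
    for text in records:
        normalized = " ".join(text.split())
        if len(normalized) < min_chars:
            continue  # too short to be a useful training example
        key = normalized.lower()
        if key in seen:
            continue  # case-insensitive duplicate
        seen.add(key)
        kept.append(normalized)
    return kept


def anonymize(text):
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def run_pipeline(records):
    """Clean first, then anonymize what remains."""
    return [anonymize(text) for text in clean(records)]


raw = [
    "  How do I   apply for parental leave?  ",
    "how do i apply for parental leave?",
    "ok",
    "Contact jane.doe@example.com or +1 415-555-0100 for payroll help.",
]
processed = run_pipeline(raw)
```

Cleaning runs before anonymization in this sketch so that the more expensive redaction step only processes records that will be kept.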
