Data validation for Partners
Overview
We are introducing a new data staging process, enabling both developers and owners to verify and approve datasets before delivery.
The system will facilitate error handling, validation checks, and customization, ensuring accuracy and reliability.
It will help everyone save time, reduce the number of open tickets, and maintain the level of quality we want.
How does it work?
Once the dataset snapshot is ready, we run its validation tests.
If all tests pass, the customer gets the dataset snapshot with an indication on the CP that all tests passed.
If any tests fail, the partner reviews the issues and chooses one of the following:
- To fix the failed tests
- To deliver the snapshot as it is to the customer (overriding the failed test, with an explanation of why it failed).
The customer can then decide (per failed test or as a batch) to:
- Accept it as it is
- Accept it as it is for this specific snapshot only
- Reject it and return it to the partner to fix what needs to be fixed
Once the customer approves, the snapshot goes to the delivery phase.
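The flow above can be sketched as a small state machine. This is only an illustration: the state names and the `next_state` function are hypothetical, and it assumes that a snapshot with no failed tests proceeds straight to delivery and that fixing failed tests leads to the validation tests being run again.

```python
from enum import Enum


class SnapshotState(Enum):
    VALIDATING = "validating"
    PARTNER_REVIEW = "partner_review"
    CUSTOMER_REVIEW = "customer_review"
    DELIVERY = "delivery"


def next_state(state, *, all_passed=False, partner_overrides=False,
               customer_approves=False):
    """Return the next state of a snapshot in the staging flow (hypothetical model)."""
    if state is SnapshotState.VALIDATING:
        # Assumption: a snapshot with no failed tests goes straight to delivery.
        return SnapshotState.DELIVERY if all_passed else SnapshotState.PARTNER_REVIEW
    if state is SnapshotState.PARTNER_REVIEW:
        # The partner either overrides (customer decides) or fixes (tests run again).
        if partner_overrides:
            return SnapshotState.CUSTOMER_REVIEW
        return SnapshotState.VALIDATING
    if state is SnapshotState.CUSTOMER_REVIEW:
        # A rejection returns the snapshot to the partner for fixes.
        if customer_approves:
            return SnapshotState.DELIVERY
        return SnapshotState.PARTNER_REVIEW
    return state  # DELIVERY is terminal in this sketch
```

For example, `next_state(SnapshotState.CUSTOMER_REVIEW, customer_approves=False)` returns `SnapshotState.PARTNER_REVIEW`, which mirrors the "reject and return to the partner" option.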
The validation tests
Uniqueness
Ensures a minimum percentage of unique values in the dataset.
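As a rough illustration of how a uniqueness check could be computed (not the platform's actual implementation), the sketch below measures the percentage of distinct values among the non-empty values of a field. The function names and the choice to ignore empty values are assumptions.

```python
def uniqueness_rate(values):
    """Percentage of distinct values among the non-empty values."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return 100.0 * len(set(present)) / len(present)


def passes_uniqueness(values, min_percent):
    """True when the share of unique values meets the configured minimum."""
    return uniqueness_rate(values) >= min_percent
```

With the values `["a", "b", "b", "c"]`, three of the four entries are distinct, so the rate is 75% and a minimum of 80% would fail.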
Filling Rate
Mandates a minimum percentage for filled values.
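A filling-rate check could look roughly like the sketch below. It assumes records are plain dictionaries and that `None` and the empty string count as "not filled"; both the function name and that definition of empty are illustrative.

```python
def filling_rate(records, field):
    """Percentage of records in which `field` holds a non-empty value."""
    if not records:
        return 0.0
    # Assumption: None and "" are empty; 0 and False still count as filled.
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return 100.0 * filled / len(records)
```

A field passes when `filling_rate(records, field)` is at least the configured minimum percentage.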
Persistence Validation
Makes a field mandatory once filled; triggers an error if left empty afterward.
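One possible reading of this rule is a per-record comparison between the previous snapshot and the current one. The sketch below follows that reading; the `id` key used to match records and the function name are assumptions made for illustration.

```python
def persistence_errors(previous, current, field, key="id"):
    """Return keys of records whose `field` was filled before but is empty now."""
    def is_filled(record):
        return record.get(field) not in (None, "")

    # Records that had a value for the field in the previous snapshot.
    previously_filled = {r[key] for r in previous if is_filled(r)}
    return sorted(
        r[key] for r in current
        if r[key] in previously_filled and not is_filled(r)
    )
```

A record that was never filled does not trigger an error; only a field that had a value and then lost it does.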
Data Stability
A numeric value must not change by more than X compared to its previous values.
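A stability check of this kind might be sketched as follows. How "previous values" are summarized is not specified here, so this example assumes the new value is compared against the mean of the previous values; the function name and that choice are hypothetical.

```python
from statistics import mean


def is_stable(new_value, previous_values, max_change):
    """True when new_value is within max_change of the mean of previous values."""
    if not previous_values:
        return True  # nothing to compare against yet
    return abs(new_value - mean(previous_values)) <= max_change
```

For previous values `[100, 102, 98]` (mean 100) and a limit of 10, a new value of 105 passes and 130 fails.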
Type Verification
Checks each entry’s data type against its field type (e.g., string, number, date) to ensure integrity and flag mismatches for correction before processing.
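To make the idea concrete, here is a minimal sketch of a type check over one record. The schema format (field name mapped to `"string"`, `"number"`, or `"date"`) and the function name are assumptions, not the platform's actual schema definition.

```python
from datetime import date

# Hypothetical mapping from field-type names to Python types.
FIELD_TYPES = {"string": str, "number": (int, float), "date": date}


def type_mismatches(record, schema):
    """Return the names of fields whose value does not match the declared type."""
    mismatches = []
    for field, type_name in schema.items():
        value = record.get(field)
        if value is None:
            continue  # emptiness is left to the filling-rate test
        expected = FIELD_TYPES[type_name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            mismatches.append(field)
    return mismatches
```

In this sketch a price stored as the string `"19.99"` is flagged as a mismatch against a `"number"` field.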
Schema and Field Custom Validation
Establish a custom rule to validate if the specific field exists and the field value is valid, such as requiring the size string to be ‘S’, ‘M’, or ‘L’; any other value is considered an error.
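The size example can be expressed as a small rule function. This is only a sketch of what such a rule might look like; the function name and the error messages are illustrative.

```python
ALLOWED_SIZES = ("S", "M", "L")


def validate_size(record):
    """Return an error message, or None when the record passes the rule."""
    if "size" not in record:
        return "field 'size' is missing"
    if record["size"] not in ALLOWED_SIZES:
        return f"invalid size: {record['size']!r}"
    return None
```

Here `validate_size({"size": "XL"})` reports an invalid value, while `validate_size({"size": "M"})` returns `None`.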
Minimum Records Threshold
Requires a minimum of X records in the initial dataset (for URL-based inputs, the minimum of X records is measured against the total number of URL inputs).
Data Size Fluctuation Threshold
Validates that fluctuations in data size stay within a ±X% range.
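A fluctuation check could be computed as in the sketch below. It assumes that "size" is measured as a record count and that the baseline is the previous snapshot; both assumptions, and the function names, are for illustration only.

```python
def size_change_percent(previous_count, current_count):
    """Signed percentage change in size relative to the previous snapshot."""
    if previous_count == 0:
        raise ValueError("previous snapshot is empty; change is undefined")
    return 100.0 * (current_count - previous_count) / previous_count


def within_fluctuation(previous_count, current_count, max_percent):
    """True when the size change stays within +/- max_percent."""
    return abs(size_change_percent(previous_count, current_count)) <= max_percent
```

Going from 1000 to 1100 records is a 10% increase and passes a ±10% threshold; dropping to 850 records is a 15% decrease and fails it.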
Main components and functionality
Clicking a specific snapshot to review it opens the following components:
Dataset test view
The dataset test view has three filter options: All results, Passed, and Failed.
Reparse button
Ignore button
“Show related records” and “View & Edit code”
What is the difference between Reparse and Recrawl?
Reparse:
Reparse involves reprocessing the existing raw data that has already been collected. It applies the parsing logic, rules, or transformations to the data without fetching fresh data from the web. Reparse is useful when changes are needed in data structure, new fields are added to parsing logic, or existing data needs to be restructured or re-extracted.
Recrawl:
Recrawl involves revisiting the source website to collect new or updated data. It fetches fresh data directly from the source to reflect recent changes or additions on the website. Recrawling is useful when the content on the source website changes frequently, or when up-to-date data is critical for analysis or reporting.
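The difference can be illustrated with a toy collector. Everything in this sketch is hypothetical (the `Collector` class, the `fetch` callable, and the title-only parser); it only shows that reparse works on stored raw data while recrawl goes back to the source.

```python
def parse(raw_html):
    """Toy parser: extract the text between <title> tags."""
    start = raw_html.index("<title>") + len("<title>")
    return {"title": raw_html[start:raw_html.index("</title>")]}


class Collector:
    def __init__(self, fetch):
        self.fetch = fetch    # callable that retrieves a page from the source
        self.raw_store = {}   # raw pages saved by the last crawl

    def recrawl(self, url):
        """Fetch fresh data from the source, store it, then parse it."""
        self.raw_store[url] = self.fetch(url)
        return parse(self.raw_store[url])

    def reparse(self, url):
        """Re-run the parser on stored raw data without contacting the source."""
        return parse(self.raw_store[url])
```

If the page changes after the first crawl, `reparse` still returns results based on the stored copy, and only `recrawl` picks up the new content.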
Reparse or Recrawl buttons
Rerun test
This allows you to run the validation tests again if needed.
IDE button
Deliver Dataset button
Rejected tests
If the owner does not accept all of the failed tests, the unaccepted issues are sent back to you and marked with the “Rejected” label, so you can fix them and resend the snapshot to the owner for approval.
Communications and notifications
Status changes and additional notifications appear under the megaphone icon on the CP.
Tickets and bugs
With the staging process in place, fixing a collector is no longer the end of a bug or issue.
The process consists of two steps:
- Fix the collector
- Fix the snapshot
Hence, the bug processing workflow will be amended to align with the new 2-step process.
Changes to the flow:
We are removing the “Resolve” button from the options on the ticket.