Methodology
Let's set the stage with some example code (from Getting Started):
In general, Glass Onion takes a list of SyncableContent and uses the logic in a SyncEngine to sync one pair at a time. The results of all pairs are then merged together and deduplicated. Each object type corresponds to a subclass of SyncEngine that overrides synchronize_pair() to define how pairs are synchronized in synchronize(), which contains wrapper logic for the entire process.
There are three distinct layers within synchronize()'s wrapper logic:

- The aforementioned sync process, which results in a dataframe of synced identifiers. How each object type is handled is described below.
- Collect remaining unsynced rows and run the sync process on those. Append any newly synced rows to the result dataframe from Layer 1.
- Append any remaining unsynced rows to the bottom of the result dataframe.
This result dataframe is then deduplicated: by default, the result dataframe is grouped by the specific columns defined in SyncEngine and the first non-null result is selected for each data provider's identifier field.
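The deduplication step can be sketched as follows. This is a minimal illustration rather than Glass Onion's actual implementation: it uses plain lists of dictionaries in place of dataframes, and the function name first_non_null and the column names in the usage note (provider_a_match_id, provider_b_match_id) are hypothetical.

```python
def first_non_null(rows, group_columns, id_columns):
    """Group rows by group_columns and keep the first non-null value
    seen for each identifier column (hypothetical helper)."""
    grouped = {}
    for row in rows:
        key = tuple(row.get(column) for column in group_columns)
        if key not in grouped:
            # Start a merged row: group columns filled in, identifiers empty.
            merged = {column: row.get(column) for column in group_columns}
            merged.update({column: None for column in id_columns})
            grouped[key] = merged
        merged = grouped[key]
        for column in id_columns:
            # Only fill an identifier that has not been set yet.
            if merged[column] is None and row.get(column) is not None:
                merged[column] = row[column]
    return list(grouped.values())
```

For example, two rows that share a match_date but each carry only one of provider_a_match_id and provider_b_match_id collapse into a single row holding both identifiers.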
Match
NOTE: Match synchronization can also be done using competition context (i.e., columns competition_id and season_id, which are assumed to already be synchronized across providers) via use_competition_context (more details on use_competition_context in MatchSyncEngine.init() and the concept of "higher-order" object types on our home page).
- Attempt to join pair using match_date, home_team_id, and away_team_id.
- Account for matches with different dates across data providers (timezones, TV scheduling, etc.) by adjusting match_date in one dataset in the pair by -3 to 3 days, then attempting synchronization using match_date, home_team_id, and away_team_id again. This process is then repeated for the other dataset in the pair.
- Account for matches postponed to a different date outside the [-3, 3] day range by attempting synchronization using matchday, home_team_id, and away_team_id.
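The three match steps above can be sketched roughly as below. This is an illustration under stated assumptions, not MatchSyncEngine's code: matches are plain dictionaries with a hypothetical match_id key, the function name sync_matches is invented, and trying smaller date shifts before larger ones is an assumed ordering.

```python
from datetime import date, timedelta


def sync_matches(left, right, max_shift_days=3):
    """Map left match_id -> right match_id using the three fallbacks in order."""
    pairs = {}          # left match_id -> right match_id
    used_right = set()  # right match_ids that are already paired

    def teams(match):
        return (match["home_team_id"], match["away_team_id"])

    def by_date(match, shift=timedelta(0)):
        return (match["match_date"] + shift, *teams(match))

    def by_matchday(match):
        return (match["matchday"], *teams(match))

    def join(left_key, right_key):
        # Index unpaired right rows by key, then look up unpaired left rows.
        index = {}
        for match in right:
            if match["match_id"] not in used_right:
                index.setdefault(right_key(match), match["match_id"])
        for match in left:
            if match["match_id"] in pairs:
                continue
            right_id = index.get(left_key(match))
            if right_id is not None and right_id not in used_right:
                pairs[match["match_id"]] = right_id
                used_right.add(right_id)

    # Step 1: exact join on match_date, home_team_id, away_team_id.
    join(by_date, by_date)

    # Step 2: shift dates by -3..3 days. For an exact-key join, shifting the
    # left dates by +d equals shifting the right dates by -d, so this
    # symmetric loop covers both datasets in the pair.
    for days in range(1, max_shift_days + 1):
        for sign in (-1, 1):
            shift = timedelta(days=sign * days)
            join(lambda match, s=shift: by_date(match, s), by_date)

    # Step 3: postponed matches, joined on matchday instead of match_date.
    join(by_matchday, by_matchday)
    return pairs
```

Each step only considers rows that earlier steps left unpaired, so a looser rule never overrides a stricter one.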
Team
NOTE: Team synchronization can also be done using competition context (i.e., columns competition_id and season_id, which are assumed to already be synchronized across providers) via use_competition_context (more details on use_competition_context in TeamSyncEngine.init() and the concept of "higher-order" object types on our home page).
- Attempt to join pair simply on team_name.
- With remaining records, attempt to match via cosine similarity using a minimum threshold of 75% similarity.
- For any remaining records, attempt to match via cosine similarity using no minimum similarity threshold.
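As a rough sketch of the team steps above: the text does not say how names are vectorized before cosine similarity is computed, so the character-trigram counts used here are an assumption, as are the function names and the greedy best-candidate pairing.

```python
import math
from collections import Counter


def ngram_counts(name, n=3):
    # Assumption: names are compared as padded, lowercased character trigrams.
    text = f" {name.lower().strip()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    va, vb = ngram_counts(a), ngram_counts(b)
    dot = sum(va[gram] * vb[gram] for gram in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def sync_teams(left_names, right_names, threshold=0.75):
    """Map left team_name -> right team_name using the three steps in order."""
    pairs = {}
    remaining = list(right_names)

    # Step 1: exact join on team_name.
    for name in left_names:
        if name in remaining:
            pairs[name] = name
            remaining.remove(name)

    # Step 2 uses the 75% threshold; step 3 accepts the best remaining
    # candidate regardless of its score.
    for minimum in (threshold, 0.0):
        for name in left_names:
            if name in pairs or not remaining:
                continue
            best = max(remaining, key=lambda other: cosine_similarity(name, other))
            if cosine_similarity(name, best) >= minimum:
                pairs[name] = best
                remaining.remove(best)
    return pairs
```

Because the last pass has no threshold, every left-hand name ends up paired as long as right-hand names remain, which is why the stricter passes run first.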
Player
NOTE: PlayerSyncEngine ignores syncable columns that have unreliable data (i.e., NULLs/NAs in jersey_number or birth_date). The process below describes the best-case scenario. Please set verbose_log=True when creating a PlayerSyncEngine instance to see the full synchronization process.
- Attempt to join pair using player_name with a minimum 75% cosine similarity threshold for player name. Additionally, require that jersey_number and team_id are equal for matches that meet the similarity threshold.
- Account for players with different birth dates across data providers (timezones, human error, etc.) by adjusting birth_date in one dataset in the pair by -1 to 1 days and/or swapping the month and day, then attempting synchronization using birth_date, team_id, and a combination of player_name and player_nickname. This process is then repeated for the other dataset in the pair.
- Attempt to join remaining records using combinations of player_name and player_nickname with a minimum 75% cosine similarity threshold for player name. Additionally, require that team_id is equal for matches that meet the similarity threshold.
- Attempt to join remaining records using "naive similarity": looking for normalized parts of one record's player_name (or player_nickname) that exist in another's. Additionally, require that team_id is equal for matches found via this method.
- Attempt to join remaining records using combinations of player_name and player_nickname with no minimum cosine similarity threshold. Additionally, require that team_id is equal.
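Two of the player steps above lend themselves to a short sketch: generating the alternative birth dates, and the "naive similarity" check. Both functions are hypothetical illustrations. In particular, the normalization shown here (stripping accents, lowercasing, splitting on spaces and hyphens) is an assumption about what "normalized parts" means, not PlayerSyncEngine's documented behavior.

```python
import unicodedata
from datetime import date, timedelta


def birth_date_candidates(birth_date):
    """Return the dates to try for one record: the original date shifted by
    -1 to 1 days, plus each of those with month and day swapped."""
    candidates = {birth_date + timedelta(days=offset) for offset in (-1, 0, 1)}
    for candidate in list(candidates):
        try:
            candidates.add(date(candidate.year, candidate.day, candidate.month))
        except ValueError:
            # The day cannot be used as a month (e.g. the 25th), so skip it.
            pass
    return candidates


def normalized_parts(name):
    # Assumed normalization: drop accents, lowercase, split into words.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return set(without_accents.lower().replace("-", " ").split())


def naive_match(name_a, name_b):
    """True when at least one normalized name part appears in both names."""
    return bool(normalized_parts(name_a) & normalized_parts(name_b))
```

In both cases a candidate pair would still need equal team_id values, as the steps above require, before being accepted.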