How to handle ambiguous matches
Some inputs genuinely match more than one entity. "Congo" maps to both the Democratic Republic of the Congo (country/COD) and the Republic of the Congo (country/COG) at equal confidence. resolvekit doesn't silently pick one — it tells you there's a collision and lets you decide what to do.
The three behaviors
resolve_id controls how ambiguity is handled through its on_ambiguous parameter, which accepts "raise", "null", or "best". The default is "raise".
Raise (default): catch and inspect
import resolvekit as rk
from resolvekit import AmbiguousResolutionError

try:
    rk.resolve_id("Congo")
except AmbiguousResolutionError as err:
    print(f"{len(err.candidates)} candidates:")
    for c in err.candidates[:2]:
        print(f" {c.entity_id:<20} {c.canonical_name}")
Output:
err.candidates is a list of CandidateSummary objects, each with .entity_id, .canonical_name, and .confidence. Use it to route the input to a human-review queue or to build a correction map.
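The routing described above might look like the following sketch. It does not import resolvekit: Candidate, AmbiguousError, and stub_resolve_id are hypothetical stand-ins for CandidateSummary, AmbiguousResolutionError, and rk.resolve_id, and the correction-map shape (a dict keyed by case-folded input) is an assumption, as is the country/DEU ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Stand-in for CandidateSummary, with the three attributes named above."""
    entity_id: str
    canonical_name: str
    confidence: float


class AmbiguousError(Exception):
    """Stand-in for AmbiguousResolutionError; carries the colliding candidates."""

    def __init__(self, candidates):
        super().__init__(f"{len(candidates)} candidates")
        self.candidates = candidates


def stub_resolve_id(text):
    """Hypothetical resolver: 'Congo' collides, 'Germany' resolves cleanly."""
    key = text.strip().casefold()
    if key == "congo":
        raise AmbiguousError([
            Candidate("country/COD", "Democratic Republic of the Congo", 0.91),
            Candidate("country/COG", "Republic of the Congo", 0.91),
        ])
    return {"germany": "country/DEU"}[key]


def resolve_or_queue(text, resolve, corrections, review_queue):
    """Check the correction map first; queue ambiguous inputs for a human."""
    key = text.strip().casefold()
    if key in corrections:  # a reviewer already decided this input
        return corrections[key]
    try:
        return resolve(text)
    except AmbiguousError as err:
        review_queue.append((text, [c.entity_id for c in err.candidates]))
        return None


queue = []
# First pass: no correction exists yet, so the input lands in the queue.
first = resolve_or_queue("Congo", stub_resolve_id, {}, queue)
# Later pass: a reviewer has recorded a decision (illustrative choice only).
second = resolve_or_queue("Congo", stub_resolve_id, {"congo": "country/COD"}, queue)
```

With the real library, the resolve argument would be rk.resolve_id and the except clause would catch AmbiguousResolutionError.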
Null: return None and handle downstream
Use "null" in pipelines where unresolved rows are handled by a later step — for example, logged for manual review or skipped entirely.
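A minimal sketch of that pattern follows. It is written against a stub so it runs without the library: stub_resolve_id is a hypothetical stand-in that mimics what resolve_id is described as doing with on_ambiguous="null", and the country/USA and country/DEU IDs are assumed by analogy with country/COD.

```python
def stub_resolve_id(value, on_ambiguous="null"):
    """Hypothetical stand-in for resolve_id in "null" mode."""
    known = {"United States": "country/USA", "Germany": "country/DEU"}
    # 'Congo' is deliberately missing from the table, so it comes back as None.
    return known.get(value)


def split_resolved(values, resolve):
    """Separate inputs that resolved from those a later step must handle."""
    resolved, unresolved = {}, []
    for value in values:
        entity_id = resolve(value)
        if entity_id is None:
            unresolved.append(value)  # log, skip, or send to manual review
        else:
            resolved[value] = entity_id
    return resolved, unresolved


resolved, unresolved = split_resolved(
    ["United States", "Congo", "Germany"], stub_resolve_id
)
```

To run this against the library, the resolve argument could be a small wrapper such as lambda v: rk.resolve_id(v, on_ambiguous="null"), assuming on_ambiguous is accepted as a keyword argument.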
Best: take the top candidate
"best" returns whichever candidate has the highest confidence score. When two candidates are tied, as COD and COG are, the tie is broken by internal ranking heuristics, so there is no guarantee which one you get. Use "best" only when a wrong answer is better than no answer (reporting, fuzzy deduplication), not when accuracy matters.
Inspecting ambiguity without resolving
rk.resolve() returns a ResolutionResult whether or not the input is ambiguous. Check .is_ambiguous before acting on the result:
import resolvekit as rk

r = rk.resolve("Congo")
r.is_ambiguous  # True
r.status        # 'ambiguous'

# Inspect candidates
for c in r.candidates[:2]:
    print(f"{c.entity_id} ({c.canonical_name}) conf={c.confidence:.2f}")
Output:
r.best_candidate returns the first candidate without committing to a resolution.
Exploring candidates before deciding
When you're not sure which entities could match a term, use diagnostics.search to run a dry lookup against the loaded packs:
Output:
[
    CandidateSummary('country/COD', conf=0.91 [geo] (3 evidence)),
    CandidateSummary('country/COG', conf=0.91 [geo] (3 evidence)),
]
This is useful interactively when building a correction map — you can see all plausible matches and decide which entity IDs to assign by hand.
Narrowing with context
ResolutionContext lets you constrain the candidate set before resolution. entity_types filters to a specific entity type, which prunes candidates that don't match — useful when a query like "Congo" could match both countries and sub-national regions depending on loaded packs:
import resolvekit as rk
from resolvekit import ResolutionContext

ctx = ResolutionContext(entity_types=["geo.country"])
r = rk.resolve("Congo", context=ctx)
r.status  # 'ambiguous'

for c in r.candidates:
    print(f"{c.entity_id} {c.canonical_name}")
Output:
The input is still ambiguous — context can't pick between two genuine countries with the same name — but the noise from sub-national regions is gone. A follow-up call with on_ambiguous="best" or a correction map can finish the job.
Why
Context doesn't lower confidence thresholds or guess. It prunes candidates that fail a hard constraint (wrong type, wrong parent). If two candidates both pass, resolution stays ambiguous.
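The idea of a hard constraint can be illustrated with a toy filter. This is a conceptual sketch only, not the library's pipeline: the Candidate class, its entity_type field, the region/EXAMPLE ID, and the status strings other than 'ambiguous' are all hypothetical, and real ambiguity detection presumably also looks at scores rather than just counting survivors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Toy candidate; entity_type is a hypothetical field for illustration."""
    entity_id: str
    entity_type: str
    confidence: float


def prune_by_type(candidates, entity_types):
    """Drop candidates whose type is not allowed; scores are left untouched."""
    allowed = set(entity_types)
    return [c for c in candidates if c.entity_type in allowed]


def status_of(candidates):
    """Simplified status: more than one survivor means still ambiguous."""
    if not candidates:
        return "no_match"  # hypothetical label
    return "resolved" if len(candidates) == 1 else "ambiguous"


pool = [
    Candidate("country/COD", "geo.country", 0.91),
    Candidate("country/COG", "geo.country", 0.91),
    Candidate("region/EXAMPLE", "geo.region", 0.80),  # made-up sub-national match
]

survivors = prune_by_type(pool, ["geo.country"])
# The region is gone, but both countries pass the constraint,
# so the filter alone cannot finish the job.
status = status_of(survivors)
```

The filter removes the region but never changes a confidence value, which is why the two countries remain tied afterwards.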
Snapping to a known candidate list
When you already know the valid options from a dataset or a previous lookup, use rk.snap to pick the closest match from that list:
import resolvekit as rk
r = rk.resolve("Congo")
candidate_ids = [c.entity_id for c in r.candidates[:2]]
# ['country/COD', 'country/COG']
rk.snap(query="DR Congo", candidates=candidate_ids)
# 'country/COD'
rk.snap(query="Republic of Congo", candidates=candidate_ids)
# 'country/COG'
snap resolves each candidate from your list and returns the one whose name or aliases most closely match the query. It returns None when nothing clears the max_distance threshold (default 0.5).
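To make the threshold behavior concrete, here is a rough sketch of closest-match selection with a distance cutoff. It is not how snap is implemented: the distance metric (one minus difflib's similarity ratio), the NAMES alias table, and the snap_sketch function are assumptions chosen for illustration, and only the 0.5 default comes from the text above.

```python
from difflib import SequenceMatcher

# Hypothetical name/alias table; the real library would look these up itself.
NAMES = {
    "country/COD": ["Democratic Republic of the Congo", "DR Congo"],
    "country/COG": ["Republic of the Congo", "Congo-Brazzaville"],
}


def distance(a, b):
    """String distance in [0, 1]; 0 means identical after case-folding."""
    return 1.0 - SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def snap_sketch(query, candidates, names=NAMES, max_distance=0.5):
    """Return the candidate whose closest name is nearest to the query."""
    best_id, best_dist = None, None
    for entity_id in candidates:
        # A candidate is as close as its best-matching name or alias.
        d = min(distance(query, name) for name in names[entity_id])
        if best_dist is None or d < best_dist:
            best_id, best_dist = entity_id, d
    if best_dist is None or best_dist > max_distance:
        return None  # nothing clears the threshold
    return best_id


ids = ["country/COD", "country/COG"]
dr = snap_sketch("DR Congo", ids)
roc = snap_sketch("Republic of Congo", ids)
miss = snap_sketch("Japan", ids)
```

A lower max_distance makes the sketch stricter: more queries come back as None instead of being forced onto the nearest candidate.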
Handling ambiguity in bulk
rk.bulk defaults to on_ambiguous="null" — ambiguous rows become None rather than raising:
import pandas as pd
import resolvekit as rk
countries = pd.Series(["United States", "Congo", "Germany"])
iso3s = rk.bulk(values=countries, to="iso3")
# 0 USA
# 1 None
# 2 DEU
# dtype: object
# Flag the rows that need attention
ambiguous_mask = countries.notna() & iso3s.isna()
print(countries[ambiguous_mask].to_list())
# ['Congo']
Switch to on_ambiguous="raise" if you want the job to fail fast on any ambiguous input, or "best" to accept the top candidate throughout.
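One possible middle ground between "null" and "best" is to accept the top candidate only when it clearly beats the runner-up. The sketch below is an assumption layered on top of the candidate lists, not a resolvekit feature: Candidate is a stand-in for CandidateSummary, and accept_if_clear and its min_margin parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Stand-in for CandidateSummary (only the fields used here)."""
    entity_id: str
    confidence: float


def accept_if_clear(candidates, min_margin=0.05):
    """Return the top entity_id, or None when the lead is too narrow."""
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1:
        lead = ranked[0].confidence - ranked[1].confidence
        if lead < min_margin:
            return None  # effectively a tie: leave the row for review
    return ranked[0].entity_id


# Exact tie, as with COD and COG: rejected.
tied = accept_if_clear([
    Candidate("country/COD", 0.91),
    Candidate("country/COG", 0.91),
])

# Clear winner (made-up scores): accepted.
clear = accept_if_clear([
    Candidate("country/COG", 0.60),
    Candidate("country/COD", 0.95),
])
```

Rows that come back as None can then be handled the same way as any other unresolved value.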
Heads up
"best" in bulk commits to the top-ranked candidate for every ambiguous row, with no per-row review. Use rk.resolve(value).candidates to inspect a sample of ambiguous inputs before enabling "best" on data where entity accuracy matters.
Next
- Resolver reference: ResolutionContext fields and on_ambiguous across all resolver methods.
- How resolution works: why two candidates can score identically and how the pipeline decides what to surface as ambiguous versus resolved.