How to clean a country column in a DataFrame¶
Turn a messy country column into canonical ISO codes or entity IDs in one call. Polars works identically — swap pd.Series for pl.Series and the rest is the same.
Before you start¶
uv add 'resolvekit[pandas]'/pip install 'resolvekit[pandas]'(orresolvekit[polars]for Polars)- Bundled geo data ships in the wheel — no download needed for country-level resolution
The result¶
Given this input:
| # | country |
|---|---|
| 0 | United States |
| 1 | Cote dIvoire |
| 2 | Brasil |
| 3 | Germany |
| 4 | n/a |
rk.bulk produces:
| # | country | iso3 |
|---|---|---|
| 0 | United States | USA |
| 1 | Cote dIvoire | CIV |
| 2 | Brasil | BRA |
| 3 | Germany | DEU |
| 4 | n/a | None |
n/a hits the sentinel blocklist and returns None without touching the resolution pipeline. Typo variants like Brasil and diacritic-stripped forms like Cote dIvoire resolve through fuzzy matching.
The recipe¶
import pandas as pd
import resolvekit as rk
df = pd.DataFrame({
"country": ["United States", "Cote dIvoire", "Brasil", "Germany", "n/a"]
})
df["iso3"] = rk.bulk(values=df["country"], to="iso3")
to="iso3" returns a pd.Series in the same order as the input, with None for any row that didn't resolve. Other valid pivots: "iso2", "name", "flag", "dcid", "aliases".
Use bulk, not apply¶
Avoid .apply for country columns:
# ✅ Deduplicates automatically — resolves each unique value once
df["iso3"] = rk.bulk(values=df["country"], to="iso3")
# ❌ Resolves every row, including duplicates
df["iso3"] = df["country"].apply(lambda x: rk.resolve_id(x, on_ambiguous="null"))
Survey and donor datasets repeat the same country names thousands of times. bulk deduplicates before resolving: 100 000 rows with 20 unique countries costs about the same as resolving those 20 countries once. The speedup scales with repetition in your column — the higher the row-to-unique ratio, the more work bulk eliminates versus row-by-row apply.
The apply path also returns entity IDs (country/USA), not code values, which means you'd need a separate pivot step anyway.
Handle unresolved rows¶
By default, unresolved rows get None. Use not_found to replace them with a literal sentinel instead:
df["iso3"] = rk.bulk(values=df["country"], to="iso3", not_found="???")
# n/a -> "???" instead of None
Any string works. not_found="raise" raises ResolutionError on the first miss — useful when your input is supposed to be clean.
Handle ambiguous names¶
Some names match multiple entities. Congo resolves to both the Democratic Republic and the Republic of Congo. The default on_ambiguous="null" returns None for these rows. Use "best" to take the top-ranked candidate:
df["iso3"] = rk.bulk(values=df["country"], to="iso3", on_ambiguous="best")
# Congo -> "COD" (Democratic Republic of the Congo)
Use "raise" to surface them as AmbiguousResolutionError during a data-quality pass.
Inspect failures¶
to="iso3" gives you codes directly but hides per-row detail. Pass to=None to get a BulkResult with full diagnostics:
result = rk.bulk(values=df["country"], to=None)
# BulkResult(total=5, resolved=4, no_match=1, ambiguous=0, error=0, kind='pandas')
summary = result.summary()
# ResolutionSummary(total=5, resolved=4, ambiguous=0, no_match=1, error=0)
detail = result.unnest()
# pd.DataFrame with columns: status, entity_id, confidence, pack_id, query_text
result.failures gives a sub-result containing only the non-resolved rows, so you can inspect or re-attempt just the misses.
Input code systems¶
If your column already contains ISO 2-letter codes, skip fuzzy matching entirely:
df["iso3"] = rk.bulk(values=df["iso2_col"], to="iso3", from_system="iso2")
# "US" -> "USA", "DE" -> "DEU", "FR" -> "FRA"
Common from_system values: "iso2", "iso3", "iso_numeric", "dcid", "wikidata". Any code system the loaded packs carry works — run rk.default().code_systems() for the full list.
Next¶
bulkreference — full parameter list, output shapes, andBulkResultAPI.- Handle ambiguous matches — inspect candidates, override with context, and decide a tiebreak policy.
- Convert between code systems — pivot between ISO 2, ISO 3, numeric, DCID, Wikidata, and other code systems.
- Extract entities from free text — when your column contains running text rather than clean country names, use
parse()/parse_bulk()to detect and link every entity mention with character offsets.