Sorting & Ranking¶
Ordering a DataFrame by column values (or index), and assigning ranks.
Sort by a single column — sort_values¶
import pandas as pd
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave","Eve"],
"age": [25, 30, 35, 28, 22],
"salary": [50000, 80000, 75000, 60000, 90000],
})
# Ascending (default)
print(df.sort_values("age"))
print()
# Descending
print(df.sort_values("salary", ascending=False))
Sort by multiple columns¶
import pandas as pd
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave","Eve","Frank"],
"city": ["Mumbai","Delhi","Mumbai","Delhi","Pune","Mumbai"],
"salary": [50000, 80000, 75000, 60000, 90000, 100000],
})
# Sort by city (asc), then within each city by salary (desc)
print(df.sort_values(
by=["city", "salary"],
ascending=[True, False],
))
Sort by index — sort_index¶
import pandas as pd
df = pd.DataFrame({
"val": [3, 1, 2, 4]
}, index=["d", "b", "c", "a"])
print(df.sort_index())
print()
# Descending
print(df.sort_index(ascending=False))
NaN handling¶
By default NaN goes to the bottom. Override with na_position:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave"],
"score": [85, np.nan, 90, 70],
})
print("NaN at bottom (default):")
print(df.sort_values("score"))
print()
print("NaN at top:")
print(df.sort_values("score", na_position="first"))
In-place sort¶
import pandas as pd
df = pd.DataFrame({"x": [3, 1, 4, 1, 5]})
# Default — returns a new sorted DataFrame
sorted_df = df.sort_values("x")
print("original:", df["x"].tolist())
print("sorted :", sorted_df["x"].tolist())
# In-place
df.sort_values("x", inplace=True)
print("after in-place:", df["x"].tolist())
Reset the index after sorting¶
Sorting reshuffles rows but keeps the old index. Often you want a clean integer index:
import pandas as pd
df = pd.DataFrame({
"x": [3, 1, 4, 1, 5],
})
sorted_df = df.sort_values("x").reset_index(drop=True)
print(sorted_df)
drop=True discards the old index. Without it, the old index becomes a new "index" column.
Sort with a custom key¶
Sort by string length, by month name, by anything computable:
import pandas as pd
df = pd.DataFrame({
"name": ["bee", "Aardvark", "cat", "Donkey", "elephant"],
})
# Case-insensitive
print(df.sort_values("name", key=lambda s: s.str.lower()))
print()
# By name length
print(df.sort_values("name", key=lambda s: s.str.len()))
Top / Bottom N — nlargest / nsmallest¶
Faster than sort + slice when you only need the top N:
import pandas as pd
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave","Eve","Frank","Grace","Henry"],
"salary": [50000, 80000, 75000, 60000, 90000, 100000, 70000, 110000],
})
# Top 3 by salary
print(df.nlargest(3, "salary"))
print()
# Bottom 3 by salary
print(df.nsmallest(3, "salary"))
Sorting categorical with custom order¶
If a column has a logical order (small < medium < large), tell Pandas:
import pandas as pd
df = pd.DataFrame({
"size": ["medium", "small", "large", "small", "large", "medium"],
"price": [25, 10, 50, 12, 55, 23],
})
# Default: alphabetical (large, medium, small) — wrong!
print(df.sort_values("size"))
print()
# Categorical with explicit order
df["size"] = pd.Categorical(df["size"], categories=["small","medium","large"], ordered=True)
print(df.sort_values("size"))
Ranking — .rank()¶
Assign ranks to values:
import pandas as pd
df = pd.DataFrame({
"name": ["Alice","Bob","Carol","Dave"],
"score": [85, 92, 78, 92], # Bob and Dave tied
})
df["rank"] = df["score"].rank(ascending=False)
df["rank_min"] = df["score"].rank(ascending=False, method="min")
df["rank_dense"] = df["score"].rank(ascending=False, method="dense")
print(df)
| Method | Behavior with ties |
|---|---|
average (default) |
average rank of tied positions (e.g. 1.5, 1.5) |
min |
lowest of the tied ranks (e.g. 1, 1, 3) |
max |
highest of the tied ranks |
first |
order ties by appearance |
dense |
like min but no gaps in the rank sequence |
Reverse a DataFrame¶
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
print(df[::-1]) # reverse rows
print()
print(df.iloc[::-1]) # same — explicit
Sorting groups — common pattern¶
Within each city, sort employees by salary descending:
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
df = pd.DataFrame({
"name": [f"User{i}" for i in range(1, 11)],
"city": rng.choice(["Mumbai","Delhi","Pune"], size=10),
"salary": rng.integers(40_000, 200_000, size=10),
})
result = df.sort_values(["city", "salary"], ascending=[True, False])
print(result)
Top-K per group¶
Use groupby + head (after sorting):
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)
df = pd.DataFrame({
"city": rng.choice(["Mumbai","Delhi","Pune"], size=15),
"name": [f"User{i}" for i in range(1, 16)],
"salary": rng.integers(40_000, 200_000, size=15),
})
# Top 2 highest-paid per city
result = (
df
.sort_values("salary", ascending=False)
.groupby("city")
.head(2)
)
print(result.sort_values(["city", "salary"], ascending=[True, False]))
Cheatsheet¶
| Goal | Code |
|---|---|
| Sort by a column | df.sort_values("c") |
| Sort descending | df.sort_values("c", ascending=False) |
| Sort by multiple | df.sort_values(["c1","c2"], ascending=[True,False]) |
| Sort by index | df.sort_index() |
| Reset index | .reset_index(drop=True) |
| Top-N rows | df.nlargest(n, "c") |
| Bottom-N rows | df.nsmallest(n, "c") |
| Rank a column | df["c"].rank() |
| Custom key | df.sort_values("c", key=lambda s: s.str.lower()) |
| NaN placement | na_position="first" or "last" |
| Top-N per group | df.sort_values(...).groupby("g").head(n) |
Common pitfalls¶
- ❗ Sort doesn't modify by default — assign the result or use
inplace=True. - ❗ String columns sort alphabetically — "10" sorts before "2". Convert to numeric or pass a key.
- ❗ Old index sticks after sorting — use
.reset_index(drop=True)if you want a clean integer index. - ❗
rank()returns floats by default — usemethod="min"and cast to int if you want plain ranks. - ❗ Multiple-column sort with mixed
ascending— pass a list matching the columns.
Practice¶
What does this print?
Expected: [1, 2, 3]
Get the top 3 highest salaries (use nlargest)
Expected: [110000, 100000, 90000]
Quiz — Quick check¶
What you remember
Q1. What does df.sort_values("x") return?
- Modifies
dfin place and returns None - A new sorted DataFrame —
dfis unchanged - An error if
xhas NaN - A sorted Series
Why: All Pandas methods are immutable by default. To modify in place, pass
inplace=Trueor reassign:df = df.sort_values("x").
Q2. Where does NaN go in a default ascending sort?
- Top
- Bottom
- Removed
- Raises an error
Why: Pandas sorts NaN to the bottom in both ascending and descending order. Override with
na_position="first".
Q3. What's faster than df.sort_values("x", ascending=False).head(3) for "top 3 by x"?
-
df.head(3).sort_values("x")(wrong result) -
df.nlargest(3, "x") -
df[df["x"] > df["x"].mean()] -
df.sort_index().head(3)
Why:
nlargestuses a partial-sort algorithm — O(N log K) instead of O(N log N). For huge DataFrames where K is small, the difference is significant.
Common doubts¶
Why does my index look weird after sorting?
Because sorting reorders rows but keeps their original labels. The first row in the result might have index 42 if that row was sorted to the top. Add .reset_index(drop=True) for a clean 0..N-1 index.
How do I sort a string column case-insensitively?
Pass a key= argument: df.sort_values("name", key=lambda s: s.str.lower()). Without the key, "Zebra" comes before "alpha" because uppercase letters have lower Unicode codepoints than lowercase.
What's rank for?
Assigning a position (1st, 2nd, 3rd) within a Series. Used for percentile calculations, leaderboards, and ML feature engineering. Pay attention to method= for tie handling — min, dense, and average give different answers.