Intuition
Unlike part (a) where the objective was purely $\sum_j \pi_j N_j$, the objective in part (b) takes the form $\sum_j \pi_j (N_j - \log \pi_j)$. We can split it into two competing forces:
- The Linear Pull ($\sum_j \pi_j N_j$): This part wants to place all the probability mass on the single state $j$ with the largest score $N_j$. If this term were alone, the solution would be purely deterministic (assigning $\pi_j = 1$ to the winner and $\pi_j = 0$ to everything else).
- The Entropy Push ($-\sum_j \pi_j \log \pi_j$): This is the formula for Shannon Entropy. Entropy measures uncertainty or "smoothness." Maximizing entropy pushes the probabilities to be as uniform (spread out) as possible, actively resisting the urge to group all probability mass onto a single winner.
By combining these two, the optimization problem behaves as a trade-off: favor the states with high $N_j$, but maintain a diverse distribution.
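The trade-off can be checked numerically. The sketch below (illustrative, not from the original text; the scores in `N` are arbitrary example values) evaluates the part (b) objective at a one-hot distribution and at the softmax distribution, showing that softmax scores higher once the entropy term is included:

```python
import math

def objective(pi, N):
    """Sum of pi_j * (N_j - log pi_j); zero-probability terms contribute nothing."""
    return sum(p * (n - math.log(p)) for p, n in zip(pi, N) if p > 0)

N = [2.0, 1.0, 0.0]  # arbitrary example scores

# Purely greedy (one-hot) choice: maximizes the linear part alone.
one_hot = [1.0, 0.0, 0.0]

# Softmax: gives up a little linear score to gain entropy.
Z = sum(math.exp(n) for n in N)
softmax = [math.exp(n) / Z for n in N]

print(objective(one_hot, N))   # 2.0 (just the top score)
print(objective(softmax, N))   # larger, and equal to log(Z)
```

The second value equals $\log \sum_k \exp(N_k)$ exactly, since substituting $\pi_j = \exp(N_j)/Z$ makes every term $N_j - \log \pi_j$ collapse to $\log Z$.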
The Softmax Function
The result directly yields the Softmax Function:

$$\pi_j = \frac{\exp(N_j)}{\sum_k \exp(N_k)}$$
This function is ubiquitous in Machine Learning (particularly neural networks), where it transforms arbitrary scores (logits) into a valid probability distribution.
- Non-negativity: The exponential function ensures that even a very negative logit score maps to a strictly positive probability.
- Sum to 1: The denominator normalizes the numerators so they act as relative fractions of the whole.
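Both properties can be verified directly. This is a minimal sketch of softmax; subtracting the maximum score before exponentiating is a standard numerical-stability trick (it cancels in the ratio, so the result is unchanged):

```python
import math

def softmax(scores):
    """Map arbitrary real scores (logits) to a probability distribution."""
    m = max(scores)                             # stability shift; cancels in the ratio
    exps = [math.exp(s - m) for s in scores]    # non-negativity: exp(x) > 0 always
    total = sum(exps)
    return [e / total for e in exps]            # sum to 1: normalized by the total

probs = softmax([-3.0, 0.0, 2.5])
print(probs)        # every entry strictly positive, even for the -3.0 logit
print(sum(probs))   # 1.0 (up to floating point)
```

Note that softmax preserves the ordering of the scores: the largest logit gets the largest probability, but never all of it.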
```mermaid
graph TD
    A["Objective: $\sum \pi_j (N_j - \log \pi_j)$"] --> B["Linear Part: $\pi_j N_j$<br>(Favors large $N_j$)"]
    A --> C["Entropy Part: $-\pi_j \log \pi_j$<br>(Encourages diversity)"]
    B --> D{Constraint: Sum to 1}
    C --> D
    D --> E[Softmax Distribution]
    E --> F["$$\pi_j \propto \exp(N_j)$$"]
```
Common Pitfalls
- Logarithm Rules: A frequent error occurs when differentiating $-\pi_j \log \pi_j$. Applying the Product Rule ($(uv)' = u'v + uv'$) avoids the common mistake of concluding the derivative is merely $-\log \pi_j$. Instead, it yields $-(\log \pi_j + 1)$.
- Handling the multiplier: Similar to part (a), the Lagrange multiplier must be factored out carefully using exponent rules: $\exp(N_j + \lambda - 1) = \exp(N_j)\exp(\lambda - 1)$ separates multiplicatively rather than additively.
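The product-rule pitfall above is easy to sanity-check with a finite difference. This sketch (illustrative; the evaluation point $p = 0.3$ is an arbitrary choice) compares the analytic derivative $-(\log p + 1)$ of $-p \log p$ against a central-difference estimate:

```python
import math

def f(p):
    """Single entropy term: -p * log(p)."""
    return -p * math.log(p)

def analytic(p):
    # Product rule: d/dp [-p log p] = -(log p + 1), not just -log p.
    return -(math.log(p) + 1.0)

p, h = 0.3, 1e-6
numeric = (f(p + h) - f(p - h)) / (2 * h)   # central difference, O(h^2) error
print(numeric, analytic(p))                 # the two values agree closely
```

The mistaken answer $-\log p$ would differ from the numeric estimate by exactly 1, so the check catches the dropped term immediately.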