encoder-only (BERT-style):
looks at the full sentence at once, masks random tokens, and tries to reconstruct them.
this is masked language modeling (MLM), and it gives the model bidirectional context: it attends to tokens both left and right of the [MASK] when reconstructing it.
this makes it better suited to understanding tasks like sentence classification than to text generation.
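the masking step above can be sketched in plain python. this is a toy illustration, not the real implementation: actual BERT works on subword ids from a tokenizer, and the vocabulary here is made up. it does follow BERT's published corruption recipe: pick ~15% of tokens, and of those replace 80% with [MASK], 10% with a random token, and leave 10% unchanged.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "on", "mat"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """corrupt ~mask_prob of tokens BERT-style.
    returns (corrupted tokens, labels) where labels[i] is the
    original token if position i was selected, else None
    (no loss is computed at unselected positions)."""
    rng = random.Random(seed)
    out, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # model must reconstruct this
            r = rng.random()
            if r < 0.8:
                out.append(MASK)        # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(rng.choice(VOCAB))  # 10%: random token
            else:
                out.append(tok)         # 10%: keep unchanged
        else:
            labels.append(None)
            out.append(tok)
    return out, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens)
print(corrupted)
print(labels)
```

during pretraining, the encoder sees `corrupted` and is trained to predict the original token at every position where `labels` is not None.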