If you let each pixel vary independently, the space of possible 1024x1024 images is 1,048,576-dimensional (three times that if you count each RGB color channel separately), but the vast hypermajority of those images aren't photorealistic human faces. Letting each pixel vary independently is the wrong way to think about it: changing the lighting or pose changes a lot of pixels in what humans would regard as images of "the same" face. So instead, our machine-learning algorithms learn a [compressed](https://www.lesswrong.com/posts/ex63DPisEjomutkCw/msg-len) representation of what makes the tiny subspace (relative to images-in-general) of _faces in particular_ similar to each other. That [latent space](https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d) is a lot smaller (say, 512 dimensions), but still rich enough to embed the distinctions that humans notice: [you can find a hyperplane that separates](https://youtu.be/dCKbRCUyop8?t=1433) smiling from non-smiling faces, or glasses from no-glasses, or young from old, or different races, or female from male. Sliding along the [normal vector](https://en.wikipedia.org/wiki/Normal_(geometry)) to that [hyperplane](https://en.wikipedia.org/wiki/Hyperplane) gives the desired transformation: producing images that are "more female" (as the model has learned that concept) while keeping "everything else" the same.
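
To make the geometry concrete, here is a minimal sketch of the "find a hyperplane, then slide along its normal" step. Everything in it is an assumption for illustration: the latent codes are synthetic Gaussian vectors rather than the output of a trained face model, the binary labels are made up, and the least-squares classifier is just one simple way to get a separating hyperplane. The names `fit_hyperplane` and `slide` are hypothetical, not any library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 512   # "say, 512 dimensions"
N_SAMPLES = 2000

# ASSUMPTION: synthetic stand-in for real latent codes. Two clusters are
# offset from each other along one hidden direction, playing the role of
# a binary attribute (smiling vs. not, glasses vs. none, and so on).
true_direction = rng.normal(size=LATENT_DIM)
true_direction /= np.linalg.norm(true_direction)
labels = rng.integers(0, 2, size=N_SAMPLES)          # 0 or 1 per sample
signs = 2.0 * labels - 1.0                           # map to -1 or +1
latents = rng.normal(size=(N_SAMPLES, LATENT_DIM)) + 3.0 * np.outer(signs, true_direction)


def fit_hyperplane(z, y):
    """Fit a least-squares linear classifier on latent codes.

    Returns (normal, offset) such that the hyperplane is the set of points
    where normal @ z + offset == 0, with `normal` scaled to unit length.
    """
    targets = 2.0 * y - 1.0
    design = np.hstack([z, np.ones((len(z), 1))])    # extra column for the bias term
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    w, b = weights[:-1], weights[-1]
    scale = np.linalg.norm(w)
    return w / scale, b / scale


def slide(z, normal, alpha):
    """Move a latent code a distance `alpha` along the hyperplane's unit normal."""
    return z + alpha * normal


normal, offset = fit_hyperplane(latents, labels)

z = latents[0]
z_moved = slide(z, normal, alpha=4.0)

# Signed distance to the hyperplane before and after the edit.
dist_before = z @ normal + offset
dist_after = z_moved @ normal + offset

# The part of the code orthogonal to the normal is what "everything else" means here.
rest_before = z - (z @ normal) * normal
rest_after = z_moved - (z_moved @ normal) * normal

# Fraction of training codes that land on the correct side of the hyperplane.
accuracy = np.mean(((latents @ normal + offset) > 0) == (labels == 1))
```

Because `normal` has unit length, sliding by `alpha` changes the signed distance to the hyperplane by exactly `alpha` and leaves the orthogonal component of the latent code untouched. In a real pipeline, `z_moved` would then be fed to the generator to render the edited face; whether "everything else" really stays the same in the image depends on how well the learned attributes are disentangled, which this toy example does not model.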