There is a serious amount of (dis)information in this thread... Both regarding how chroma subsampling schemes are laid out, and how normal Bayer-based sensors record video images in consumer-grade DSLR devices.
Now, in order from capture to recording:
Each manufacturer (and even each separate camera model!) has its own way of reading the sensor to create the initial image that the video image is "built" from.
*Some cameras line-skip, since that's a very easy (and bandwidth-economical) way to read a sensor FAST. Unfortunately, this gives lots of noise (much of the information actually recorded on the sensor is just thrown out unused) and lots of orientation-dependent aliasing. The exact aliasing depends on how the manufacturer chooses to scale the image from the original resolution of the sensor.
*Some cameras have other means of restricting the number of pixels/second they have to read. Those can include true binning, patterned subsampling and many other schemes. Most of those schemes can also be reverse-engineered if you know what you're doing.
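To make the difference between the two readout styles concrete, here is a minimal sketch in Python. It assumes a plain single-channel grid (the Bayer colour pattern is ignored for simplicity), and `line_skip` and `bin_pixels` are hypothetical helper names, not anything a real camera exposes. It only illustrates why skipping aliases fine detail while binning averages it.

```python
def line_skip(sensor, step):
    """Keep every `step`-th row and column; all other samples are discarded."""
    return [row[::step] for row in sensor[::step]]


def bin_pixels(sensor, step):
    """Average each step x step block, so every sample contributes."""
    out = []
    for y in range(0, len(sensor) - step + 1, step):
        out_row = []
        for x in range(0, len(sensor[0]) - step + 1, step):
            block = [sensor[y + dy][x + dx]
                     for dy in range(step) for dx in range(step)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# Fine vertical stripes, one pixel wide: columns alternate 100, 0, 100, 0...
sensor = [[100 if x % 2 == 0 else 0 for x in range(8)] for _ in range(8)]

skipped = line_skip(sensor, 2)   # only ever samples the bright columns
binned = bin_pixels(sensor, 2)   # averages bright and dark columns together

print(skipped[0])  # [100, 100, 100, 100]
print(binned[0])   # [50.0, 50.0, 50.0, 50.0]
```

With a step of 2, the skipped readout uses only a quarter of the samples and turns the stripes into a solid bright field (aliasing), while the binned readout uses every sample and lands on the true average.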
Almost ALL consumer-oriented cameras do after this initial sifting of information. Most of the quality loss, except for the compression and chroma subsampling, occurs here! At the second stage of the image pipeline you have a complete RGB 4-4-4 image at some (smaller) pixel scale. The exact size depends on the original resolution of the sensor vs the subsampling method chosen, but it's often around 1200-1350 pixels on the X-axis and 800-950 pixels on the Y-axis.
Those are true 4-4-4 RGB images! But they aren't true HD resolution... Which is why most DSLR images are quite a lot softer and less detailed than true 1080p video.
The video compression engine accepts RGB as input; doing a YCbCr (YUV) transform before sending the image stream to the encoder would just be a big waste of effort. At the encoder input, the 4-4-4 RGB image is resampled into 1920x1080 Y-channel data and [some] resolution CbCr (UV) data before it's sent on to the compression encoding.
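As a rough illustration of what happens at the encoder input, here is a minimal Python sketch. It assumes BT.709 luma coefficients and a 4:2:0 layout (chroma averaged over 2x2 blocks); the post does not say which chroma resolution a given camera uses, so both are assumptions, and the function names are made up for the example. The scaling to 1920x1080 is left out.

```python
# BT.709 luma coefficients (an assumption; commonly used for HD video).
KR, KG, KB = 0.2126, 0.7152, 0.0722


def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel (0..1 floats) to Y, Cb, Cr (Cb/Cr in -0.5..0.5)."""
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2 * (1 - KB))
    cr = (r - y) / (2 * (1 - KR))
    return y, cb, cr


def subsample_420(plane):
    """Average each 2x2 block: half resolution on both axes."""
    h, w = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1] +
              plane[y + 1][x] + plane[y + 1][x + 1]) / 4
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]


def to_encoder_planes(rgb):
    """Full-resolution Y plane plus subsampled Cb and Cr planes."""
    converted = [[rgb_to_ycbcr(*px) for px in row] for row in rgb]
    y_plane = [[px[0] for px in row] for row in converted]
    cb_plane = subsample_420([[px[1] for px in row] for row in converted])
    cr_plane = subsample_420([[px[2] for px in row] for row in converted])
    return y_plane, cb_plane, cr_plane


image = [[(1.0, 0.0, 0.0)] * 4 for _ in range(4)]  # 4x4 of pure red
y_plane, cb_plane, cr_plane = to_encoder_planes(image)
print(len(y_plane), len(y_plane[0]))    # Y keeps every pixel
print(len(cb_plane), len(cb_plane[0]))  # chroma is halved on both axes
```

The point is simply that Y stays at full resolution while Cb and Cr carry a quarter as many samples each; the RGB image going in is still 4-4-4.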
So, no - you don't need an area of 4x4 Bayer-coded pixels to make 4-4-4 video. You need 2x2 pixels to get full-resolution images for ALL CHANNELS, the definition of 4-4-4. If you do a Bayer interpolation before coding, you only need ONE pixel to get 4-4-4 video... See Nikon D4 1:1 crop video mode next.
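The 2x2 claim can be sketched like this in Python. It assumes an RGGB layout (red top-left, blue bottom-right, two greens on the other diagonal), which is one common Bayer arrangement but not the only one, and `quads_to_rgb` is a hypothetical name for the example. Each 2x2 quad collapses into one pixel that has a real measurement for every channel.

```python
def quads_to_rgb(mosaic):
    """Collapse each 2x2 RGGB quad into one full RGB pixel (no interpolation)."""
    rgb = []
    for y in range(0, len(mosaic) - 1, 2):
        row = []
        for x in range(0, len(mosaic[0]) - 1, 2):
            r = mosaic[y][x]                              # red site
            g = (mosaic[y][x + 1] + mosaic[y + 1][x]) / 2  # two green sites
            b = mosaic[y + 1][x + 1]                      # blue site
            row.append((r, g, b))
        rgb.append(row)
    return rgb


# Two quads side by side: a 2-row by 4-column patch of raw sensor values.
mosaic = [
    [200, 100, 180, 90],
    [120, 50, 110, 40],
]
pixels = quads_to_rgb(mosaic)
print(pixels)  # [[(200, 110.0, 50), (180, 100.0, 40)]]
```

The output has half the sensor resolution on each axis, but every output pixel has its own R, G and B sample, which is all 4-4-4 asks for. A Bayer interpolation instead estimates the two missing channels at every sensor pixel, which is how you get one output pixel per sensor pixel.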
One instance is the 1:1 pixel crop video you can get from a Nikon D4. The video image in that mode is made from a 1920x1080 crop from the central part of the sensor. That image needs to be Bayer-interpolated before it's a full RGB image, but at least it's a full-res image. That's why it's so much better than the large modes in that camera - the large crop modes use line-skipping and lower actual image resolution. The loss from that is much bigger than the loss from having to do a Bayer interpolation.
If you want to read more about chroma subsampling and the exact layouts used, I'd recommend this: