This is called image segmentation; it's also used, e.g., to detect faces in photo software.
You start by making a training set: a set of images where the rectangles are labeled.
I've seen two approaches, depending on what you want. One option is to provide a Boolean mask of the same size as your image, where each pixel is marked true or false according to whether it's part of a rectangle. This makes it a classification task: the model outputs, for each pixel, whether it's part of a rectangle or not, and you use cross-entropy as the loss function to measure how good the prediction is. Here is an example of what you would give as an example during training. By the way, this is called "supervised learning", because you give clear examples of exactly what you want the network to do. In the example below they are trying to teach it to detect bicycles and riders.
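To make the mask idea concrete, here's a minimal sketch (with made-up numbers) of pixel-wise binary cross-entropy between a ground-truth mask and the per-pixel probabilities a model might output. In practice you'd use a library loss (e.g. PyTorch's `BCELoss`), but the arithmetic is just this:

```python
import numpy as np

# Hypothetical 4x4 ground-truth mask: 1 where a pixel belongs to a rectangle.
target = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
], dtype=float)

# Hypothetical per-pixel probabilities the model might output.
pred = np.array([
    [0.1, 0.2, 0.1, 0.1],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.9, 0.2],
    [0.1, 0.1, 0.2, 0.1],
])

eps = 1e-7  # avoid log(0)
# Pixel-wise binary cross-entropy, averaged over the whole image.
bce = -np.mean(target * np.log(pred + eps)
               + (1 - target) * np.log(1 - pred + eps))
print(bce)  # low because pred mostly agrees with target
```

The loss shrinks toward 0 as the predicted probabilities approach the mask, which is exactly the training signal you want per pixel.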
A second method is to provide the corner coordinates of a box in which your doors/windows are. The target output of the network is 4 floats, which makes it a regression task. A common loss function minimizes the mismatch between the rectangle the model suggests and the true rectangle, measured as "intersection over union" (IoU).
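As a sketch of how that overlap measure works (assuming boxes given as `(x1, y1, x2, y2)` corner coordinates, which is one common convention):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to 0 when the boxes don't intersect.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # perfect match -> 1.0
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # partial overlap -> 25/175
```

IoU is 1 for a perfect match and 0 for disjoint boxes, so during training you'd minimize something like `1 - iou` (libraries such as torchvision ship a ready-made `box_iou` for this).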
Once you have a training set of examples, you then typically use a deep "convolutional" neural network to learn the relationship between the input and the target output. The network builds its own filters during training, things like edge detectors. This is a good example of the benefits of deep neural networks: in the old days, "domain experts" would build their own filters for feature extraction, which was laborious and suboptimal. Interestingly, the filters the neural network creates typically end up being things that are well known, e.g. they almost always end up creating Gabor filters in the first layers, and more complicated concepts in higher layers.
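To illustrate what such a filter does, here's a sketch (with a toy image) of a hand-built vertical-edge filter, the Sobel kernel, which is the kind of feature extractor early CNN layers tend to rediscover on their own:

```python
import numpy as np

# Classic hand-built vertical-edge filter (Sobel kernel).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Synthetic 6x6 image: dark on the left half, bright on the right half.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation, just enough to show the idea."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

response = conv2d_valid(img, sobel_x)
# The filter only fires on windows straddling the dark-to-bright boundary.
print(response)
```

A CNN learns kernels like `sobel_x` from data instead of having them specified by hand, and stacks many of them so higher layers can combine edges into corners, corners into rectangles, and so on.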
