Lecture 9. CNN Architectures
Table of Contents
Official CS231n course site: http://cs231n.stanford.edu/index.html
These notes follow the Spring 2017 video lectures (BiliBili); the slide (PPT) resources are from Spring 2018.
Additional extended handout material for this Lecture 9:
Review: LeNet-5
[LeCun et al., 1998]
Case Studies
AlexNet
[Krizhevsky et al. 2012]
The first large convolutional neural network to achieve success in the ImageNet classification competition.
ZFNet
[Zeiler and Fergus, 2013]
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
VGG
[Simonyan and Zisserman, 2014]
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
- Q: Why use smaller filters? (3x3 conv)
- Stack of three 3x3 conv (stride 1) layers has same effective receptive field as one 7x7 conv layer
- But deeper, more non-linearities
- And fewer parameters: $3 * (3^2C^2)$ vs. $7^2C^2$ for C channels per layer
- Q: What is the effective receptive field of three 3x3 conv (stride 1) layers? (Answer: 7x7; each additional stride-1 3x3 conv grows the field by 2, so 3 → 5 → 7. A short sketch follows this list.)
- See this post:
- Details
- ILSVRC'14 2nd in classification, 1st in localization
- Similar training procedure to Krizhevsky 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks
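To make the receptive-field and parameter-count claims above concrete, here is a small Python sketch (my own illustration, not code from the lecture); `C = 256` is an arbitrary example channel count:

```python
def receptive_field(num_layers, kernel_size):
    """Effective receptive field of a stack of identical stride-1 conv layers."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1  # each stride-1 layer grows the field by (k - 1)
    return rf

def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (biases ignored), each layer mapping C channels to C channels."""
    return num_layers * kernel_size ** 2 * channels ** 2

C = 256  # example channel count
print(receptive_field(3, kernel_size=3))   # 7  -> same as a single 7x7 conv
print(receptive_field(1, kernel_size=7))   # 7
print(conv_weights(3, C, num_layers=3))    # 3 * 3^2 * C^2 = 1,769,472
print(conv_weights(7, C, num_layers=1))    #     7^2 * C^2 = 3,211,264
```

The stacked 3x3 layers cover the same 7x7 region with roughly 45% fewer weights, and they interleave two extra non-linearities.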
GoogLeNet
[Szegedy et al., 2014]
Deeper networks, computational efficiency
“Inception module”
- Design a good local network topology (a network within a network) and then stack these modules on top of each other
- Apply parallel filter operations on the input from the previous layer (a sketch follows this list):
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3)
- Concatenate all filter outputs together depth-wise
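As a rough sketch of what such a module looks like in code (my own PyTorch illustration, not code from the lecture; the per-branch filter counts 128/192/96 are assumed example values):

```python
import torch
import torch.nn as nn

class NaiveInceptionModule(nn.Module):
    """Naive Inception module: parallel 1x1 / 3x3 / 5x5 convs plus 3x3 max pooling,
    all padded to preserve spatial size, concatenated along the depth dimension."""
    def __init__(self, in_channels, n1x1=128, n3x3=192, n5x5=96):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, n1x1, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, n3x3, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, n5x5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x), self.branch5(x), self.pool(x)]
        return torch.cat(branches, dim=1)  # depth-wise concatenation

x = torch.randn(1, 256, 28, 28)            # example 28x28x256 input
print(NaiveInceptionModule(256)(x).shape)  # [1, 672, 28, 28]: 128+192+96+256 channels
```

All branches are padded so the spatial size is preserved, which is what allows the depth-wise concatenation.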
Q: What is the problem with this? (Computational cost: the pooling branch preserves the input's full depth, so the concatenated output keeps getting deeper as modules stack, and the convolutions over it become very expensive.)
Solution: “bottleneck” layers that use 1x1 convolutions to reduce feature depth
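A back-of-the-envelope multiply-add count shows why the 1x1 bottleneck helps; the sizes below (28x28x256 input, 96 filters of 5x5, bottleneck down to 64 channels) are assumed example values, not figures quoted from the lecture:

```python
def conv_madds(h, w, in_c, out_c, k):
    """Multiply-adds for a k x k convolution producing an h x w x out_c output."""
    return h * w * out_c * k * k * in_c

H = W = 28  # assumed spatial size of the feature map

# 5x5 conv, 96 filters, applied directly to a 256-channel input
naive = conv_madds(H, W, 256, 96, 5)

# 1x1 "bottleneck" reducing 256 -> 64 channels, then the same 5x5 conv on 64 channels
bottleneck = conv_madds(H, W, 256, 64, 1) + conv_madds(H, W, 64, 96, 5)

print(f"naive:      {naive:,}")       # 481,689,600
print(f"bottleneck: {bottleneck:,}")  # 133,267,456
```

The bottleneck version does the expensive 5x5 convolution over 64 channels instead of 256, cutting the cost by roughly 3.6x in this example.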
Finally, the full GoogLeNet architecture:
ResNet
- ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
[He et al., 2015]
What happens when we continue stacking deeper layers on a “plain” convolutional neural network?
- The deeper model performs worse, but it’s not caused by overfitting!
Hypothesis: the problem is an optimization problem; deeper models are harder to optimize.
- The deeper model should be able to perform at least as well as the shallower model.
- A solution by construction is copying the learned layers from the shallower model and setting additional layers to identity mapping.
**Solution:** Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping.
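Concretely, if $H(x)$ is the desired mapping, the block learns the residual $F(x) = H(x) - x$ and outputs $F(x) + x$ via an identity shortcut. A minimal PyTorch-style sketch of a basic residual block (my own illustration, assuming the input and output shapes match so the shortcut needs no projection):

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Basic two-layer residual block: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity shortcut adds the input back
```

If the conv weights are driven to zero, the block degenerates to the identity mapping, which is exactly the solution-by-construction argument above.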
Finally, the full ResNet architecture:
Bottleneck layers (1x1 convolutions, similar to GoogLeNet) are also used for the deeper variants:
Lastly, performance in practice:
An Analysis of Deep Neural Network Models for Practical Applications, 2017
Other architectures to know…
NiN (Network in Network)
[Lin et al. 2014]
Improving ResNets…
Identity Mappings in Deep Residual Networks
[He et al. 2016]
Wide ResNet
[Zagoruyko et al. 2016]
ResNeXt
[Xie et al. 2016]
Stochastic Depth
[Huang et al. 2016]
“Good Practices for Deep Feature Fusion”
[Shao et al. 2016]
Squeeze-and-Excitation Network (SENet)
[Hu et al. 2017]
Beyond ResNets…
FractalNet: Ultra-Deep Neural Networks without Residuals
[Larsson et al. 2017]
DenseNet: Densely Connected Convolutional Networks
[Huang et al. 2017]
Efficient networks…
SqueezeNet: AlexNet-level Accuracy With 50x Fewer Parameters and <0.5MB Model Size
[Iandola et al. 2017]
Meta-learning: Learning to learn network architectures…
NASNet (Neural Architecture Search with Reinforcement Learning)
[Zoph et al. 2016]
Learning Transferable Architectures for Scalable Image Recognition
[Zoph et al. 2017]