(ICML 2019) EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks


EfficientNet is a highly influential image-classification paper, widely cited for the strong accuracy it achieves at a fraction of the usual computational cost.

For projects constrained by training time or computational resources, EfficientNet is a valuable approach to improving ConvNet performance: it scales convolutional networks efficiently while balancing accuracy against computational cost, making it well suited to real-world AI deployment.


🔗 Research Paper: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks


📌 Key Resources & Reviews

📖 Paper Review Summaries:

💻 Source Code (PyTorch Implementation):

 

ABSTRACT

CNN์€ ํ•œ์ •๋œ ์ž์› ๋‚ด์—์„œ ๊ฐœ๋ฐœ๋˜์–ด์™”์œผ๋ฉฐ, ์ž์›์ด ํ•œ๋„ ๋‚ด์—์„œ ๋” ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ„ํ•ด์„œ ๊ทธ ํฌ๊ธฐ๋ฅผ ํ‚ค์›Œ๊ฐ€๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๋ฐœ์ „๋˜์–ด์™”๋‹ค.

์ด ๋…ผ๋ฌธ์—์„œ๋Š”, model scaling์— ๋Œ€ํ•ด ๋” ๋ช…ํ™•ํžˆ ๋ฐํ˜€๋‚ด๊ธฐ ์œ„ํ•ด ์—ฐ๊ตฌํ•˜๊ฒŒ ๋˜๋ฉฐ,  network์˜ depth, width, ๊ทธ๋ฆฌ๊ณ  resolution์‚ฌ์ด์˜ ๊ด€๊ณ„์— ๋Œ€ํ•œ ๊ท ํ˜•์„ ๋งž์ถฐ์•ผ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค๋Š”๊ฒƒ์„ ์ฒด๊ณ„์ ์œผ๋กœ ๋ฐํ˜€๋‚ธ๋‹ค. 


์ €์ž ์ œ์•ˆ 

  • 'Compound coefficient' 
    • depth, width, resolution์˜ dimension๋“ค์„ ๊ฐ„๋‹จํ•˜๋ฉด์„œ๋„ ๋†’์€ ํšจ์œจ์„ ๋ณด์ด๋Š” ์ƒˆ๋กœ์šด sacaling๋ฐฉ๋ฒ•์œผ๋กœ, MobileNet๊ณผ ResNet์— ์ด ๋ฐฉ๋ฒ•์„ ์ ์šฉ์‹œ์ผœ๋ด„์œผ๋กœ์จ ํšจ์œจ์„ฑ์„ ํ…Œ์ŠคํŠธํ•œ๋‹ค.

  • EfficientNet 
    • 'Neural Architecture Search(NAS, ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์œผ๋กœ ์ตœ์ ์˜ network๋ฅผ ์ฐพ๋Š” ๋ฐฉ๋ฒ•)'๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ baseline network๋ฅผ ์„ค๊ณ„ํ•˜์˜€์œผ๋ฉฐ ์ด baeline network๋ฅผ scale up ํ•œ ๊ฐ€์กฑ ๋ชจ๋ธ์ธ ํ•˜์˜€๋‹ค. 
    • EfficientNet-B7 :  ImageNet dataset์— ๋Œ€ํ•ด 84.4%(top-1 acc)/ 97.1%(top-5 acc)๋ฅผ ์–ป์—ˆ์„ ์ •๋„๋กœ ๋งค์šฐ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š”๋ฐ, ์ด๋Š” ์ตœ์‹  ConvNet๋ณด๋‹ค 8.4๋ฐฐ ์ž‘์œผ๋ฉฐ, 6.1๋ฐฐ ๋น ๋ฅธ ์„ฑ๋Šฅ์ด๋‹ค.

INTRODUCTION

Several common techniques are used to enhance ConvNet performance:

🔹 Scaling Up Models by Increasing Layers

  • Increasing the number of layers has been a widely adopted method.
  • ResNet (He, 2016) improved accuracy by scaling from ResNet-18 to ResNet-200 through deeper layers.
  • GPipe (Huang, 2018) scaled a baseline model 4x larger and achieved 84.3% top-1 accuracy on ImageNet.
  • However, an optimal scaling approach for ConvNets remains poorly understood.

🔹 Alternative Scaling Methods

  • Increasing depth (more layers) or width (more channels per layer) are the most common scaling approaches.
  • Adjusting a model’s input image resolution is a lesser-known but increasingly popular technique.
  • Some approaches attempt to combine multiple scaling dimensions, but historically, most methods have focused on modifying only one at a time.

Author’s Motivation

🔹 Compound Scaling Method

  • "Is there a theoretically grounded method to scale up ConvNets for better performance?"
  • The authors argue that balancing depth, width, and resolution is critical for performance improvement.
  • They demonstrate that this balance can be determined using simple constant ratios.
  • Unlike previous heuristic approaches, EfficientNet scales network dimensions uniformly, rather than making arbitrary adjustments to individual factors.

🚀 Conclusion:
The EfficientNet scaling approach introduces a principled method for improving ConvNet performance, leading to better efficiency and accuracy compared to conventional scaling practices.




For example, if we want to scale a model so that it uses 2^N times more computational resources, we scale the baseline network by adjusting:

  • Depth by α^N,
  • Width by β^N,
  • Image size by γ^N.

A small grid search on the baseline network finds the optimal constants α, β, and γ that satisfy these conditions.
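The small grid search described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual search code; the grid range, FLOPs budget, and tolerance are assumptions:

```python
import itertools

def grid_search_coefficients(budget=2.0, tol=0.1, step=0.05, steps=11):
    """Enumerate candidate (alpha, beta, gamma) scaling bases and keep
    those whose FLOPs multiplier alpha * beta^2 * gamma^2 stays within
    `tol` of the target budget (~2x per unit of the compound coefficient)."""
    grid = [round(1.0 + i * step, 2) for i in range(steps)]  # 1.0 .. 1.5
    return [
        (a, b, c)
        for a, b, c in itertools.product(grid, repeat=3)
        if abs(a * b * b * c * c - budget) <= tol
    ]

candidates = grid_search_coefficients()
# The bases reported in the paper (alpha=1.2, beta=1.1, gamma=1.15)
# satisfy 1.2 * 1.1^2 * 1.15^2 ≈ 1.92, close to the budget of 2.
```

In practice each candidate triple would be scored by training a small scaled model; the filter above only narrows the search to triples that respect the FLOPs budget.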

The figure below illustrates the concept of compound scaling.

 

Intuitively, compound scaling is expected to yield good results because:

  • When the input image size increases, the network needs to acquire a larger receptive field to capture a broader area.
  • More channels are required to extract refined patterns effectively.

Additionally, in this paper, the authors quantitatively analyze, for the first time, the relationship between network width, depth, and resolution.

Since performance improvements from model scaling heavily depend on the baseline network, the authors use Neural Architecture Search (NAS) to establish an optimal baseline network.


COMPOUND MODEL SCALING


Model scaling refers to expanding the depth, width, and resolution of an existing baseline network, rather than designing an optimal architecture from scratch, as other ConvNet design approaches do.

To narrow the design space, the authors propose a uniform scaling strategy where all layers are scaled proportionally.

This approach formulates an optimization problem aimed at maximizing accuracy within limited computational resources.

Mathematically, this can be expressed as follows:
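The equation that belongs here (the optimization problem, reconstructed from Eq. 2 of the paper) is:

$$
\begin{aligned}
\max_{d,\,w,\,r}\quad & \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big) \\
\text{s.t.}\quad & \mathcal{N}(d, w, r) = \bigodot_{i=1 \ldots s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\big(X_{\langle r \cdot \hat{H}_i,\, r \cdot \hat{W}_i,\, w \cdot \hat{C}_i \rangle}\big) \\
& \mathrm{Memory}(\mathcal{N}) \le \text{target\_memory} \\
& \mathrm{FLOPs}(\mathcal{N}) \le \text{target\_flops}
\end{aligned}
$$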

  • F, L, H, W, and C (the per-stage layer operator, layer count, and input tensor dimensions) are fixed by the baseline network.
  • w, d, and r are the scaling coefficients applied to the network.




The most critical challenge is that the optimal scaling coefficients (d, w, r) are interdependent and are subject to different resource constraints.

As a result, conventional ConvNets have typically scaled only one of the following dimensions:

  1. Depth (d): Increasing the number of layers (e.g., ResNet-101 → ResNet-1000)
  2. Width (w): Expanding the number of channels per layer
  3. Resolution (r): Enlarging the input image size from M×M to rM × rM

While increasing any single coefficient can improve performance, the performance gains are inherently limited when scaling only one dimension.
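As a rough sketch of what scaling each dimension does to a baseline stage (the helper and its rounding behavior are assumptions for illustration):

```python
import math

def scale_config(base_layers, base_channels, base_resolution,
                 d=1.0, w=1.0, r=1.0):
    """Apply depth (d), width (w), and resolution (r) multipliers to a
    baseline stage configuration, rounding up to integer sizes."""
    return {
        "layers": math.ceil(base_layers * d),          # depth: more layers
        "channels": math.ceil(base_channels * w),      # width: more channels
        "resolution": math.ceil(base_resolution * r),  # input image size
    }

# Depth-only scaling, e.g. doubling the layers of a stage:
print(scale_config(18, 64, 224, d=2.0))
# -> {'layers': 36, 'channels': 64, 'resolution': 224}
```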


What is Compound Scaling?

For high-resolution images, increasing the network depth is crucial to obtain a larger receptive field.

Additionally, to extract refined features from high-resolution images, it is necessary to increase the network width as well.

Due to this interdependency, the authors argue that scaling only one of the coefficients (d, w, r) is not sufficient and that a balanced scaling approach is required.

The experimental results (Fig.3) demonstrate how performance changes when width is varied while keeping depth and resolution fixed.

Furthermore, the results show that width scaling brings greater accuracy improvements when depth and resolution are increased at the same time.


 

Therefore, it becomes evident that balancing the d, w, and r coefficients is crucial for effective ConvNet scaling.

Although previous attempts have been made to balance scaling, they often required numerous manual adjustments, making real-world application challenging.

To address this, the authors propose the compound scaling method, which uses a compound coefficient to uniformly adjust network width, depth, and resolution, ensuring a more systematic and efficient scaling process.
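Concretely, the compound scaling rule (Eq. 3 in the paper) ties all three dimensions to a single compound coefficient φ:

$$
\begin{aligned}
\text{depth: } & d = \alpha^{\phi} \\
\text{width: } & w = \beta^{\phi} \\
\text{resolution: } & r = \gamma^{\phi} \\
\text{s.t. } & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
$$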



In the equation above, φ (phi) is a user-defined compound coefficient that determines how much additional computational resource is available. The constants α (alpha), β (beta), and γ (gamma) are found through a small grid search.

Notably, the FLOPs (floating-point operations) of a convolution layer scale linearly with depth d and quadratically with width w and resolution r, i.e., proportionally to d · w² · r².

Since convolution operations dominate ConvNet computation, the total FLOPs of a scaled ConvNet grow approximately as (α · β² · γ²)^φ.

By constraining α · β² · γ² ≈ 2, the total FLOPs grow by approximately 2^φ, keeping computational scaling controlled and predictable.
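A quick numeric sanity check of this FLOPs relationship, using the bases the paper reports for EfficientNet-B0 (the helper name is ours):

```python
def flops_multiplier(alpha, beta, gamma, phi):
    """Scaling depth by alpha^phi, width by beta^phi, and resolution by
    gamma^phi multiplies convolution FLOPs by (alpha * beta^2 * gamma^2)^phi,
    since FLOPs grow linearly in depth and quadratically in width/resolution."""
    return (alpha * beta**2 * gamma**2) ** phi

# Paper-reported bases: alpha=1.2, beta=1.1, gamma=1.15
for phi in (1, 2, 3):
    # (1.2 * 1.1^2 * 1.15^2)^phi ≈ 1.92^phi, i.e. roughly 2^phi
    print(phi, round(flops_multiplier(1.2, 1.1, 1.15, phi), 3), 2**phi)
```

Because 1.2 · 1.1² · 1.15² ≈ 1.92 rather than exactly 2, each increment of φ slightly undershoots a clean doubling of FLOPs, which is why the paper states the relationship as approximate.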




EXPERIMENTS

The table below compares EfficientNet with existing ConvNets that achieve similar Top-1 and Top-5 accuracy.

Across all comparison groups, EfficientNet matches the accuracy of the other models while requiring significantly fewer parameters and FLOPs.

  • EfficientNet uses up to 8.4× fewer parameters
  • EfficientNet requires up to 16× fewer FLOPS

This demonstrates EfficientNet’s superior efficiency in achieving high accuracy with minimal computational cost.




The results below illustrate Class Activation Maps (CAMs) generated under two different scaling approaches:
  1. Adjusting depth, width, and resolution individually
  2. Applying compound scaling, which balances depth, width, and resolution together

The findings show that compound scaling enables more effective feature activation in semantically meaningful regions of the image.

This confirms that compound scaling enhances the network’s ability to focus on important features, leading to better interpretability and improved model performance.




The table below presents the FLOPS and Top-1 Accuracy for the experimental networks used in the previous figure, categorized by different depth (d), width (w), and resolution (r) conditions.

The results clearly demonstrate that compound scaling yields better accuracy even at similar FLOPs, highlighting its efficiency and effectiveness in model scaling.


