(ICML 2019) EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

EfficientNet is a highly influential paper that has gained significant attention in the field of image classification due to its outstanding performance.

For projects requiring extensive training time or computational resources, EfficientNet serves as a valuable approach to enhancing ConvNet performance. It provides an efficient and scalable method for training convolutional neural networks while optimizing accuracy and computational cost, making it highly applicable for real-world AI deployment.

🔗 Research Paper: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

📌 Key Resources & Reviews

📖 Paper Review Summaries:

💻 Source Code (PyTorch Implementation):

GitHub: EfficientNet-PyTorch

ABSTRACT

CNNs are usually developed under a fixed resource budget and then scaled up for higher accuracy when more resources become available.

This paper studies model scaling systematically and shows that carefully balancing network depth, width, and resolution leads to better performance.

Authors' Proposal

'Compound coefficient'

A new scaling method for the depth, width, and resolution dimensions that is simple yet highly effective. The authors test its effectiveness by applying it to MobileNets and ResNet.

EfficientNet

The baseline network is designed with Neural Architecture Search (NAS, a reinforcement-learning-based method for finding the best network), and this baseline is then scaled up to obtain a family of models called EfficientNets.
EfficientNet-B7: achieves 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster than the best existing ConvNet.

INTRODUCTION

Several common techniques are used to enhance ConvNet performance:

🔹 Scaling Up Models by Increasing Layers

Increasing the number of layers has been a widely adopted method.
ResNet (He, 2016) improved accuracy by scaling from ResNet-18 to ResNet-200 through deeper layers.
GPipe (Huang, 2018) scaled a baseline model 4x larger and achieved 84.3% top-1 accuracy on ImageNet.
However, an optimal scaling approach for ConvNets remains poorly understood.

🔹 Alternative Scaling Methods

Increasing depth (more layers) or width (more channels per layer) are the most common scaling approaches.
Adjusting a model’s input image resolution is a lesser-known but increasingly popular technique.
Some approaches attempt to combine multiple scaling dimensions, but historically, most methods have focused on modifying only one at a time.

Authors' Motivation

🔹 Compound Scaling Method

"Is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency?"
The authors argue that balancing depth, width, and resolution is critical for performance improvement.
They demonstrate that this balance can be determined using simple constant ratios.
Unlike previous heuristic approaches, EfficientNet scales network dimensions uniformly, rather than making arbitrary adjustments to individual factors.

🚀 Conclusion:
The EfficientNet scaling approach introduces a principled method for improving ConvNet performance, leading to better efficiency and accuracy compared to conventional scaling practices.

For example, to use 2^N times more computational resources, the baseline network is scaled as follows:

Depth by α^N,
Width by β^N,
Image size by γ^N.

Here α, β, and γ are constant coefficients found by a small grid search on the original small model.
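The grid search can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: it assumes the FLOPS constraint α · β² · γ² ≈ 2 described later in this post, and the grid step, upper bound, tolerance, and the `evaluate` callback are all hypothetical choices. In practice `evaluate` would train the scaled model and return its validation accuracy. Here a toy stand-in is used that peaks at the values the paper reports for its baseline (α = 1.2, β = 1.1, γ = 1.15), purely so the example has a known answer.

```python
from itertools import product


def candidate_coefficients(step=0.05, upper=2.0, target=2.0, tol=0.1):
    """Enumerate (alpha, beta, gamma) triples on a grid whose FLOPS factor
    alpha * beta**2 * gamma**2 lies within `tol` of `target`.

    All parameters are illustrative assumptions, not values from the paper.
    """
    n = int(round((upper - 1.0) / step)) + 1
    grid = [round(1.0 + i * step, 2) for i in range(n)]  # 1.0, 1.05, ..., upper
    return [
        (a, b, g)
        for a, b, g in product(grid, repeat=3)
        if abs(a * b**2 * g**2 - target) <= tol
    ]


def grid_search(evaluate, **kwargs):
    """Return the feasible triple with the highest score from `evaluate`."""
    return max(candidate_coefficients(**kwargs), key=evaluate)


def toy_evaluate(triple):
    """Hypothetical stand-in for 'train the scaled model, return accuracy'."""
    a, b, g = triple
    return -(abs(a - 1.2) + abs(b - 1.1) + abs(g - 1.15))


candidates = candidate_coefficients()
best = grid_search(toy_evaluate)
print(len(candidates), "feasible triples; best under the toy score:", best)
```

Filtering by the FLOPS constraint first keeps the search small, since only the surviving triples would need an expensive training run.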

The figure below illustrates the concept of compound scaling.

Intuitively, compound scaling is expected to yield good results because:

When the input image size increases, the network needs to acquire a larger receptive field to capture a broader area.
More channels are required to extract refined patterns effectively.

Additionally, in this paper, the authors quantitatively analyze, for the first time, the relationship between network width, depth, and resolution.

Since performance improvements from model scaling heavily depend on the baseline network, the authors use Neural Architecture Search (NAS) to establish an optimal baseline network.

COMPOUND MODEL SCALING

Unlike conventional ConvNet design, which mostly focuses on finding the best layer architecture, model scaling expands the length (depth), width, and resolution of an existing baseline network without changing its layer operations.

To narrow the design space, the authors propose a uniform scaling strategy where all layers are scaled proportionally.

This approach formulates an optimization problem aimed at maximizing accuracy within limited computational resources.

Mathematically, this can be expressed as follows:

F, L, H, W, and C are determined by the baseline network.
w, d, and r are the scaling coefficients applied to the network.
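Written out in the paper's notation, the optimization problem reads roughly as follows. This is a reconstruction: hats mark the fixed values taken from the baseline network, i indexes its s stages, X is the input tensor of stage i, and target_memory and target_flops stand for the resource budgets.

```latex
\[
\begin{aligned}
\max_{d,\,w,\,r} \quad & \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big) \\
\text{s.t.} \quad & \mathcal{N}(d, w, r) = \bigodot_{i=1 \ldots s}
  \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}
  \big( X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle} \big) \\
& \mathrm{Memory}(\mathcal{N}) \le \text{target\_memory} \\
& \mathrm{FLOPS}(\mathcal{N}) \le \text{target\_flops}
\end{aligned}
\]
```

Only d, w, and r are free variables. The layer operations and the baseline shapes stay fixed, which is what keeps the search space small.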

The most critical challenge is that the optimal scaling coefficients (d, w, r) are interdependent and are subject to different resource constraints.

As a result, conventional ConvNets have typically scaled only one of the following dimensions:

Depth (d): Increasing the number of layers (e.g., ResNet-101 → ResNet-1000)
Width (w): Expanding the number of channels per layer
Resolution (r): Enlarging the input image size from M×M to rM × rM
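To make the three coefficients concrete, here is a minimal sketch of how each one changes a single stage of a network. The `Stage` type, the example numbers, and the use of `math.ceil` for rounding are assumptions made for illustration. Real implementations typically apply their own rounding rules.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    layers: int      # number of repeated layers in the stage (L)
    channels: int    # number of channels per layer (C)
    resolution: int  # input height and width, assumed square (H = W)


def scale_stage(stage, d=1.0, w=1.0, r=1.0):
    """Apply depth (d), width (w), and resolution (r) coefficients to a stage."""
    return Stage(
        layers=math.ceil(d * stage.layers),
        channels=math.ceil(w * stage.channels),
        resolution=math.ceil(r * stage.resolution),
    )


base = Stage(layers=2, channels=24, resolution=112)  # hypothetical baseline stage
print(scale_stage(base, d=2.0))   # deeper only: 4 layers
print(scale_stage(base, w=1.5))   # wider only: 36 channels
print(scale_stage(base, r=1.25))  # higher resolution only: 140 x 140 input
```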

While increasing any single coefficient improves accuracy, the gain diminishes as the model gets bigger, so scaling only one dimension quickly runs into a ceiling.

What is Compound Scaling?

For high-resolution images, increasing the network depth is crucial to obtain a larger receptive field.

Additionally, to extract refined features from high-resolution images, it is necessary to increase the network width as well.

Due to this interdependency, the authors argue that scaling only one of the coefficients (d, w, r) is not sufficient and that a balanced scaling approach is required.

The experimental results (Fig. 3) show how accuracy changes as width is varied under different fixed depth and resolution settings.

Furthermore, the results show that width scaling yields larger accuracy gains when depth and resolution are also increased.

Therefore, it becomes evident that balancing the d, w, and r coefficients is crucial for effective ConvNet scaling.

Although previous attempts have been made to balance scaling, they often required numerous manual adjustments, making real-world application challenging.

To address this, the authors propose the compound scaling method, which uses a compound coefficient to uniformly adjust network width, depth, and resolution, ensuring a more systematic and efficient scaling process.
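In the paper's notation, the compound scaling rule can be written as below, where φ denotes the compound coefficient and α, β, γ are constants. The block is a reconstruction in LaTeX.

```latex
\[
\begin{aligned}
\text{depth:} \quad & d = \alpha^{\phi} \\
\text{width:} \quad & w = \beta^{\phi} \\
\text{resolution:} \quad & r = \gamma^{\phi} \\
\text{s.t.} \quad & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
  \qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
\]
```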

In the equation above, φ (phi) is a user-specified compound coefficient that controls how many more computational resources are available for scaling. The constants α (alpha), β (beta), and γ (gamma) are found through a small grid search and determine how those resources are split across depth, width, and resolution.

Notably, the FLOPS of a convolution operation (used in the paper to mean the number of floating-point operations, not operations per second) increases proportionally to d, w², and r².

Since convolution operations dominate ConvNet computation, scaling a ConvNet with this rule increases its total FLOPS by approximately (α · β² · γ²)^φ.

By constraining α · β² · γ² ≈ 2, the total FLOPS grows by approximately 2^φ for any φ, which keeps the computational cost of scaling predictable.
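The FLOPS argument can be checked numerically with a toy model. The sketch below assumes a network made only of identical 3×3 convolutions with equal input and output channels. The function names and the baseline shape are hypothetical. The α, β, γ values are the ones the paper reports for its baseline network.

```python
def conv_flops(layers, channels, resolution, kernel=3):
    """Multiply-adds of `layers` identical k x k convolutions with `channels`
    input and output channels on a square `resolution` x `resolution` map."""
    per_layer = resolution * resolution * channels * channels * kernel * kernel
    return layers * per_layer


def scaled_flops(layers, channels, resolution, d=1.0, w=1.0, r=1.0):
    """FLOPS after applying depth, width, and resolution coefficients."""
    return conv_flops(d * layers, w * channels, r * resolution)


base = conv_flops(4, 32, 56)  # hypothetical baseline: 4 layers, 32 channels, 56 x 56

print(scaled_flops(4, 32, 56, d=2) / base)  # doubling depth      -> 2.0
print(scaled_flops(4, 32, 56, w=2) / base)  # doubling width      -> 4.0
print(scaled_flops(4, 32, 56, r=2) / base)  # doubling resolution -> 4.0

alpha, beta, gamma = 1.2, 1.1, 1.15  # values reported in the paper
for phi in (1, 2, 3):
    factor = scaled_flops(4, 32, 56, alpha**phi, beta**phi, gamma**phi) / base
    print(f"phi={phi}: FLOPS x{factor:.2f} (target 2^phi = {2**phi})")
```

Because α · β² · γ² is only approximately 2 (about 1.92 for these values), the measured factor drifts slightly below 2^φ as φ grows.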

EXPERIMENTS

The table below compares EfficientNet with existing ConvNets that achieve similar Top-1 and Top-5 accuracy.

Across all categories, EfficientNet consistently outperforms other models by requiring significantly fewer parameters and FLOPS.

EfficientNet uses up to 8.4× fewer parameters
EfficientNet requires up to 16× fewer FLOPS

This demonstrates EfficientNet’s superior efficiency in achieving high accuracy with minimal computational cost.

The results below illustrate Class Activation Maps (CAMs) generated under two different scaling approaches:

Adjusting depth, width, and resolution individually
Applying compound scaling, which balances depth, width, and resolution together

The findings show that compound scaling enables more effective feature activation in semantically meaningful regions of the image.

This suggests that compound scaling helps the network focus on more relevant regions with more object detail, which is consistent with its higher accuracy.

The table below presents the FLOPS and Top-1 Accuracy for the experimental networks used in the previous figure, categorized by different depth (d), width (w), and resolution (r) conditions.

The results clearly demonstrate that applying compound scaling yields better performance even with similar FLOPS, highlighting its efficiency and effectiveness in model scaling.

2/09/2021