
Hierarchical Federated Learning: Architecture, Challenges and Its Implementation in Vehicular Networks

Published: 2023-11-18 (source: contributor submission)

YAN Jintao, CHEN Tan, XIE Bowen, SUN Yuxuan, ZHOU Sheng, NIU Zhisheng

(1. Tsinghua University, Beijing 100084, China;
2. Beijing Jiaotong University, Beijing 100044, China)

Abstract: Federated learning (FL) is a distributed machine learning (ML) framework where several clients cooperatively train an ML model by exchanging the model parameters without directly sharing their local data. In FL, the limited number of participants for model aggregation and communication latency are two major bottlenecks. Hierarchical federated learning (HFL), with a cloud-edge-client hierarchy, can leverage the large coverage of cloud servers and the low transmission latency of edge servers. There are growing research interests in implementing FL in vehicular networks due to the requirements of timely ML training for intelligent vehicles. However, the limited number of participants in vehicular networks and vehicle mobility degrade the performance of FL training. In this context, HFL, which stands out for lower latency, wider coverage and more participants, is promising in vehicular networks. In this paper, we begin with the background and motivation of HFL and the feasibility of implementing HFL in vehicular networks. Then, the architecture of HFL is illustrated. Next, we clarify new issues in HFL and review several existing solutions. Furthermore, we introduce some typical use cases in vehicular networks as well as our initial efforts on implementing HFL in vehicular networks. Finally, we conclude with future research directions.

Keywords: hierarchical federated learning; vehicular network; mobility; convergence analysis

Recently, the evolution of intelligent technologies has given rise to a wide range of emerging applications, including the Internet of Things (IoT), autonomous driving, and so on. While opening up new ways of life for users, these applications also produce numerous data scattered on mobile devices. Transmitting these data to a centralized server for traditional machine learning (ML) is no longer feasible due to limited communication resources, tight latency requirements and stringent privacy concerns. As a result, federated learning (FL) has been proposed as a distributed learning solution, where multiple mobile devices and a parameter server cooperatively train an ML model by only exchanging the model parameters without directly sharing their local data.

In recent years, much work has been done to address the different challenges of FL[1]. Among them, communication efficiency is one of the most important issues[2]. Many FL frameworks consider the cloud server as the parameter server, but the communication between clients and the cloud server is inefficient and unpredictable. Federated edge learning (FEEL)[3], where the clients share the ML model parameters with edge servers, has been proposed to reduce communication latency. However, the edge servers in FEEL have limited coverage and the number of clients for FL training cannot meet the requirements, resulting in degraded training performance. Therefore, it is necessary to characterize the tradeoff between communication latency and training performance.

To deal with this issue, the concept of hierarchical FL (HFL) has been proposed[4–5], which leverages the large coverage of the cloud server and the high communication efficiency of the edge server. This architecture consists of one cloud server, multiple edge servers, and a multitude of clients. In HFL, the clients update their local parameters and send them to the edge servers for edge aggregations as conventional FL does. The difference is that after several rounds of edge aggregations, multiple edge servers send their parameters to a cloud server for cloud aggregation, which allows more clients to be involved in the framework. Experimental results and theoretical analysis have shown that this client-edge-cloud FL architecture has a higher convergence speed and less training time compared with the conventional framework[5].

During the last several years, FL has shown great potential in vehicular networks. The advantage of implementing FL in vehicular networks is twofold. First, FL can satisfy the latency and privacy requirements that applications in vehicular networks, such as trajectory planning and traffic flow optimization, call for. Second, intelligent vehicles have computation and communication capabilities and can sample abundant data for training[6]. There have been many papers on the implementation of FL in vehicular networks. In Ref. [7], an FL-based approach is proposed to allocate the power and resources for ultra-reliable low-latency communications in vehicular networks. In Ref. [8], FL is used to update an edge caching scheme for vehicular networks, which considers the cached content and vehicular mobility. Considering the computation and communication resources and local datasets of vehicles, the authors of Ref. [9] propose a joint vehicle selection and resource allocation scheme for FL training. In Ref. [10], the vehicle speed and position are taken into consideration and an optimization problem is formulated for resource allocation for FL.

However, implementing FL in vehicular networks may be more challenging than in conventional wireless networks[11]. First, ML applications in vehicular networks have more stringent latency requirements. This is because vehicles may leave the coverage of the central server due to mobility before successfully uploading their updated local models to the server. Second, since the physical distances between vehicles are much larger than those between humans or mobile devices, the number of clients participating in model aggregation in vehicular networks is much lower than that in conventional wireless networks, which degrades the convergence performance of FL. In this context, HFL stands out for its lower latency, wider coverage, and more participants, motivating us to explore the possibility of implementing HFL in vehicular networks.

There are some existing surveys and tutorials on FL, as shown in Table 1. In Ref. [1], a comprehensive survey of FL in wireless networks is provided, and research directions including compression and sparsification, convergence analysis, wireless resource management, and FL training method design are presented. The authors of Ref. [2] focus on minimizing the communication and computation latency and introduce the concept of timely edge learning. The key challenges of and solutions to the timeliness issues are discussed. In Ref. [11], the concept of FL is combined with mobile edge computing (MEC) and a comprehensive survey of FL and MEC is provided. In Ref. [12], the implementation of FL in vehicular networks is studied and the major challenges are analyzed from a learning and communication perspective. In this work, we provide a comprehensive review of HFL and explore the feasibility of implementing HFL in vehicular networks.

▼Table 1.Existing surveys on FL

The rest of this paper is organized as follows. The HFL architecture is introduced in Section 2. In Section 3, we clarify the new issues and challenges in HFL compared with FL and provide a review of existing works dealing with these issues. In Section 4, we introduce the typical use cases of HFL in vehicular networks. Section 5 concludes this paper and gives some future research directions in this field.

In a typical HFL system, a cloud, some edges and several clients collaboratively train an ML model. The cloud covers all the edges and each edge covers some of the clients. All of the participants initialize a model with the same parameters and perform cloud epochs. Each cloud epoch is composed of edge learning stages and a cloud aggregation stage. During the edge learning stage, each edge, together with the clients under its coverage, trains the learning model in the way of FL for some iterations. During the cloud aggregation stage, the edges transmit their model parameters or gradients to the cloud. The cloud aggregates the parameters or gradients to update the global model and broadcasts the global model to the edges. The cloud epoch is repeated until the global model converges. The training procedure is illustrated in Fig. 1.

▲Figure 1.Architecture of a hierarchical federated learning system
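The cloud-epoch structure described above can be sketched end to end. Below is a minimal toy simulation under simplifying assumptions that are ours, not the paper's: scalar models, synthetic quadratic client losses, and equal client counts per edge. It runs t1 local gradient steps per edge aggregation, t2 edge aggregations per cloud aggregation, and repeats cloud epochs.

```python
import random

# Toy setup: client (e, c) holds a quadratic loss F(w) = (w - target)^2, so the
# gradient is 2 * (w - target). Edge-dependent target shifts mimic non-iid data.
random.seed(0)
EDGES = 4              # number of edge servers
CLIENTS_PER_EDGE = 5
T1 = 5                 # local updates per edge aggregation
T2 = 3                 # edge aggregations per cloud aggregation
CLOUD_EPOCHS = 20
ETA = 0.05             # learning rate

targets = {(e, c): random.uniform(-1, 1) + e
           for e in range(EDGES) for c in range(CLIENTS_PER_EDGE)}

def grad(w, target):
    return 2.0 * (w - target)

global_w = 0.0
for _ in range(CLOUD_EPOCHS):
    edge_models = []
    for e in range(EDGES):
        # Edge learning stage: T2 rounds of (T1 local steps + edge averaging).
        client_ws = {c: global_w for c in range(CLIENTS_PER_EDGE)}
        for _ in range(T2):
            for c in client_ws:
                for _ in range(T1):
                    client_ws[c] -= ETA * grad(client_ws[c], targets[(e, c)])
            edge_avg = sum(client_ws.values()) / len(client_ws)  # edge aggregation
            client_ws = {c: edge_avg for c in client_ws}
        edge_models.append(edge_avg)
    # Cloud aggregation stage: average edge models and broadcast.
    global_w = sum(edge_models) / len(edge_models)

global_optimum = sum(targets.values()) / len(targets)
print(abs(global_w - global_optimum))
```

With quadratic losses the procedure contracts toward the mean of the client optima at every cloud epoch, mirroring how cloud aggregation averages out edge-level bias.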

Different from clients in FL, which are always connected to the same parameter server, clients in HFL can be associated with different edges during training. First, in cellular networks, the coverage areas of cells generally overlap, so clients in the overlapping area of several edges can be associated with any of them. Second, in wireless communication scenarios, especially in vehicular networks, clients may be moving, which means they can step from the coverage of one edge into that of another during training, while generally staying in the range of the cloud. Therefore, in HFL, edges may need to re-establish connections with clients at the beginning of each iteration.
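As a toy illustration of this re-association step (the 1-D road, edge positions, coverage radius and nearest-edge rule below are all simplifying assumptions, not a scheme from the paper), each client can be re-assigned at the start of an iteration to the closest edge whose coverage still contains it:

```python
# Hypothetical 1-D road with three edge servers; a vehicle re-associates at the
# start of each iteration with the nearest edge whose coverage contains it.
EDGE_POS = [0.0, 100.0, 200.0]  # illustrative edge positions (meters)
COVERAGE = 80.0                 # illustrative coverage radius (meters)

def associate(vehicle_pos):
    """Return the index of the nearest covering edge, or None if uncovered."""
    in_range = [(abs(vehicle_pos - p), i)
                for i, p in enumerate(EDGE_POS) if abs(vehicle_pos - p) <= COVERAGE]
    return min(in_range)[1] if in_range else None

# A vehicle at 60 m lies in the overlap of edges 0 and 1 and picks the nearer one;
# a vehicle at 300 m has left every edge but may still be in the cloud's range.
print(associate(60.0), associate(300.0))
```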

To formulate the training procedure, we assume there are $M$ clients and $N$ edges, and denote $w_m(t)$ as client $m$'s local model parameters at the $t$-th local update. Assume the clients perform $t_1$ local updates before each edge aggregation and the edges perform $t_2$ FL iterations before each cloud aggregation. For client $m$, given the loss function $F_m(\cdot)$, the learning rate $\eta$, and the set $\mathcal{C}_m(t)$ of clients that are associated with the same edge at the $t$-th local update, the local model evolves as follows:

$$
w_m(t) = \begin{cases}
w_m(t-1) - \eta \nabla F_m\big(w_m(t-1)\big), & t \bmod t_1 \neq 0, \\
\dfrac{1}{|\mathcal{C}_m(t)|} \sum\limits_{k \in \mathcal{C}_m(t)} \Big[ w_k(t-1) - \eta \nabla F_k\big(w_k(t-1)\big) \Big], & t \bmod t_1 = 0 \text{ and } t \bmod t_1 t_2 \neq 0, \\
\dfrac{1}{M} \sum\limits_{k=1}^{M} \Big[ w_k(t-1) - \eta \nabla F_k\big(w_k(t-1)\big) \Big], & t \bmod t_1 t_2 = 0.
\end{cases}
$$

That is, a plain local gradient step is taken between aggregations, models within the same edge are averaged every $t_1$ local updates, and all models are averaged at the cloud every $t_1 t_2$ local updates. The timescale of the HFL training procedure is illustrated in Fig. 2.

▲Figure 2.Timescale of hierarchical federated learning (HFL) training

Compared with FL, HFL brings many new research issues, both theoretical and practical. From the theoretical perspective, the convergence analysis for HFL is more complex because of the multi-layer architecture. From the practical perspective, the resource management strategies for HFL should not only focus on allocating the wireless resources under one server, but also arrange resources among different edge servers. Also, the popularity of HFL gives rise to many new considerations, such as HFL with device-to-device (D2D) communications and mobility-aware HFL. We provide a survey on HFL based on these three categories: convergence analysis, resource management, and new considerations of HFL. Note that these three categories may overlap with each other. For instance, the convergence analysis results may be used to design the resource allocation strategy in some works.

3.1 Convergence Analysis

In FL, convergence analysis illustrates how different factors influence the FL training performance, and thus can be used as a guideline for FL system design. In HFL, the convergence analysis is more complex: in FL, the clients only perform local updates before global aggregation, whereas in HFL, edge aggregation is conducted before global aggregation, which results in a looser convergence bound. Many works on convergence analysis for HFL have been done. In Ref. [5], an HFL framework is proposed and the convergence analysis of this framework is provided. By investigating how the distributed weights deviate from the centralized sequence, the authors give an upper bound for the deviation. The results show how the edge and cloud aggregation intervals influence the convergence performance for both convex and non-convex loss functions. Following this work, the authors of Ref. [13] provide a tighter convergence bound. In this work, model quantization is adopted to improve communication efficiency, and the edge and cloud aggregation intervals are optimized based on the theoretical results to improve the training performance. The authors of Ref. [14] assume a graph topology where each edge is considered as a node in the graph and occasionally averages its model parameters with adjacent nodes in a decentralized manner. Furthermore, a probabilistic approach is adopted for analyzing local updates. Convergence analysis of this scenario is then provided, showing the influence of local iterations, edge epochs, cloud epochs, network topology and node heterogeneity on the convergence performance. Ref. [15] is the first work that takes both data heterogeneity and stochastic gradient descent into consideration for convergence analysis. By denoting the client-edge and edge-cloud data divergence, data heterogeneity is connected to the convergence bound and a worst-case upper bound for convergence is provided. The convergence bound shows that local aggregations accelerate the convergence of the global model via a “sandwich” behavior. The results are also extended to the cases in which the grouping is random or there are more than three layers.

However, most of the above papers consider a static topology. In vehicular networks, the mobility of clients may degrade model convergence, which should be taken into consideration. The authors of Ref. [16] propose a mobility-aware HFL framework. First, the HFL framework with mobile clients is modeled by a Markov chain. Then, convergence analysis is provided, showing how user mobility influences training performance. Based on the theoretical analysis, the local update mode and access scheme are modified to reduce the impact of client mobility. Experimental results illustrate that the proposed scheme can outperform the baselines, especially when the data heterogeneity or user mobility is high or the number of users is small.

3.2 Resource Management

Resource management is an important issue in FL. It concerns how the communication bandwidth, power and computing resources are allocated to clients under the coverage of one server. In HFL, there is more than one edge server and new issues arise.

One new issue in HFL is edge association, which determines which clients should be associated with which edge server. In Ref. [17], a joint resource allocation and edge association problem is formulated under HFL. The authors first propose the architecture of HFL and an optimization problem that aims to minimize both latency and energy consumption. Then, this problem is decomposed into two subproblems: a resource allocation problem and an edge association one. The resource allocation problem is proved to be convex and the optimal value can be reached. The edge association problem is solved via an iterative global cost reduction adjustment method. Simulation results show that the proposed scheme can outperform the baselines in terms of FL training performance with low latency and energy consumption. The authors of Ref. [18] focus on the interactions and limited rationality of the clients. A dynamic resource allocation and edge association problem is proposed based on game theory in self-organizing HFL frameworks. The edge association problem is solved via a lower-level evolutionary game and the resource allocation problem is solved via an upper-level Stackelberg differential game. Experiments show that the proposed scheme can well suit the dynamics of the HFL system. In Ref. [19], the effect of data heterogeneity is taken into consideration. The model error and the latency of HFL are first analyzed, and the optimization problem of user association and resource allocation is then formulated under both independent and identically distributed (i.i.d.) and non-i.i.d. settings. For the non-i.i.d. settings, the distance between data distributions is considered and a primal-dual algorithm is proposed to solve the problem. Simulation results show that under both i.i.d. and non-i.i.d. settings, the proposed scheme can outperform the baselines in terms of latency and testing accuracy.
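To make the edge-association subproblem concrete, here is a small greedy heuristic sketch. This is not the algorithm of Ref. [17] or Ref. [18]; the synthetic costs, the per-edge capacity constraint and the regret-based ordering are all illustrative assumptions. Each client is assigned to the cheapest edge that still has capacity, processing clients with the largest regret first.

```python
import random

# Illustrative per-client cost of training under each edge (e.g., a weighted
# sum of latency and energy); the numbers are synthetic.
random.seed(1)
N_EDGES, N_CLIENTS, CAPACITY = 3, 9, 4
cost = [[random.uniform(1.0, 10.0) for _ in range(N_EDGES)]
        for _ in range(N_CLIENTS)]

def regret(c):
    """How much client c loses if denied its cheapest edge."""
    best, second = sorted(cost[c])[:2]
    return second - best

load = [0] * N_EDGES
assignment = {}
# Regret-first greedy: high-regret clients are assigned earlier.
for c in sorted(range(N_CLIENTS), key=regret, reverse=True):
    feasible = [e for e in range(N_EDGES) if load[e] < CAPACITY]
    best_edge = min(feasible, key=lambda e: cost[c][e])
    assignment[c] = best_edge
    load[best_edge] += 1

total_cost = sum(cost[c][e] for c, e in assignment.items())
print(assignment, round(total_cost, 2))
```

A heuristic like this only approximates the optimum; the cited works instead solve the association subproblem with iterative cost-reduction or game-theoretic methods.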

Other issues in HFL include aggregation interval and incentive mechanism design. In Ref. [20], a joint resource allocation and aggregation interval control problem is proposed, aiming to minimize the training loss and the latency. Convergence analysis is provided to show the dependency of the convergence performance on the number of participants, the aggregation interval and the training latency. Then, the original problem is decomposed into two subproblems. The resource allocation problem is proved to be convex and the optimal value can be reached. For the aggregation interval control problem, a rounding and relaxation approach is adopted. Experimental results show that the proposed scheme can reach lower latency and higher training performance compared with the baselines. In Ref. [21], a two-level joint incentive design and resource allocation problem is proposed. At the lower level, the cluster selection problem is formulated as an evolutionary game. At the upper level, the action of the cluster head is solved via a deep learning-based approach. Experiments show the robustness and uniqueness of the proposed scheme.

3.3 New Considerations of HFL

The popularity of HFL gives rise to many novel architectures, such as HFL with device-to-device (D2D) communications. In Ref. [22], a multi-layer hybrid FL framework is proposed. The authors first introduce this new FL architecture, where there are more than three layers. In each layer, clients aggregate the model parameters via D2D communications and then transmit the parameters to the upper layers. Convergence analysis is provided to derive an upper bound for this framework and a distributed control algorithm is proposed to improve the convergence performance. Experimental results show that the proposed framework can utilize the network resources more efficiently without loss of convergence speed or testing accuracy.

▼Table 2.Summary of recent papers on HFL

There are many application scenarios that can benefit from the deployment of HFL in vehicular networks, such as autonomous driving, intelligent transportation systems and smart wireless communications. Recent studies on these scenarios have adopted FL as the training framework of AI models to obtain advantages in higher convergence speed, lower energy consumption and better privacy protection[23]. However, research on applying HFL to vehicular networks is still in its infancy, leaving large room for further study.

In this section, we first introduce several typical use cases of ML in vehicular networks, showing the great potential of HFL. Then we analyze the challenges and opportunities of HFL caused by mobility in vehicular networks. Finally, we show our own work on the implementation of HFL in vehicular networks, taking into account the mobility aspect.

4.1 Typical Use Cases

1) Autonomous driving: Autonomous driving is one of the key technologies in future vehicular networks. Trajectory prediction and path planning are two necessary capabilities of autonomous driving vehicles. To avoid collisions with pedestrians, vehicles and other traffic agents, autonomous driving vehicles must reliably predict the future trajectories of surrounding agents and safely and efficiently plan their own future driving paths[24]. Their decisions are based on the sensing data from onboard cameras, Lidars, GPS, and map information. To meet the stringent latency and precision requirements, ML algorithms have been applied to these two tasks[25–26] and perform better than traditional approaches. However, the traffic environments of vehicular networks vary all the time as vehicles keep driving, which requires vehicles to continually update their ML models with the latest data generated by sensors. HFL is more promising for providing well-trained and up-to-date ML models than centralized ML or conventional FL, since HFL can utilize much more training data generated by a large number of vehicles driving in various areas, which improves the adaptability of ML models to dynamic environments.

2) Intelligent transportation systems (ITS): ITS are novel traffic systems that utilize advanced information technologies to reduce traffic congestion, accident rates, energy consumption and carbon emissions, and thus enhance efficiency, safety, reliability and eco-friendliness[27]. Many typical applications of ITS are critical to future vehicular networks, such as collaborative perception and vehicle platooning. Collaborative perception, where data from multiple traffic agents are collected and fused to conduct object detection, can achieve higher accuracy and precision than single-vehicle perception[28]. Vehicle platooning, where a coordinated group of autonomous vehicles travels collectively, can achieve faster and safer autonomous driving with shorter spacing than single-vehicle traveling[29]. Existing research[28,30] on these use cases also considers applying machine learning methods to achieve better performance. Note that the ML models for ITS tasks usually require vehicles to share data, and the data, such as photographs and videos, can be private and sensitive. However, centralized ML needs to collect the raw data from all vehicles to train an ML model, which leads to heavy communication burdens as well as privacy problems. To reduce unnecessary raw data transmission and the resulting privacy leakage, HFL is a promising paradigm for model training in ITS, since it only collects the lightweight gradient data rather than the heavyweight and private raw data.

3) Smart wireless communications: In smart wireless communications, ML algorithms are utilized in many wireless communication tasks, such as multiple-input, multiple-output (MIMO) beam selection[31], channel modeling and estimation[32], and joint source-channel coding[33]. Compared to traditional wireless communications, ML algorithms designed and exploited for smart wireless communications can decrease communication overhead, improve the signal-to-noise ratio (SNR), and save transmission power, with much lower latency and fewer computing resources. Similar to the use cases of autonomous driving, it is a challenge for ML models to adapt to the dynamic characteristics of channel states in vehicular networks. Therefore, HFL is also a promising training approach for smart wireless communications.

Although the aforementioned use cases have taken ML into account, there are few papers applying HFL to train the ML models for these scenarios in vehicular networks. In fact, HFL can exploit the data and computing resources of more vehicles, and thus train ML models more efficiently than centralized ML. Compared to FL, vehicles from larger areas can bring richer data features to HFL training, which improves the robustness of ML models. Therefore, it is promising to further study the application of HFL in vehicular networks.

4.2 Challenges and Opportunities with Vehicle Mobility

Despite the promising potential of applying HFL to vehicular networks, some properties of vehicular networks may stand as great barriers, in particular mobility. Unlike other FL scenarios where clients stay in the same place or move at a low speed, intelligent vehicles usually travel fast on the road, especially on highways. This brings more dynamics and uncertainties to the topology of vehicles, leading to changes in the association between vehicles and edges. First, vehicles may leave the coverage of an edge while transmitting their model parameters, or even before finishing one round of local updates, leading to a waste of communication and computation resources and the loss of training contributions. Second, the varying channel conditions of vehicular communication links and the Doppler effect caused by vehicle mobility may result in the failure of model transmission or transmission errors in the received parameters, which also influences the FL training performance.
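The first effect, uploads cut off by vehicles leaving coverage, can be illustrated with a toy Monte-Carlo estimate. The coverage length, upload time and uniform-entry assumption below are made up for the sketch; only the qualitative trend (faster vehicles fail more often) carries over.

```python
import random

# Toy model: a vehicle starts its upload at a uniformly random position inside
# an edge's coverage; the upload fails if the vehicle exits before finishing.
random.seed(42)
COVERAGE_M = 500.0     # illustrative edge coverage along the road (meters)
UPLOAD_TIME_S = 8.0    # illustrative time to upload model parameters (seconds)

def failure_rate(speed_mps, trials=10000):
    fails = 0
    for _ in range(trials):
        remaining = random.uniform(0.0, COVERAGE_M)  # distance left in coverage
        dwell = remaining / speed_mps                # time before exiting
        if dwell < UPLOAD_TIME_S:
            fails += 1
    return fails / trials

slow, fast = failure_rate(10.0), failure_rate(30.0)
print(round(slow, 2), round(fast, 2))
```

Analytically, the failure probability here is simply `UPLOAD_TIME_S * speed / COVERAGE_M`, so tripling the speed triples the fraction of wasted uploads.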

However, mobility also brings opportunities. On the one hand, the mobility of vehicles creates more opportunities to meet other vehicles[6], inspiring the use of vehicle-to-vehicle (V2V) communications through side links to compensate for the loss caused by changing edges and also to accelerate edge aggregation. On the other hand, since the hierarchical structure of HFL brings wide coverage, even when vehicles step out of the coverage of an edge, there is a great chance that they still stay in the range of the cloud, so their data can still be used for training. What is more, due to the heterogeneity of clients and the dynamic nature of the road environment, the data distribution generally varies from one edge to another. The mobility of vehicles promotes data fusion across edges and thus reduces data heterogeneity, which helps the global training model converge faster. In the following sections, we give two case studies as examples of leveraging these opportunities.

4.3 Case Study 1: V2V-Assisted Hierarchical Federated Learning

In this case study, we propose a V2V-assisted hierarchical federated learning (VAHFL) framework, where V2V communication is utilized to speed up the aggregation process. In this framework (Fig. 3), the uploading of model parameters includes both vehicle-to-infrastructure (V2I) and V2V communication. Some vehicles act as relay nodes that help other vehicles with parameter transmission. Vehicles leaving the coverage of the central server can transmit their model parameters to nearby relay nodes via V2V links before they leave, while vehicles near the server directly transmit their parameters to the server via V2I links. We formulate a communication latency minimization problem by optimizing the uploading strategy, and a graph neural network-reinforcement learning (GNN-RL) based algorithm is designed to solve this problem.
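A minimal rule-based sketch of the uploading decision follows. This is a hand-written heuristic for illustration only; in VAHFL the uploading strategy is optimized by the GNN-RL algorithm, and all timing constants here are assumptions.

```python
# Illustrative timing constants (assumptions, not measurements).
V2I_UPLOAD_S = 6.0   # time to upload parameters directly to the edge server
V2V_HANDOFF_S = 2.0  # time to pass parameters to a relay over the side link

def choose_uplink(own_dwell_s, relay_dwell_s):
    """Return 'v2i', 'v2v', or 'drop' for one vehicle's model upload.

    own_dwell_s:   time the vehicle still spends inside V2I coverage
    relay_dwell_s: time a candidate relay vehicle still spends in coverage
    """
    if own_dwell_s >= V2I_UPLOAD_S:
        return "v2i"                      # enough time: upload directly
    if (own_dwell_s >= V2V_HANDOFF_S and
            relay_dwell_s >= V2V_HANDOFF_S + V2I_UPLOAD_S):
        return "v2v"                      # relay can finish the upload for us
    return "drop"                         # parameters are lost this round

print(choose_uplink(10.0, 3.0), choose_uplink(3.0, 12.0), choose_uplink(1.0, 12.0))
```

Even this crude rule converts some would-be dropped uploads into relayed ones, which is the effect the latency-minimization problem formalizes.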

An experimental platform is built based on Simulation of Urban Mobility (SUMO) to evaluate the proposed framework, with one cloud server, four edge servers and 200 vehicles. The vehicles move over time according to the Manhattan mobility model. The vehicles cooperatively train a convolutional neural network (CNN) model for an image classification task using the CIFAR-10[34] dataset. The V2I bandwidth is set to 30 MHz, and the V2V bandwidth is set to 10 MHz. For the benchmark, we consider that the vehicles directly transmit their model parameters to the server. Fig. 4 illustrates that the proposed framework can reduce transmission latency by 41.54% and increase the percentage of successful transmissions by 10.97%.

4.4 Case Study 2: Edge-Heterogeneous Hierarchical Federated Learning

In this case study, we investigate the influence of mobility when the training data of edge servers are heterogeneous. Before training, vehicles sample data to form local datasets. The data distribution depends on the location of vehicles, which means vehicles under the coverage of the same edge server sample from the same distribution, while vehicles under the coverage of different edge servers sample differently. Therefore, at the start of training, the data distribution across edges is heterogeneous. During training, vehicles constantly travel across edges, driving the data from different edge servers to mix up. We analyze the convergence speed of this edge-heterogeneous HFL system and prove that mobility accelerates convergence by promoting data fusion.
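The data-fusion effect can be checked with a toy experiment (the class counts, edge layout and swap-based mobility below are illustrative assumptions, not the convergence analysis itself): shuffling vehicles between edges drives each edge's class mixture toward the global mixture, shrinking the average total-variation distance between them.

```python
import random

# Toy data-fusion experiment: 4 edges, 8 classes, 8 vehicles per edge; each
# edge initially holds only two classes. Random swaps of vehicles between
# edges mimic mobility while keeping per-edge vehicle counts fixed.
random.seed(7)
EDGES, CLASSES, PER_EDGE = 4, 8, 8
vehicles = [[e, 2 * e + v % 2] for e in range(EDGES) for v in range(PER_EDGE)]

def edge_divergence(vehicles):
    """Mean total-variation distance between edge mixtures and the global mixture."""
    total = [0] * CLASSES
    per_edge = [[0] * CLASSES for _ in range(EDGES)]
    for e, cls in vehicles:
        total[cls] += 1
        per_edge[e][cls] += 1
    g = [c / len(vehicles) for c in total]
    tv = 0.0
    for counts in per_edge:
        n = sum(counts)
        tv += 0.5 * sum(abs(counts[k] / n - g[k]) for k in range(CLASSES))
    return tv / EDGES

before = edge_divergence(vehicles)
for _ in range(200):                      # mobility: swap two vehicles' edges
    a, b = random.sample(vehicles, 2)
    a[0], b[0] = b[0], a[0]
after = edge_divergence(vehicles)
print(round(before, 2), round(after, 2))
```

The divergence starts at its class-partitioned maximum and drops sharply once vehicles mix, which is the mechanism behind the faster convergence reported below.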

▲Figure 3.Schematic of V2V-assisted hierarchical federated learning framework

▲Figure 4.Latency and the percentage of successful transmission of the proposed scheme and baseline

Experiments are also conducted based on SUMO. We assume one cloud server, four edge servers and 32 vehicles cooperatively train a four-layer CNN on the CIFAR-10 dataset, and we only choose data of eight classes out of the 10 classes for training and inference. Initially, each edge has data of two classes, which are uniformly distributed among the vehicles under the coverage of the edge. During training, vehicles travel according to the Manhattan mobility model with their local datasets unchanged, which leads to changes in the edge data distribution. The network is trained under three settings of vehicle mobility: no mobility, low mobility and high mobility. As Fig. 5 shows, mobility increases the convergence speed and final test accuracy of HFL. What is more, when vehicles are moving, a higher vehicle speed results in a faster convergence speed. As shown by the dashed line and stars in the figure, if we set the target test accuracy to 0.75, the low-mobility and high-mobility scenarios reduce the training epochs by 40.6% and 51.9%, respectively.

This paper presents an overview of HFL and its application in vehicular networks. First, we introduce the background and motivation of HFL and the possibility of implementing it in vehicular networks. Then, the architecture of HFL is presented. Afterward, we discuss the new issues and challenges of HFL compared with FL and review existing solutions. Furthermore, some typical use cases in vehicular networks are introduced and our initial works on implementing HFL in vehicular networks are presented. Apart from the works mentioned above, there are still some challenges and research directions for HFL and its implementation in vehicular networks:

1) Heterogeneous vehicular networks: For HFL in vehicular networks, the participants may be more than just vehicles. Mobile devices and other transportation infrastructure can also participate in model aggregation. In such a case, the network is heterogeneous, i.e., the computing capability, the communication capacity and the mobility patterns of clients in this network are quite different. This brings challenges to FL system design and resource management strategy.

▲Figure 5. Maximum achievable test accuracy of cloud model with different mobility

2) Variation of channel conditions: Due to the high mobility of vehicles, the channel conditions of vehicular communication links may vary rapidly. This may result in the failure of model transmission or transmission errors in the received parameters. Therefore, the communication system should be carefully designed to prevent such cases.

3) Exploration of the benefits of mobility: Usually, mobility is considered a bottleneck for FL implementation and training. However, mobility may also be exploited to enhance FL training performance. In our initial efforts, the convergence speed of edge-heterogeneous HFL is shown to be enhanced by the data fusion brought by vehicle mobility. Apart from that, other benefits of utilizing vehicle mobility are also worth exploring.
