An Avoidable P1 Incident: An Nginx Change Made Every Gateway Request Return 400
Background
Our project uses SpringCloudGateway as the gateway that receives public-internet request traffic from every business line. Two Nginx servers sit in front of the gateway as reverse proxies. The gateway performs a series of pre-processing steps and then forwards requests to the services of each business line behind it. In brief, the network path is:
Gateway domain (wmg.test.com) -> ... -> Nginx -> F5 (hardware load balancer domain fp.wmg.test) -> Gateway -> Business systems
One day, the team that operates Nginx needed to add two new Nginx machines (the reasons are a long story, so I will skip them) and use them to replace the two Nginx servers that had been reverse-proxying the gateway.
Classified as P1 by SRE
On a dark and windy night, the team that operates Nginx carried out the production change. They deployed Nginx on the two new machines and then asked the network team to switch the gateway domain's traffic to them. The moment the switch completed, people from the business-line teams reported that every API request going through the gateway was returning 400. The Nginx team had the network team switch the gateway domain's traffic back to the original two Nginx servers, and requests through the gateway returned to normal. The outage lasted a little over two minutes, and SRE classified it as P1.
The Nginx team said the configuration on the two new Nginx servers was identical to that on the original two and they could not see anything wrong, so they came to me and asked me to check the gateway for error logs.
That seemed unlikely, I thought: if the two configurations really were identical, the requests would not all be failing with 400. Still, I checked the gateway logs, and there were no error logs in that time window.
I looked at the logs on the new Nginx servers. OPTIONS requests returned 204 as normal, while all other GET and POST requests returned 400. OPTIONS requests are preflight requests and are answered at the Nginx layer. Sample log lines from the new Nginx:
10.x.x.x:63048 > - > 10.x.x.x:8099 > [2025-07-17T10:36:26+08:00] > 10.x.x.x:8099 OPTIONS /api/xxx HTTP/1.1 > 204 > 0 > //domain/ > Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 > - > [req_time:0.000 s] >[upstream_connect_time:- s]> [upstream_header_time:- s] > [upstream_resp_time:- s] [-]
10.x.x.x:63048 > - > 10.x.x.x:8099 > [2025-07-17T10:36:26+08:00] > 10.x.x.x:8099 POST /api/xxx HTTP/1.1 > 400 > 0 > //domain/ > Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 > - > [req_time:0.001 s] >[upstream_connect_time:0.000 s]> [upstream_header_time:0.001 s] > [upstream_resp_time:0.001 s] [10.x.x.x:8082]
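As a rough illustration of how these two log lines can be read mechanically, here is a minimal Python sketch. It assumes the " > "-separated layout shown above (with the User-Agent shortened), and the helper names parse_line and summarize are made up for this example. The point it shows is that the 400 line carries an upstream address and upstream timings, which suggests the 400 came from behind Nginx, while the 204 line has no upstream at all.

```python
import re
from collections import Counter

# Assumed layout: "... METHOD PATH HTTP/x.y > STATUS > ..." and a trailing "[upstream]" field.
REQUEST_RE = re.compile(r"\b([A-Z]+) (\S+) HTTP/[\d.]+ > (\d{3}) > ")
UPSTREAM_RE = re.compile(r"\[([^\[\]]*)\]\s*$")

# Shortened copies of the two sample lines (User-Agent abbreviated for readability).
SAMPLE_LINES = [
    "10.x.x.x:63048 > - > 10.x.x.x:8099 > [2025-07-17T10:36:26+08:00] > "
    "10.x.x.x:8099 OPTIONS /api/xxx HTTP/1.1 > 204 > 0 > //domain/ > Mozilla/5.0 > - > "
    "[req_time:0.000 s] >[upstream_connect_time:- s]> [upstream_header_time:- s] > "
    "[upstream_resp_time:- s] [-]",
    "10.x.x.x:63048 > - > 10.x.x.x:8099 > [2025-07-17T10:36:26+08:00] > "
    "10.x.x.x:8099 POST /api/xxx HTTP/1.1 > 400 > 0 > //domain/ > Mozilla/5.0 > - > "
    "[req_time:0.001 s] >[upstream_connect_time:0.000 s]> [upstream_header_time:0.001 s] > "
    "[upstream_resp_time:0.001 s] [10.x.x.x:8082]",
]


def parse_line(line):
    """Return (method, status, upstream) for one access-log line, or None if it does not match."""
    request = REQUEST_RE.search(line)
    upstream = UPSTREAM_RE.search(line)
    if not request or not upstream:
        return None
    method, _path, status = request.groups()
    return method, int(status), upstream.group(1)


def summarize(lines):
    """Count lines by (method, status, whether an upstream was contacted)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            method, status, upstream = parsed
            counts[(method, status, upstream != "-")] += 1
    return counts


if __name__ == "__main__":
    for key, count in summarize(SAMPLE_LINES).items():
        print(key, count)
```

Run over a real access log, a summary like this makes it easy to see at a glance whether the failing requests ever reached an upstream.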
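I went to the network team. Their traffic-replay equipment showed that the 400 was indeed returned by the gateway and that the requests never reached the business systems behind it. 400 means Bad Request, so I suspected a problem with the request body and asked the network team to pull the packet capture for that time window for analysis. They did not provide it and gave me only the business message parameters. The business payloads of requests going through the gateway are encrypted; running the program locally, I could decrypt the payloads normally, and I reported this back to the Nginx team.
The Nginx team spent some more time trying to locate the problem, still got nowhere, and came back to ask me to help investigate.
Stepping In
I asked for the test environment address so I could first see whether the problem could be reproduced there. A member of the Nginx team said that nothing had been set up or tested in a test environment; another member had made this change directly in production.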
??
I got hold of the new and old Nginx configuration files and compared them, and found differences. The old Nginx configuration for reverse-proxying the gateway was:
server {
    listen 8080;
    server_name wmg.test.com;
    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-Content-Type-Options "nosniff";
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_hide_header host;
        client_max_body_size 100m;
        add_header 'Access-Control-Allow-Origin' "$http_origin" always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS, DELETE, PUT';
        add_header 'Access-Control-Allow-Headers' '...';
        if ($request_method = 'OPTIONS') {
            return 204;
        }
        proxy_pass http://fp.wmg.test:8090;
    }
}
The new Nginx configuration was:
upstream http_gateways {
    server fp.wmg.test:8090;
    keepalive 30;
}
server {
    listen 8080 backlog=512;
    server_name wmg.test.com;
    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-Content-Type-Options "nosniff";
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_hide_header host;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        client_max_body_size 100m;
        add_header 'Access-Control-Allow-Origin' "$http_origin" always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS, DELETE, PUT';
        add_header 'Access-Control-Allow-Headers' '...';
        if ($request_method = 'OPTIONS') {
            return 204;
        }
        proxy_pass http://http_gateways;
    }
}
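The new Nginx gateway proxy configuration differs from the original in two ways:
- It uses an upstream block to configure the gateway's F5 load-balancer address:
  upstream http_gateways{ server fp.wmg.test:8090; keepalive 30; }
- It sets the HTTP protocol version to 1.1 and enables keep-alive connections:
  proxy_http_version 1.1; proxy_set_header Connection "";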
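A comparison like this does not have to be done by eye. The following is a minimal Python sketch using the standard difflib module on trimmed-down copies of the two location blocks. The http:// scheme in the proxy_pass lines is an assumption, and changed_directives is a helper name invented for this example.

```python
import difflib

# Trimmed-down directive lists from the old and new location blocks (scheme assumed to be http).
OLD_LOCATION = '''\
proxy_hide_header host;
client_max_body_size 100m;
proxy_pass http://fp.wmg.test:8090;
'''.splitlines()

NEW_LOCATION = '''\
proxy_hide_header host;
proxy_http_version 1.1;
proxy_set_header Connection "";
client_max_body_size 100m;
proxy_pass http://http_gateways;
'''.splitlines()


def changed_directives(old, new):
    """Return (removed, added) directive lines between two lists of config lines."""
    removed, added = [], []
    for line in difflib.unified_diff(old, new, lineterm="", n=0):
        if line.startswith(("---", "+++", "@@")):
            continue  # skip the diff header and hunk markers
        if line.startswith("-"):
            removed.append(line[1:])
        elif line.startswith("+"):
            added.append(line[1:])
    return removed, added


if __name__ == "__main__":
    removed, added = changed_directives(OLD_LOCATION, NEW_LOCATION)
    print("removed:", removed)
    print("added:", added)
```

Had a diff like this been reviewed before the change, the claim that the two configurations were "identical" would not have survived.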
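I asked the Nginx team to mirror the production setup on a test-environment Nginx using the new configuration:
Nginx: 10.100.8.11, listening on port 9104
Gateway: 10.100.22.48, listening on port 8081
Nginx port 9104 forwards to gateway port 8081, with the following configuration: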
upstream http_gateways {
    server 10.100.22.48:8081;
    keepalive 30;
}
server {
    listen 9104 backlog=512;
    server_name localhost;
    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-Content-Type-Options "nosniff";
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_hide_header host;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        client_max_body_size 100m;
        add_header 'Access-Control-Allow-Origin' "$http_origin" always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS, DELETE, PUT';
        add_header 'Access-Control-Allow-Headers' '...';
        if ($request_method = 'OPTIONS') {
            return 204;
        }
        proxy_pass http://http_gateways;
    }
}
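Reproducing the Problem
Requesting a backend service endpoint through Nginx and the gateway reproduced the problem; the response was 400:
curl -v -X GET http://10.100.8.11:9104/wechat-web/actuator/info
After removing the following two directives, the request returned 200 as expected: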
proxy_http_version 1.1;
proxy_set_header Connection "";
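Blame from Out of Nowhere
I reported this finding to the Nginx team. After looking into it for quite a while, they concluded that the gateway does not support keep-alive connections and that the gateway would have to be modified.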
??
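That could not be right. The gateway has always been released in a rolling fashion: first drain traffic from one machine on the F5, restart the gateway service on that machine, then put it back into rotation on the F5. While traffic is being drained on the F5 there are keep-alive connections present, and every time it takes about 5 minutes before one path's traffic is fully drained.
Fine. I put my own work aside and spent some time proving that the gateway does support keep-alive connections.
On the Nginx machine, I used the command line to request a backend service endpoint through the gateway, specifying a keep-alive connection:
wget -d --header="Connection: keepalive" http://10.100.22.48:8081/wechat-web/actuator/info http://10.100.22.48:8081/wechat-web/actuator/info http://10.100.22.48:8081/wechat-web/actuator/info
Pressing Enter produced the following log: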
Setting --header (header) to Connection: keepalive
DEBUG output created by Wget 1.14 on linux-gnu.
URI encoding = ‘UTF-8’
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
--2025-07-17 13:45:08-- http://10.100.22.48:8081/wechat-web/actuator/info
Connecting to 10.100.22.48:8081... connected.
Created socket 3.
Releasing 0x0000000000c95a90 (new refcount 0).
Deleting unused 0x0000000000c95a90.
---request begin---
GET /wechat-web/actuator/info HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: 10.100.22.48:8081
Connection: keepalive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
transfer-encoding: chunked
Content-Type: application/vnd.spring-boot.actuator.v3+json
Date: Thu, 17 Jul 2025 05:25:34 GMT
---response end---
200 OK
Registered socket 3 for persistent reuse.
Length: unspecified [application/vnd.spring-boot.actuator.v3+json]
Saving to: ‘info’
[ <=> ] 83 --.-K/s in 0s
2025-07-17 13:45:08 (7.75 MB/s) - ‘info’ saved [83]
URI encoding = ‘UTF-8’
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
--2025-07-17 13:45:08-- http://10.100.22.48:8081/wechat-web/actuator/info
Reusing existing connection to 10.100.22.48:8081.
Reusing fd 3.
---request begin---
GET /wechat-web/actuator/info HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: 10.100.22.48:8081
Connection: keepalive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
transfer-encoding: chunked
Content-Type: application/vnd.spring-boot.actuator.v3+json
Date: Thu, 17 Jul 2025 05:25:34 GMT
---response end---
200 OK
Length: unspecified [application/vnd.spring-boot.actuator.v3+json]
Saving to: ‘info.1’
[ <=> ] 83 --.-K/s in 0s
2025-07-17 13:45:08 (9.47 MB/s) - ‘info.1’ saved [83]
URI encoding = ‘UTF-8’
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
Converted file name 'info' (UTF-8) -> 'info' (UTF-8)
--2025-07-17 13:45:08-- http://10.100.22.48:8081/wechat-web/actuator/info
Reusing existing connection to 10.100.22.48:8081.
Reusing fd 3.
---request begin---
GET /wechat-web/actuator/info HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: 10.100.22.48:8081
Connection: keepalive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
transfer-encoding: chunked
Content-Type: application/vnd.spring-boot.actuator.v3+json
Date: Thu, 17 Jul 2025 05:25:34 GMT
---response end---
200 OK
Length: unspecified [application/vnd.spring-boot.actuator.v3+json]
Saving to: ‘info.2’
[ <=> ] 83 --.-K/s in 0s
2025-07-17 13:45:08 (11.1 MB/s) - ‘info.2’ saved [83]
FINISHED --2025-07-17 13:45:08--
Total wall clock time: 0.1s
Downloaded: 3 files, 249 in 0s (9.25 MB/s)
As you can see, the first request created socket 3 and sent Connection: keepalive; it succeeded with HTTP status 200.

The second request reused the first connection (socket 3) and sent Connection: keepalive; it succeeded with HTTP status 200.

The third request also reused the first connection (socket 3) and sent Connection: keepalive; it succeeded with HTTP status 200.
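The same kind of check can be scripted. Below is a minimal, self-contained Python sketch of the idea, not the test I actually ran: it starts a local stand-in HTTP/1.1 server (an assumption made so the example runs anywhere, it is not the gateway), sends three requests over one http.client connection, and records the client source port the server sees. If all three requests arrive from the same source port, they shared one TCP connection. The names InfoHandler and fetch_three_times are invented for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class InfoHandler(BaseHTTPRequestHandler):
    """Local stand-in for the gateway endpoint; answers every GET with 200."""

    protocol_version = "HTTP/1.1"  # required for persistent connections
    seen_client_ports = []

    def do_GET(self):
        # Same client source port on every request means the same TCP connection.
        InfoHandler.seen_client_ports.append(self.client_address[1])
        body = b'{"status":"ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def fetch_three_times():
    """Send three GETs over one connection; return (statuses, client ports seen by the server)."""
    InfoHandler.seen_client_ports = []
    server = ThreadingHTTPServer(("127.0.0.1", 0), InfoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        statuses = []
        for _ in range(3):
            conn.request("GET", "/wechat-web/actuator/info")
            response = conn.getresponse()
            response.read()  # drain the body so the connection can be reused
            statuses.append(response.status)
        conn.close()
        return statuses, list(InfoHandler.seen_client_ports)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    statuses, ports = fetch_three_times()
    print("statuses:", statuses)
    print("connection reused:", len(set(ports)) == 1)
```

Pointing the client at the real gateway address instead of the local server would give the same kind of evidence as the wget log above.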

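So the gateway does support keep-alive connections. I reported this to the Nginx team; they investigated for a while longer and then came back to say they would still have to ask me to investigate and resolve the problem.
Digging Deeper
On the test-environment Nginx machine 10.100.8.11, I used tcpdump to capture the traffic to and from the gateway:
tcpdump -vv -i ens192 host 10.100.22.48 and tcp port 8081 -w /tmp/ng400.cap
I located a request whose HTTP response code was 400. In the capture, the response to the wechat-web/actuator/info request was: HTTP/1.1 400 Bad Request
Looking at the request headers, the value of the Host request header was http_gateways, which caught my attention.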
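To make that observation concrete, here is a small Python sketch that pulls the request line and the Host header out of the raw bytes of an HTTP request, such as a payload exported from a capture. The sample bytes are an illustrative reconstruction based on the values described above, not the actual captured packet, and extract_host is a helper name invented for this example.

```python
def extract_host(raw_request: bytes):
    """Return (request_line, host_header_value) from the raw bytes of an HTTP request."""
    head, _, _body = raw_request.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            # Header names are case-insensitive, so normalise them to lower case.
            headers[name.strip().lower()] = value.strip()
    return lines[0], headers.get("host")


# Illustrative reconstruction of the proxied request (not the real capture).
SAMPLE_REQUEST = (
    b"GET /wechat-web/actuator/info HTTP/1.1\r\n"
    b"Host: http_gateways\r\n"
    b"Accept: */*\r\n"
    b"\r\n"
)

if __name__ == "__main__":
    request_line, host = extract_host(SAMPLE_REQUEST)
    print(request_line)
    print("Host:", host)
```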

Some reading showed that the HTTP/1.1 specification (RFC 2616, section 19.6.1.1) requires HTTP/1.1 requests to carry the Host request header:
- Both clients and servers MUST support the Host request-header.
- A client that sends an HTTP/1.1 request MUST send a Host header.
- Servers MUST report a 400 (Bad Request) error if an HTTP/1.1 request does not include a Host request-header.
- Servers MUST accept absolute URIs.
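A host name in the Host header may contain the special characters . and -, but _ is not allowed by the traditional host name syntax (letters, digits and hyphens, with dots separating labels), which strict parsers enforce.
According to the official Nginx documentation, proxy_set_header has two default settings: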
proxy_set_header Host $proxy_host;
proxy_set_header Connection close;
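So with HTTP/1.1 enabled on the Nginx side, if Host is not set explicitly, Nginx uses $proxy_host. When proxy_pass points at an upstream, $proxy_host is the upstream name, and the upstream name here contains _, which is not a valid Host value. In the old configuration, proxy_pass pointed directly at fp.wmg.test:8090, so the default Host was a valid host name.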
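The reasoning can be sketched in a few lines of Python. This is a rough model under stated assumptions, not Nginx's actual logic: default_proxy_host approximates $proxy_host as the name and port written in proxy_pass (with an assumed http:// scheme), and is_strict_hostname applies the letters-digits-hyphens rule while ignoring IPv6 literals. Both function names are invented for this example.

```python
import re
from urllib.parse import urlsplit

# One or more dot-separated labels made of letters, digits and hyphens (no underscore).
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
HOSTNAME_RE = re.compile(rf"^{_LABEL}(?:\.{_LABEL})*$")


def default_proxy_host(proxy_pass_url):
    """Approximate $proxy_host: the name (and port, if any) written in proxy_pass."""
    return urlsplit(proxy_pass_url).netloc


def is_strict_hostname(host_header):
    """Check the host part of a Host header value against the strict host name rule."""
    host = host_header.rsplit(":", 1)[0] if ":" in host_header else host_header
    return bool(HOSTNAME_RE.match(host))


if __name__ == "__main__":
    for url in ("http://fp.wmg.test:8090", "http://http_gateways", "http://http-gateways"):
        host = default_proxy_host(url)
        print(f"{url} -> Host: {host} (valid: {is_strict_hostname(host)})")
```

Under this model the old proxy_pass target yields a valid Host, the new upstream name does not, and a hyphenated upstream name would.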
One reason HTTP/1.1 requires the Host header is to support virtual hosting of multiple domains on a single IP address, so that a backend can handle requests differently depending on the Host they were sent to.
Older HTTP/1.0 clients assumed a one-to-one relationship of IP addresses and servers; there was no other established mechanism for distinguishing the intended server of a request than the IP address to which that request was directed. The changes outlined above will allow the Internet, once older HTTP clients are no longer common, to support multiple Web sites from a single IP address, greatly simplifying large operational Web servers, where allocation of many IP addresses to a single host has created serious problems.
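So any framework that follows the HTTP/1.1 specification (Tomcat, SpringCloudGateway, ...) responds with 400 when it parses the Host header and finds that it is not in a valid format.
I set up a local test environment and debugged the gateway code. In the apply method of ReactorHttpHandlerAdapter, the class that handles incoming HTTP requests for SpringCloudGateway (it belongs to Spring Framework's reactive web stack), a failure to parse Host results in a 400 response.

The logic of the apply method in ReactorHttpHandlerAdapter is as follows: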
public Mono<Void> apply(HttpServerRequest reactorRequest, HttpServerResponse reactorResponse) {
    NettyDataBufferFactory bufferFactory = new NettyDataBufferFactory(reactorResponse.alloc());
    try {
        ReactorServerHttpRequest request = new ReactorServerHttpRequest(reactorRequest, bufferFactory);
        ServerHttpResponse response = new ReactorServerHttpResponse(reactorResponse, bufferFactory);
        if (request.getMethod() == HttpMethod.HEAD) {
            response = new HttpHeadResponseDecorator(response);
        }
        return this.httpHandler.handle(request, response)
                .doOnError(ex -> logger.trace(request.getLogPrefix() + "Failed to complete: " + ex.getMessage()))
                .doOnSuccess(aVoid -> logger.trace(request.getLogPrefix() + "Handling completed"));
    }
    catch (URISyntaxException ex) {
        if (logger.isDebugEnabled()) {
            logger.debug("Failed to get request URI: " + ex.getMessage());
        }
        reactorResponse.status(HttpResponseStatus.BAD_REQUEST);
        return Mono.empty();
    }
}
SpringCloudGateway logs this kind of protocol violation at debug level. The production log level is info, so no such error was logged.
The Fix
Since HTTP/1.1 requires the Host header, and Nginx falls back to a default value when the configuration does not explicitly set the Host it passes upstream, the fix is to add a directive to the Nginx configuration that overrides the default. According to the Nginx documentation, adding the following directive solves the problem:
proxy_set_header Host $host;
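After adding the directive above to the port 9104 proxy configuration on the test-environment Nginx and sending the request again, it returned 200 as expected.

The complete configuration is: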
upstream http_gateways {
    server 10.100.22.48:8081;
    keepalive 30;
}
server {
    listen 9104 backlog=512;
    server_name wmg.test.com;
    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-Content-Type-Options "nosniff";
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_set_header Host $host;
        proxy_hide_header host;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        client_max_body_size 100m;
        add_header 'Access-Control-Allow-Origin' "$http_origin" always;
        add_header 'Access-Control-Allow-Credentials' 'true' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS, DELETE, PUT';
        add_header 'Access-Control-Allow-Headers' '...';
        if ($request_method = 'OPTIONS') {
            return 204;
        }
        proxy_pass http://http_gateways;
    }
}
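There is more than one solution:
- You can rename the upstream to remove the unsupported _, for example http-gateways or httpgateways
- You can also set Host directly to the domain name (domain): proxy_set_header Host 'domain';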
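Whichever fix is chosen, a small pre-deployment check can catch this class of mistake. The following Python sketch is a deliberately crude lint, not a real Nginx config parser: it treats the whole text as a single scope, assumes proxy_pass uses an http:// or https:// scheme, and flags upstream names containing _ that are proxied to without any proxy_set_header Host override. The function name risky_upstreams and the sample configs are invented for this example.

```python
import re

UPSTREAM_RE = re.compile(r"upstream\s+([^\s{]+)\s*\{")
PROXY_PASS_RE = re.compile(r"proxy_pass\s+https?://([^\s;/:]+)")
HOST_OVERRIDE_RE = re.compile(r"proxy_set_header\s+Host\s+\S+\s*;", re.IGNORECASE)


def risky_upstreams(config_text):
    """Return upstream names with '_' that are used in proxy_pass without a Host override."""
    if HOST_OVERRIDE_RE.search(config_text):
        return []  # crude: any Host override anywhere in the text counts
    upstreams = set(UPSTREAM_RE.findall(config_text))
    targets = PROXY_PASS_RE.findall(config_text)
    return sorted({name for name in targets if name in upstreams and "_" in name})


BROKEN = '''
upstream http_gateways{ server 10.100.22.48:8081; keepalive 30; }
server { location / {
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_pass http://http_gateways;
} }
'''

# The two fixes from the list above: override Host, or rename the upstream.
FIXED_WITH_HOST = BROKEN.replace("proxy_http_version 1.1;", "proxy_set_header Host $host;\n    proxy_http_version 1.1;")
FIXED_WITH_RENAME = BROKEN.replace("http_gateways", "http-gateways")

if __name__ == "__main__":
    print("broken:", risky_upstreams(BROKEN))
    print("host override:", risky_upstreams(FIXED_WITH_HOST))
    print("renamed:", risky_upstreams(FIXED_WITH_RENAME))
```

A check like this is no substitute for running the change in a test environment first, but it is cheap enough to run on every configuration change.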
Summary
This problem reproduces every time in a test environment, so it is not a case of a test scenario that was missed. Take the testing process seriously: many processes look tedious, but they are lessons paid for in blood and tears.
This article comes from Cnblogs (博客園); author: 杜勁松. When reposting, please credit the original link: //www.ywjunkang.com/imadc/p/19002991
