I found that a page my earlier NHtmlUnit project used to crawl without trouble now throws java.net.SocketException: 'Connection reset', and I could not work out why.
(Screenshot: java.net.SocketException: 'Connection reset')
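After searching and reading the discussion thread in this article, I found that the likely cause is that NHtmlUnit does not specify a TLS version by default. This was the code that failed: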
NHtmlUnit.WebClient client = new NHtmlUnit.WebClient(BrowserVersion.CHROME);
client.Options.AppletEnabled = true;
client.Options.RedirectEnabled = true;
client.Options.JavaScriptEnabled = true;
client.Options.ActiveXNative = true;
client.Options.CssEnabled = true;
client.Options.ThrowExceptionOnScriptError = false;
client.Options.ThrowExceptionOnFailingStatusCode = false;
client.WaitForBackgroundJavaScript(1000);
client.Options.Timeout = 100000;
HtmlPage page = client.GetHtmlPage("https://dictionary.cambridge.org/");
So after adding the TLS protocols to the code, the crawl worked normally again.
client.Options.SSLClientProtocols = new String[] { "TLSv1.2", "TLSv1.1" };
The complete code is as follows:
// Create a client that emulates Chrome
NHtmlUnit.WebClient client = new NHtmlUnit.WebClient(BrowserVersion.CHROME);
client.Options.AppletEnabled = true;
client.Options.RedirectEnabled = true;
client.Options.JavaScriptEnabled = true;
client.Options.ActiveXNative = true;
client.Options.CssEnabled = true;
// Do not abort the crawl on script errors or non-2xx status codes
client.Options.ThrowExceptionOnScriptError = false;
client.Options.ThrowExceptionOnFailingStatusCode = false;
// The fix: explicitly list the TLS versions the client may use
client.Options.SSLClientProtocols = new String[] { "TLSv1.2", "TLSv1.1" };
// Wait up to 1000 ms for background JavaScript tasks
client.WaitForBackgroundJavaScript(1000);
// Connection timeout in milliseconds
client.Options.Timeout = 100000;
HtmlPage page = client.GetHtmlPage("https://dictionary.cambridge.org/");
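
If you want to check whether a TLS version mismatch is really what causes the reset, one option is to probe the site outside of NHtmlUnit. The sketch below is only an illustration and not part of the original project: it uses the standard .NET `SslStream` class, and the `TlsProbe` class and `Supports` helper are names I made up for this example. Note that a "rejected" result can also mean the local machine has that protocol disabled, or that the network failed, so treat the output as a hint rather than proof. On newer .NET versions, `SslProtocols.Tls` and `SslProtocols.Tls11` produce obsolescence warnings.

```csharp
using System;
using System.Net.Security;
using System.Net.Sockets;
using System.Security.Authentication;

// Hypothetical helper for this post: reports which TLS versions a host accepts.
class TlsProbe
{
    // Returns true if a TLS handshake restricted to `protocol` completes.
    // Any failure (handshake refused, protocol disabled locally, network
    // error) is reported as false, so the result is only a rough indicator.
    static bool Supports(string host, int port, SslProtocols protocol)
    {
        try
        {
            using (var tcp = new TcpClient(host, port))
            using (var ssl = new SslStream(tcp.GetStream()))
            {
                // No client certificates, no revocation check.
                ssl.AuthenticateAsClient(host, null, protocol, false);
                return true;
            }
        }
        catch (Exception)
        {
            return false;
        }
    }

    static void Main()
    {
        // Same host as in the crawl above; 443 is the standard HTTPS port.
        const string host = "dictionary.cambridge.org";
        var candidates = new[] { SslProtocols.Tls, SslProtocols.Tls11, SslProtocols.Tls12 };

        foreach (var protocol in candidates)
        {
            string result = Supports(host, 443, protocol) ? "accepted" : "rejected";
            Console.WriteLine($"{protocol}: {result}");
        }
    }
}
```

If only `Tls12` comes back as accepted, that would be consistent with the server dropping connections from a client that offers older protocol versions, which matches the fix of setting `SSLClientProtocols` explicitly.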
