I found that a page my earlier project (NHtmlUnit) used to crawl without trouble had started throwing java.net.SocketException: 'Connection reset', and I could not work out why.
java.net.SocketException: 'Connection reset'
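After searching, I read through this discussion thread (article) and found that the likely cause is that NHtmlUnit does not set a TLS version by default, so the server presumably resets the connection during the handshake.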
// Original code, which fails with 'Connection reset'
NHtmlUnit.WebClient client = new NHtmlUnit.WebClient(BrowserVersion.CHROME);
client.Options.AppletEnabled = true;
client.Options.RedirectEnabled = true;
client.Options.JavaScriptEnabled = true;
client.Options.ActiveXNative = true;
client.Options.CssEnabled = true;
client.Options.ThrowExceptionOnScriptError = false;
client.Options.ThrowExceptionOnFailingStatusCode = false;
client.WaitForBackgroundJavaScript(1000);
client.Options.Timeout = 100000;
HtmlPage page = client.GetHtmlPage("https://dictionary.cambridge.org/");
So after adding the TLS protocols to the code, the crawl works normally again.
client.Options.SSLClientProtocols = new String[] { "TLSv1.2", "TLSv1.1" };
The complete code is as follows:
NHtmlUnit.WebClient client = new NHtmlUnit.WebClient(BrowserVersion.CHROME);
client.Options.AppletEnabled = true;
client.Options.RedirectEnabled = true;
client.Options.JavaScriptEnabled = true;
client.Options.ActiveXNative = true;
client.Options.CssEnabled = true;
// Do not abort the crawl on script errors or non-2xx status codes
client.Options.ThrowExceptionOnScriptError = false;
client.Options.ThrowExceptionOnFailingStatusCode = false;
// The fix: explicitly list the TLS versions the client may negotiate
client.Options.SSLClientProtocols = new String[] { "TLSv1.2", "TLSv1.1" };
client.WaitForBackgroundJavaScript(1000);
client.Options.Timeout = 100000;
HtmlPage page = client.GetHtmlPage("https://dictionary.cambridge.org/");
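To double-check the diagnosis outside NHtmlUnit, one option is to probe which TLS versions the server accepts. The sketch below is not part of the original project: it uses .NET's SslStream directly, and the TlsProbe class and Supports helper are hypothetical names chosen for illustration. It assumes the site listens on port 443. Depending on the OS and .NET version, older protocols may be disabled locally, in which case a False result says nothing about the server.

```csharp
using System;
using System.IO;
using System.Net.Security;
using System.Net.Sockets;
using System.Security.Authentication;

class TlsProbe
{
    // Hypothetical helper: returns true if a handshake restricted to
    // the single given protocol version completes successfully.
    static bool Supports(string host, SslProtocols protocol)
    {
        try
        {
            using (var tcp = new TcpClient(host, 443))
            using (var ssl = new SslStream(tcp.GetStream()))
            {
                // No client certificates, no revocation check.
                ssl.AuthenticateAsClient(host, null, protocol, false);
                return true;
            }
        }
        catch (Exception ex) when (ex is AuthenticationException
                                   || ex is IOException
                                   || ex is SocketException)
        {
            // Handshake refused, connection reset, or host unreachable.
            return false;
        }
    }

    static void Main()
    {
        string host = "dictionary.cambridge.org";
        foreach (var p in new[] { SslProtocols.Tls11, SslProtocols.Tls12 })
        {
            Console.WriteLine($"{p}: {Supports(host, p)}");
        }
    }
}
```

If only Tls12 comes back True, listing just "TLSv1.2" in SSLClientProtocols should be enough.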