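Two years ago I built a C# MVC project that scraped a vocabulary website for the words I wanted to memorize and stored them in a database. Recently, on a whim, I decided to revive the project and use it again (I'm preparing for the TOEIC).
To my surprise, pages that the old project (built on NHtmlUnit) used to scrape without trouble now throw java.net.SocketException: 'Connection reset', and at first I could not work out why.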
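After some searching I found this discussion thread (article), which suggested that the likely cause is that NHtmlUnit does not set a TLS version by default, so the HTTPS handshake presumably offers only protocol versions the server no longer accepts. This was my original code: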
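// Original setup: no TLS versions are specified, and the request fails with 'Connection reset'.
NHtmlUnit.WebClient client = new NHtmlUnit.WebClient(BrowserVersion.CHROME);
client.Options.AppletEnabled = true;
client.Options.RedirectEnabled = true;
client.Options.JavaScriptEnabled = true;
client.Options.ActiveXNative = true;
client.Options.CssEnabled = true;
// Do not abort on script errors or non-2xx status codes while scraping.
client.Options.ThrowExceptionOnScriptError = false;
client.Options.ThrowExceptionOnFailingStatusCode = false;
client.Options.Timeout = 100000; // timeout in milliseconds
HtmlPage page = client.GetHtmlPage("https://dictionary.cambridge.org/");
// Wait up to 1 second for background JavaScript; this only has an effect once a page is loaded.
client.WaitForBackgroundJavaScript(1000);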
So I added the TLS protocol versions explicitly in the code, and after that the site could be scraped normally again.
client.Options.SSLClientProtocols = new String[] { "TLSv1.2", "TLSv1.1" };
The complete code is as follows:
NHtmlUnit.WebClient client = new NHtmlUnit.WebClient(BrowserVersion.CHROME);
client.Options.AppletEnabled = true;
client.Options.RedirectEnabled = true;
client.Options.JavaScriptEnabled = true;
client.Options.ActiveXNative = true;
client.Options.CssEnabled = true;
// Do not abort on script errors or non-2xx status codes while scraping.
client.Options.ThrowExceptionOnScriptError = false;
client.Options.ThrowExceptionOnFailingStatusCode = false;
// The fix: explicitly list the TLS versions the client may negotiate.
client.Options.SSLClientProtocols = new String[] { "TLSv1.2", "TLSv1.1" };
client.Options.Timeout = 100000; // timeout in milliseconds
HtmlPage page = client.GetHtmlPage("https://dictionary.cambridge.org/");
// Wait up to 1 second for background JavaScript; this only has an effect once a page is loaded.
client.WaitForBackgroundJavaScript(1000);
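If you want to check for yourself which TLS versions the server accepts, a small probe like the one below may help. It is only a rough sketch and not part of the original project: the class name TlsProbe and the helper Accepts are made up for illustration, and it uses .NET's own SslStream rather than the Java networking stack inside NHtmlUnit. That means it can hint at what the server accepts, but it says nothing about NHtmlUnit's defaults, and a "rejected" result can also mean the local operating system has that TLS version disabled.

```csharp
using System;
using System.ComponentModel;
using System.IO;
using System.Net.Security;
using System.Net.Sockets;
using System.Security.Authentication;

// Hypothetical helper, not part of NHtmlUnit: tries one TLS version at a time.
static class TlsProbe
{
    // Returns true if a handshake restricted to the given protocol succeeds.
    static bool Accepts(string host, SslProtocols protocol)
    {
        try
        {
            using (var tcp = new TcpClient(host, 443))
            using (var ssl = new SslStream(tcp.GetStream()))
            {
                // No client certificates, no revocation check; only the protocol is restricted.
                ssl.AuthenticateAsClient(host, null, protocol, false);
                return true;
            }
        }
        catch (Exception ex) when (ex is AuthenticationException
                                   || ex is IOException
                                   || ex is SocketException
                                   || ex is Win32Exception)
        {
            // Handshake refused, connection reset, or protocol unavailable locally.
            return false;
        }
    }

    static void Main()
    {
        const string host = "dictionary.cambridge.org";
        var protocols = new[] { SslProtocols.Tls, SslProtocols.Tls11, SslProtocols.Tls12 };
        foreach (var protocol in protocols)
        {
            string result = Accepts(host, protocol) ? "accepted" : "rejected";
            Console.WriteLine(protocol + ": " + result);
        }
    }
}
```

If only Tls12 comes back as accepted, that would be consistent with the explanation above: a client that does not offer TLS 1.2 gets its connection reset during the handshake.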