
Yeah, parsing & cleaning up HTML will always be a heuristic that fails sometimes.

I'd highlight your point no. 5. A very large percentage of articles are published with WordPress or other website builders, so once you support a few of the large platforms your coverage increases drastically.

With Unclutter I also found it helpful to have an automated fallback: if the text content of the page shrinks too much after cleanup, it disables some of the content-blocking methods one by one.
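
Roughly, the fallback idea looks like the sketch below. This is not Unclutter's actual code, just an illustration in Python: the name `clean_with_fallback`, the `min_ratio` threshold of 0.5, and the choice to disable the last method first are all assumptions on my part.

```python
from typing import Callable, List

# Hypothetical type: a blocking method takes page text and returns cleaned text.
Blocker = Callable[[str], str]


def clean_with_fallback(
    text: str, blockers: List[Blocker], min_ratio: float = 0.5
) -> str:
    """Apply all blockers; if too much text is lost, disable them one by one.

    Assumes blockers are ordered from least to most aggressive, so the
    last one in the list is disabled first. min_ratio is an assumed
    threshold: the fraction of the original text length that must survive.
    """
    active = list(blockers)
    while True:
        cleaned = text
        for block in active:
            cleaned = block(cleaned)
        # Accept the result if enough text survived, or if there is
        # nothing left to disable.
        if not active or len(cleaned) >= min_ratio * len(text):
            return cleaned
        active.pop()  # disable the most aggressive remaining method and retry
```

With a harmless blocker and an over-aggressive one that wipes the page, the second gets disabled and the result of the first is kept.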


