Scripting Readability and Markdownify for clipping web pages
I wanted to share a handy tool that I realized I use daily but rarely talk about. I call it Read2Text, but it’s really just a Frankenstein script that combines Python Readability (license) with html2text (license). The combination lets you grab web pages, process them with a port of Arc90’s Readability, and convert the HTML to Markdown, ready for pasting or piping to a text file.
nvALT has this built in, but it’s been a little crashy lately. I find it more reliable to just do this from the command line. If you install it in your path (both the read2text script and the “readability” folder), you can run: read2text http://brettterpstra.com/keybinding-madness/ | pbcopy
You’ll get a Markdown-ified version of the page, with links, image links, headers, code blocks and text intact, but no comments, sidebars, ads, etc. It’s not perfect, but it does a solid job and cleanup only takes me a minute, even on huge sites. I use this most of the time instead of clipping to Evernote these days.
I alias it in my .bash_profile to rtt, and often redirect the output straight to a text file in my nvALT folder: rtt http://grml.org/zsh/zsh-lovers.html > ~/Dropbox/Notes/nvALT2.1/zsh\ lovers.md
Now I have a new note that automatically shows up in nvALT with the text of the zsh-lovers page (yeah, I tried switching to zsh this morning. I’ll have to come back to that). Anyway, I thought others might find this hack of use, so I’m making the download available below.
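If you clip to the notes folder a lot, the alias and the redirect can be rolled into a small shell function. This is only a rough sketch, not part of the download: the function name clipnote and the NOTES_DIR variable are made up for illustration, and it assumes read2text is already on your path as described above.

```shell
# Sketch for ~/.bash_profile. "clipnote" and NOTES_DIR are hypothetical names;
# read2text is assumed to be installed somewhere on your PATH.
alias rtt='read2text'

# Folder nvALT watches for notes; override by exporting NOTES_DIR first.
NOTES_DIR="${NOTES_DIR:-$HOME/Dropbox/Notes/nvALT2.1}"

clipnote() {
  local url="$1" title="$2"
  if [ -z "$url" ] || [ -z "$title" ]; then
    echo "usage: clipnote URL TITLE" >&2
    return 1
  fi
  mkdir -p "$NOTES_DIR" || return 1
  # Quoting the path means titles with spaces need no backslash escaping.
  read2text "$url" > "$NOTES_DIR/$title.md"
}
```

With that in place, clipnote http://grml.org/zsh/zsh-lovers.html "zsh lovers" would write the same note as the redirect above. If read2text fails, an empty file may be left behind, so check the note before trusting it.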
Gather CLI v2.1.6
A Frankensteinian combination of html2text and Arc90 Readability. This command line tool makes it simple to clip web pages into Markdown text without ads and comments.
Published 01/04/12.
Updated 09/18/23. Changelog
By the way, I also have a web service for this. You can get raw Markdown or a nice interface for previewing and copying. There’s also an API and bookmarklets for integration into your favorite browser. Have fun!