I'm a member of the copyleft FAT Lab and lead R&D at Rocketboom, where I've created Know Your Meme and Mag.ma.
I also teach the Internet Famous Class at Parsons, where your grade depends on your online popularity.
Selected press: NBC, TIME [2], CurrentTV, Gawker, Mashable, TechCrunch, BuzzFeed, ArtNews

Bootstrap your career in data hacking! With Ruby and WWW::Mechanize you can get started collecting data on the web with just a few lines of code.
Download Jdubs’ mechanize scrapers 1.0: simple scraping examples for MySpace, YouTube, and the torrent index BTjunkie.
Techniques for exploring a web page, Ruby & gem installation, and explanations of the simple extractors below.
Need to install Ruby or the mechanize gem? See Installing Ruby.

“span.viewCount” is the video’s view count. That was easy.
Some simple data collectors with file downloading to get you started.
Execute them on the command line with “ruby myspace.rb”, or paste them into irb.
myspace.rb
find the top 20 friends for a given profile, then download all those people’s thumbnails
require 'rubygems'
require 'mechanize'
require 'fileutils'
agent = WWW::Mechanize.new # newer mechanize releases name this class Mechanize
agent.get("http://myspace.com/graffitiresearchlab")
links = agent.page.search('.friendSpace img') # found w/ firebug
FileUtils.mkdir_p 'myspace-images' # make the images dir
links.each_with_index { |link, index|
  url = link['src']
  puts "Saving thumbnail #{url}"
  agent.get(url).save_as("myspace-images/top_friend#{index}_#{File.basename url}")
}
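One thing to watch in the naming step above: File.basename works on the raw URL string, so any query string on the image URL ends up in the saved filename. Here is a minimal sketch of one way around that, assuming the thumbnails have ordinary http URLs. The helper name thumbnail_path and the example URL are made up for illustration and are not part of the original script.

```ruby
require 'uri'

# Hypothetical helper (not in the original myspace.rb): build the local path
# for the thumbnail at position `index`, using only the path part of the URL
# so a trailing "?v=2" style query string is dropped from the filename.
def thumbnail_path(url, index, dir = 'myspace-images')
  name = File.basename(URI.parse(url).path)
  File.join(dir, "top_friend#{index}_#{name}")
end

# Placeholder URL, just to show the shape of the result.
puts thumbnail_path('http://example.com/images/01/m_abc123.jpg?v=2', 0)
```

In the loop, the save_as argument could then be thumbnail_path(url, index) instead of the interpolated string.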
youtube.rb
get the most viewed YouTube videos via the gdata API... and download all of their thumbnails
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'fileutils'
agent = WWW::Mechanize.new
url = "http://gdata.youtube.com/feeds/api/standardfeeds/most_viewed" # all time
page = agent.get(url)
# parse again w/ Hpricot for some XML convenience
doc = Hpricot.parse(page.body)
# pp (doc/:entry) # like "search"; cool division overload
images = (doc/'media:thumbnail') # use strings instead of symbols for namespaces
FileUtils.mkdir_p 'youtube-images' # make the images dir
urls = images.map { |i| i[:url] }
urls.each_with_index do |file, index|
  puts "Saving image #{file}"
  agent.get(file).save_as("youtube-images/vid#{index}_#{File.basename file}")
end
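To try the thumbnail extraction step without hitting the network, here is a rough offline sketch. It swaps in REXML (shipped with standard Ruby installs) for Hpricot, so it is not the script's own parser, and the sample feed below is made up: it only imitates the general shape of an Atom feed carrying media:thumbnail elements, with placeholder URLs.

```ruby
require 'rexml/document'

# Made-up sample feed; element layout and URLs are assumptions for illustration.
sample_feed = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <entry>
      <title>First video</title>
      <media:group>
        <media:thumbnail url="http://example.com/vi/aaa/0.jpg"/>
        <media:thumbnail url="http://example.com/vi/aaa/1.jpg"/>
      </media:group>
    </entry>
    <entry>
      <title>Second video</title>
      <media:group>
        <media:thumbnail url="http://example.com/vi/bbb/0.jpg"/>
      </media:group>
    </entry>
  </feed>
XML

# Hypothetical helper: walk every element in document order and collect the
# url attribute of each one whose prefixed name is media:thumbnail.
def thumbnail_urls(xml)
  urls = []
  REXML::Document.new(xml).each_recursive do |element|
    urls << element.attributes['url'] if element.expanded_name == 'media:thumbnail'
  end
  urls
end

puts thumbnail_urls(sample_feed)
```

Several thumbnails can share a basename such as 0.jpg, which is why the script prefixes each saved file with its index.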
btjunkie.rb
download all the .torrent files on the front page
require 'rubygems'
require 'mechanize'
require 'fileutils'
agent = WWW::Mechanize.new
agent.get("http://btjunkie.org/")
links = agent.page.search('.tor_details tr a')
hrefs = links.map { |m| m['href'] }.select { |u| u =~ /\.torrent$/ } # just links ending in .torrent
FileUtils.mkdir_p('btjunkie-torrents') # keep it neat
hrefs.each { |torrent|
  # name the file after the second-to-last path segment of the link
  filename = "btjunkie-torrents/#{torrent.split('/')[-2]}"
  puts "Saving #{torrent} as #{filename}"
  agent.get(torrent).save_as(filename)
}
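The filtering and naming steps in btjunkie.rb can be checked offline. This is a minimal sketch, assuming the intent is to keep only hrefs ending in .torrent and to name each file after the second-to-last path segment of its URL. The helper names and the example URLs are invented for illustration and say nothing about how the real site lays out its links.

```ruby
# Hypothetical helper: keep only hrefs that end in ".torrent".
# compact drops nil entries, which show up for anchors with no href attribute.
def torrent_links(hrefs)
  hrefs.compact.select { |u| u =~ /\.torrent$/ }
end

# Hypothetical helper: local filename from the second-to-last path segment.
def torrent_filename(url, dir = 'btjunkie-torrents')
  File.join(dir, url.split('/')[-2])
end

# Made-up hrefs standing in for whatever the front page returns.
hrefs = [
  'http://dl.example.org/torrent/Some-Album/abc123/download.torrent',
  'http://example.org/browse/Audio',
  nil,
  'http://dl.example.org/torrent/A-Film/def456/download.torrent'
]

torrent_links(hrefs).each { |u| puts "#{u} -> #{torrent_filename(u)}" }
```

Note that a name built this way has no .torrent extension; appending one is an easy tweak if your torrent client expects it.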
Further reading:
The mechanize docs have examples of filling out and submitting forms, e.g. for logging in or searching.
If you write any fun scrapers or bots with these, let me know.