Scrapy – Uploading image files to Amazon S3

Scrapy has a nice built-in feature to automatically download and store images it comes across while scraping a website. There's some great documentation on how to get started. There's also some undocumented code that allows you to store your images on Amazon S3. Here's how to do that:

In your settings.py file, enable the images pipeline:

    ITEM_PIPELINES = {
        'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
    }

# This is going to be the amazon s3 bucket. 
# You need to use the below format so Scrapy 
# can parse it. !!Important don't forget to add 
# the trailing slash.
IMAGES_STORE = 's3://my-bucket-name/'

IMAGES_EXPIRES = 180 # The number of days before we re-download the image

IMAGES_THUMBS = {
    'small': (50, 50), # You can add as many of these as you want
    'big': (300, 300),
}

AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-access-key'
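Incidentally, the trailing-slash warning above comes from how the contrib S3 store splits the IMAGES_STORE URI into a bucket name and a key prefix. Here's a minimal sketch of that parsing (my reading of the behavior, not a verbatim copy of Scrapy's code):

```python
# Sketch: the S3 store strips the 's3://' scheme and splits bucket
# from key prefix on the first '/'.
uri = 's3://my-bucket-name/'
bucket, prefix = uri[5:].split('/', 1)

print(bucket)  # my-bucket-name
print(prefix)  # '' -- images land at the root of the bucket

# Without the trailing slash there is nothing to split on, the tuple
# unpacking fails, and the pipeline never starts.
```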

For the sake of security, I suggest creating a new user in the Amazon AWS interface and giving that user only read/write privileges to your bucket.

Now we need to install a few packages that didn’t come by default with Scrapy:

pip install pillow
pip install boto

Pillow handles the image manipulation and boto will provide the library that connects to S3.

Scrapy uses the image_urls key in your item to look for images it should download. This should be a list of image URLs. Once downloaded, Scrapy writes the details of the image location to the images key.
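To make that concrete, here's a runnable sketch of the idea. In a real spider you'd collect something like response.css('img::attr(src)').extract() into image_urls; to keep this standalone, the extraction below uses Python's stdlib HTML parser instead, and the page HTML and URL are made up:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.image_urls.append(value)

# Hypothetical page HTML standing in for a real response body:
page = '<html><body><img src="http://example.com/photo.jpg"></body></html>'
collector = ImgSrcCollector()
collector.feed(page)

# This is the shape of the item the images pipeline expects:
item = {'image_urls': collector.image_urls}
print(item)  # {'image_urls': ['http://example.com/photo.jpg']}
```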

Don’t forget to add these fields to your items.py file:

class MyItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

Now don’t forget to actually populate the image_urls key during your crawl. Once you crawl your site, the final output for a given item will look something like this:

'image_urls': [u''],
'images': [{ 'checksum': '264d3bbdffd4ab3dcb8f234c51329da8',
         'path': 'full/069f409fd4cdb02248d726a625fecd8299e6055e.jpg',
         'url': ''}],
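Note that the url field above is the original download URL, while path is the object key inside your bucket. If your bucket objects are publicly readable, you can reconstruct the final S3 address from the bucket name and that path (the bucket name below is the one from IMAGES_STORE; adjust the hostname for your region if needed):

```python
bucket = 'my-bucket-name'  # from IMAGES_STORE = 's3://my-bucket-name/'
path = 'full/069f409fd4cdb02248d726a625fecd8299e6055e.jpg'  # the item's 'path'

# Virtual-hosted-style S3 URL (assumes a public-read bucket in the
# default region):
s3_url = 'https://%s.s3.amazonaws.com/%s' % (bucket, path)
print(s3_url)
```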

Now head on over to your Amazon S3 bucket and have a look. Your images and thumbnails are all there!