The Great Photo Escape: Freeing Images from Kodak's Digital Prison
Starting in 2011, Kodak brought to market the Kodak Pulse line of digital photo frames. In addition to SD card and USB support, each frame had an email address that could receive image attachments; the images were stored on Kodak’s servers and displayed on the frame hassle-free. While this feature is a boon for people who like to receive photos from friends and family with minimal latency, one very important feature is missing: the ability to download these images in bulk.
While it is possible to manually download each image with a web browser, that is not an acceptable way to back up an album containing thousands of photographs. Rather than download everything by hand, I decided that a programmatic solution must exist. To that end, I created a bulk image crawling script using the Python library Scrapy.
For my scraping plan to work, I needed to make sure the website provided some identifying information for each image. When viewing the photo album, the website defaults to showing a table of thumbnails, and each thumbnail is served from a CloudFront cache with its own unique URL.
<td align="center" valign="middle">
<img
src="http://d2e6lg84of5d93.cloudfront.net/26~t~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg"
alt="0"
id="picture_8ab5b353-73d0-47ea-95a7-12b09521ce09"
rot="RotateNone"
style="display: inline;"
/>
<canvas id="canvas_0" style="display:none;"></canvas>
</td>
When you hover over a thumbnail, the page performs an asynchronous call to load the full-size image. In the case of the example above, the full-size image was also served from CloudFront:
http://d2e6lg84of5d93.cloudfront.net/26~a~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg
Initially, I thought that the similarity may have been a fluke, but as it happens the thumbnail and original images have nearly identical URLs; only the a/t characters change.
http://d2e6lg84of5d93.cloudfront.net/26~t~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg
http://d2e6lg84of5d93.cloudfront.net/26~a~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg
With this information in hand, I knew that I could iterate through each thumbnail on the page, extract the URL, change the t to an a, and have a URL to the original image; the substitution is illustrated below. From here, I started working on the scraper.
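As a quick illustration (this snippet is not part of the spider itself), the example thumbnail URL from above can be turned into the full-size URL with a plain string replacement:
# Swap the ~t~ (thumbnail) marker for ~a~ to get the original image URL
thumbnail_url = "http://d2e6lg84of5d93.cloudfront.net/26~t~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg"
original_url = thumbnail_url.replace('~t~', '~a~')
print(original_url)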
Setting up the initial boilerplate for the scraper was relatively straightforward. First, I created a virtual environment and installed Scrapy.
virtualenv venv --python=python3
source venv/bin/activate
pip install scrapy
Then I created a typical Scrapy directory structure.
kodak-pulse-downloader
├── items.py
├── scrapy.cfg
├── settings.py
└── spiders
    └── KodakPulseSpider.py
This directory structure has four primary files. First, there is KodakPulseSpider.py, which contains all of my scraping logic. Next is items.py, a class file containing the Scrapy Item used to represent the images I was trying to download. Lastly, there are scrapy.cfg and settings.py, which describe the Scrapy project configuration and the spider configuration, respectively.
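For reference, with this flat layout the scrapy.cfg needs little more than a pointer to the settings module; a minimal version (assuming settings.py sits at the project root as shown above) could be as simple as:
[settings]
default = settings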
In Scrapy, there are three main components (a minimal sketch of how they fit together follows the list):
- Spiders, which parse the HTML content served,
- Items, which are object wrappers for the data being scraped, and
- Pipelines, which handle data cleansing and storage for each item parsed.
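To make the relationship concrete, here is a skeletal, self-contained sketch of the three pieces working together; the names (ExampleItem, ExampleSpider, ExamplePipeline, example.com) are placeholders rather than code from this project:
import scrapy

class ExampleItem(scrapy.Item):
    # Item: a typed container for the data being scraped
    url = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    # Spider: parses the HTML responses and yields items
    name = "example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        for href in response.css('a').xpath('@href').extract():
            yield ExampleItem(url=href)

class ExamplePipeline(object):
    # Pipeline: cleans and stores every item the spider yields
    def process_item(self, item, spider):
        return item
In this project the pipeline stage is handled by Scrapy's built-in ImagesPipeline rather than a custom class, as described later.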
Let’s start by looking at the KodakPulseSpider.py file. The scraper has three main steps: first, it must log in to the Kodak Pulse website; then, it must determine how many pages of images are available; and lastly, it must identify each image and formulate the canonical URL of the original image.
I began by creating a new scrapy.Spider class. By default, Scrapy wants to know the name of the spider, and the spider needs to know the URL of Kodak Pulse.
class KodakPulseSpider(scrapy.Spider):
    name = "KodakPulseSpider"
    base_url = "https://www.kodakpulse.com/"
Next, I created a start_requests method to initialize the spider with an authentication request. The base URL accepts a form POST to authenticate with the service, so Scrapy's FormRequest was leveraged to log in. This request uses a callback to continue the scraping process once login has completed, so I created an intermediate logged_in method which navigates the spider to the picture-viewing page.
import scrapy

class KodakPulseSpider(scrapy.Spider):
    name = "KodakPulseSpider"
    base_url = "https://www.kodakpulse.com/"
    page = 1

    def start_requests(self):
        """This function is called before crawling starts."""
        # Login; username and password are supplied on the command line via -a
        return [
            scrapy.FormRequest(self.base_url,
                               formdata={
                                   'username': self.username,  # Set to the username
                                   'password': self.password   # Set to the password
                               },
                               method="post",
                               callback=self.logged_in)]

    def logged_in(self, response):
        # Once logged in, request the picture listing; the default callback (parse) handles it
        return scrapy.Request(self.base_url + '/Frame/AllPictures')
With the basics of authentication out of the way, I implemented the spider's aptly-named default parsing method, parse. I wanted this method to do two things: find, transform, and marshal every image with a CloudFront URL, and keep track of the page number.
def parse(self, response):
    images = response.css('img').xpath('@src').extract()
    image_urls = [image.replace('~t~', '~a~') for image in images
                  if "cloudfront" in image]
    yield ImageItem(image_urls=image_urls)
Finding all image URLs was a straightforward task using Scrapy's CSS and XPath extraction APIs. Once the list of URLs was generated, the spider filters it down to URLs served from CloudFront and transforms each thumbnail URL into the URL of the original image. These URLs are then wrapped in an ImageItem.
def parse(self, response):
    images = response.css('img').xpath('@src').extract()
    image_urls = [image.replace('~t~', '~a~') for image in images
                  if "cloudfront" in image]
    yield ImageItem(image_urls=image_urls)

    # Fewer than 48 thumbnails (a full 6 x 8 grid) means this is the last page
    if len(images) < 48:
        return

    self.page += 1
    next_url = "{base}/Frame/AllPictures?p={page}".format(base=self.base_url,
                                                          page=self.page)
    yield scrapy.Request(next_url, callback=self.parse)
To track pagination, I had the spider count the number of images on the page. The Kodak Pulse website does not dynamically load thumbnails and uses a static layout of images, so if a page contains fewer than 48 images (a full 6 x 8 grid), it must be the last page. Otherwise, the spider increments the page counter and yields a new Scrapy request to be parsed.
Next, I had to make sure that the images would actually be processed. Being a bit lazy, I decided to leverage Scrapy's built-in ImagesPipeline. The ImagesPipeline examines the image_urls field of each item yielded by the scraper and downloads the images to a specified image store. I enabled the ImagesPipeline by setting the following in settings.py.
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images' # directory in which to store the images
This made sure that ImageItems were handled as images and written beneath the path stored in IMAGES_STORE (the ImagesPipeline saves the downloaded files in a full/ subdirectory, naming each one after a hash of its URL). Lastly, I configured the ImageItem to interface properly with the ImagesPipeline by adding an image_urls field to hold the URLs of the images and an images field to hold metadata about the downloaded files.
import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
With all of the parts in place, I gave it a whirl. For this run, I expected 1,374 images to be downloaded.
$ scrapy runspider spiders/KodakPulseSpider.py -a username=<USERNAME> -a password=<PASSWORD>
2017-02-26 14:08:16 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: scrapybot)
...
2017-02-26 14:07:36 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-26 14:07:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 418118,
'downloader/request_count': 1407,
'downloader/request_method_count/GET': 1406,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 171149061,
'downloader/response_count': 1407,
'downloader/response_status_count/200': 1404,
'downloader/response_status_count/302': 3,
'file_count': 1374,
'file_status_count/downloaded': 1374,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 26, 19, 7, 36, 812519),
'item_scraped_count': 1374,
'log_count/DEBUG': 4156,
'log_count/INFO': 8,
'request_depth_max': 29,
'response_received_count': 1404,
'scheduler/dequeued': 33,
'scheduler/dequeued/memory': 33,
'scheduler/enqueued': 33,
'scheduler/enqueued/memory': 33,
'start_time': datetime.datetime(2017, 2, 26, 19, 5, 40, 236626)}
2017-02-26 14:07:36 [scrapy.core.engine] INFO: Spider closed (finished)
According to the spider output, all of the images should have been downloaded. Let’s double check.
$ cd images/full
$ ls -la | wc -l
1374
All images accounted for and back in my control.
If you have images stored in Kodak Pulse and want them to be free, please check out the kodak-pulse-downloader repository!