The Great Photo Escape: Freeing Images from Kodak's Digital Prison
Starting in 2011, Kodak brought to market the Kodak Pulse line of digital photo frames. In addition to SD card and USB support, each frame had an email address that could receive image attachments; the images were stored on Kodak’s servers and displayed on the frame hassle-free. While this feature is a boon for people who like to receive photos from friends and family with minimal latency, one very important feature is missing: the ability to download these images in bulk.
While it is possible to manually download each image with a web browser, that is not an acceptable way to back up an album containing thousands of photographs. Rather than download everything by hand, I decided that a programmatic solution must exist. To that end, I created a bulk image crawling script using the Python library Scrapy.
For my scraping plan to work, I needed to make sure the website provided some identifying information for each image. When viewing the photo album, the website defaults to showing a table of thumbnails, and each thumbnail is served from a CloudFront cache with its own unique URL.
<td align="center" valign="middle">
<img
src="http://d2e6lg84of5d93.cloudfront.net/26~t~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg"
alt="0"
id="picture_8ab5b353-73d0-47ea-95a7-12b09521ce09"
rot="RotateNone"
style="display: inline;"
/>
<canvas id="canvas_0" style="display:none;"></canvas>
</td>
When you hover over a thumbnail, the page performs an asynchronous call to load the full-size image. In the case of the example above, the full-size image was also served from CloudFront:
http://d2e6lg84of5d93.cloudfront.net/26~a~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg
Initially, I thought that the similarity may have been a fluke, but as it happens the thumbnail and original images have nearly identical URLs; only the a/t characters change.
http://d2e6lg84of5d93.cloudfront.net/26~t~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg
http://d2e6lg84of5d93.cloudfront.net/26~a~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg
With this information in hand, I knew that I could iterate through each thumbnail on the page, extract the URL, change the t to an a, and have a URL to the original image; the substitution is illustrated below. From here, I started working on the scraper.
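As a quick illustration (this snippet is not part of the spider itself), the example thumbnail URL from above can be turned into the full-size URL with a plain string replacement:
# Swap the ~t~ (thumbnail) marker for ~a~ to get the original image URL
thumbnail_url = "http://d2e6lg84of5d93.cloudfront.net/26~t~8ab5b353-73d0-47ea-95a7-12b09521ce09~636155450392171000.jpg"
original_url = thumbnail_url.replace('~t~', '~a~')
print(original_url)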
Setting up the initial boilerplate for the scraper was relatively straightforward. First, I created a virtual environment and installed Scrapy.
virtualenv venv --python=python3
source venv/bin/activate
pip install scrapy
Then I created a typical Scrapy directory structure.
kodak-pulse-downloader
├── items.py
├── scrapy.cfg
├── settings.py
└── spiders
    └── KodakPulseSpider.py
This directory structure has four primary files. First, there is KodakPulseSpider.py, which contains all of my scraping logic. Next is items.py, a class file containing the Scrapy Item used to represent the images I was trying to download. Lastly, there are scrapy.cfg and settings.py, which describe the Scrapy project configuration and the spider configuration, respectively.
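For reference, with this flat layout the scrapy.cfg needs little more than a pointer to the settings module; a minimal version (assuming settings.py sits at the project root as shown above) could be as simple as:
[settings]
default = settings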
In Scrapy, there are three main components (a minimal sketch of how they fit together follows the list):
- Spiders, which parse the HTML content served,
- Items, which are object wrappers for the data being scraped, and
- Pipelines, which handle data cleansing and storage for each item parsed.
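To make the relationship concrete, here is a skeletal, self-contained sketch of the three pieces working together; the names (ExampleItem, ExampleSpider, ExamplePipeline, example.com) are placeholders rather than code from this project:
import scrapy

class ExampleItem(scrapy.Item):
    # Item: a typed container for the data being scraped
    url = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    # Spider: parses the HTML responses and yields items
    name = "example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        for href in response.css('a').xpath('@href').extract():
            yield ExampleItem(url=href)

class ExamplePipeline(object):
    # Pipeline: cleans and stores every item the spider yields
    def process_item(self, item, spider):
        return item
In this project the pipeline stage is handled by Scrapy's built-in ImagesPipeline rather than a custom class, as described later.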
Let’s start by looking at the KodakPulseSpider.py file. The scraper has three main steps: first, it must log in to the Kodak Pulse website; then, it must determine how many pages of images are available; and lastly, it must identify each image and formulate the canonical URL of the original image.
I began by creating a new scrapy.Spider class. By default, Scrapy wants to know the name of the spider, and the spider needs to know the URL of Kodak Pulse.
class KodakPulseSpider(scrapy.Spider):
    name = "KodakPulseSpider"
    base_url = "https://www.kodakpulse.com/"
Next, I created a start_requests method to initialize the spider with an authentication request. The base URL accepts a form POST to authenticate with the service, so Scrapy's FormRequest was leveraged to log in. This request uses a callback to continue the scraping process once login has completed, so I created an intermediate logged_in method which navigates the spider to the picture-viewing page.
import scrapy

class KodakPulseSpider(scrapy.Spider):
    name = "KodakPulseSpider"
    base_url = "https://www.kodakpulse.com/"
    page = 1

    def start_requests(self):
        """This function is called before crawling starts."""
        # Login; username and password are supplied on the command line via -a
        return [
            scrapy.FormRequest(self.base_url,
                               formdata={
                                   'username': self.username,  # Set to the username
                                   'password': self.password   # Set to the password
                               },
                               method="post",
                               callback=self.logged_in)]

    def logged_in(self, response):
        # Once logged in, request the picture listing; the default callback (parse) handles it
        return scrapy.Request(self.base_url + '/Frame/AllPictures')
With the basics of authentication out of the way, I implemented the spider's aptly-named default parsing method, parse. I wanted this method to do two things: find, transform, and marshal every image with a CloudFront URL, and keep track of the page number.
def parse(self, response):
    images = response.css('img').xpath('@src').extract()
    image_urls = [image.replace('~t~', '~a~') for image in images
                  if "cloudfront" in image]
    yield ImageItem(image_urls=image_urls)
Finding all image URLs was a straightforward task using Scrapy's CSS and XPath extraction APIs. Once the list of URLs was generated, the spider filters it down to URLs served from CloudFront and transforms each thumbnail URL into the URL of the original image. These URLs are then wrapped in an ImageItem.
def parse(self, response):
    images = response.css('img').xpath('@src').extract()
    image_urls = [image.replace('~t~', '~a~') for image in images
                  if "cloudfront" in image]
    yield ImageItem(image_urls=image_urls)

    # Fewer than 48 thumbnails (a full 6 x 8 grid) means this is the last page
    if len(images) < 48:
        return

    self.page += 1
    next_url = "{base}/Frame/AllPictures?p={page}".format(base=self.base_url,
                                                          page=self.page)
    yield scrapy.Request(next_url, callback=self.parse)
To track pagination, I had the spider count the number of images on the page. The Kodak Pulse website does not dynamically load thumbnails and uses a static layout of images, so if a page contains fewer than 48 images (a full 6 x 8 grid), it must be the last page. Otherwise, the spider increments the page counter and yields a new Scrapy request to be parsed.
Next, I had to make sure that the images would actually be processed. Being a bit lazy, I decided to leverage Scrapy's built-in ImagesPipeline. The ImagesPipeline examines the image_urls field of each item yielded by the scraper and downloads the images to a specified image store. I enabled the ImagesPipeline by setting the following in settings.py.
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images' # directory in which to store the images
This made sure that ImageItems were handled as images and written beneath the path stored in IMAGES_STORE (the ImagesPipeline saves the downloaded files in a full/ subdirectory, naming each one after a hash of its URL). Lastly, I configured the ImageItem to interface properly with the ImagesPipeline by adding an image_urls field to hold the URLs of the images and an images field to hold metadata about the downloaded files.
import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
With all of the parts in place, I gave it a whirl. For this run, I expected 1,374 images to be downloaded.
$ scrapy runspider spiders/KodakPulseSpider.py -a username=<USERNAME> -a password=<PASSWORD>
2017-02-26 14:08:16 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: scrapybot)
...
2017-02-26 14:07:36 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-26 14:07:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 418118,
'downloader/request_count': 1407,
'downloader/request_method_count/GET': 1406,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 171149061,
'downloader/response_count': 1407,
'downloader/response_status_count/200': 1404,
'downloader/response_status_count/302': 3,
'file_count': 1374,
'file_status_count/downloaded': 1374,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 26, 19, 7, 36, 812519),
'item_scraped_count': 1374,
'log_count/DEBUG': 4156,
'log_count/INFO': 8,
'request_depth_max': 29,
'response_received_count': 1404,
'scheduler/dequeued': 33,
'scheduler/dequeued/memory': 33,
'scheduler/enqueued': 33,
'scheduler/enqueued/memory': 33,
'start_time': datetime.datetime(2017, 2, 26, 19, 5, 40, 236626)}
2017-02-26 14:07:36 [scrapy.core.engine] INFO: Spider closed (finished)
According to the spider output, all of the images should have been downloaded. Let’s double check.
$ cd images/full
$ ls -la | wc -l
1374
All images accounted for and back in my control.
If you have images stored in Kodak Pulse and want them to be free, please check out the kodak-pulse-downloader repository!