Discussion:
[$Bill] extract images from website
Ammammata
2024-04-12 07:49:34 UTC
http://dlib.coninet.it/bookreader.php?&c=1&f=7664&p=7#page/1/mode/1up

This is just an example; it's a sports-only newspaper. Standard access
appears to be blocked behind a mandatory registration, but with a direct
link you can still reach the scans.

I browse it manually by changing the f=<issue> parameter in the link,
since the viewer lacks a previous/next button.

I'd like to know whether it's possible to extract all the scanned pages
with a batch file, saving them to my disk for faster searching.
Something like the naive loop sketched below is what I have in mind.

'f' goes from 1 to 14000, with several unused numbers

Pages are shown at 25%, 50%, 100% and 200%; the last would be best
(I presume the image is always the same and the site resizes it on the
fly).
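
A naive first attempt as a Windows batch file (assuming GNU wget for
Windows is on the PATH) could loop over f directly. Note that this only
fetches the viewer's HTML page, not the scans themselves, so at best it
shows which f values exist:

rem Sketch only: fetch the viewer page for every issue number.
rem bookreader.php returns HTML, not the scanned image itself.
for /L %%F in (1,1,14000) do (
    wget -q -O issue_%%F.html "http://dlib.coninet.it/bookreader.php?&c=1&f=%%F&p=7"
)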

any help is appreciated :)

Go Lakers! Who cares about the final standings? Being in the play-in is
what matters; then you must win them all...
--
/-\ /\/\ /\/\ /-\ /\/\ /\/\ /-\ T /-\
-=- -=- -=- -=- -=- -=- -=- -=- - -=-
........... [ al lavoro ] ...........
Ammammata
2024-04-27 23:12:18 UTC
On Fri 12 Apr 2024 09:49:34, *Ammammata* wrote:
> [original post quoted above, snipped]
$Bill wrote:
> The image URLs look like this:
> http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_178_0001
> Not sure how you'd have to loop through them and ignore the empty ones.


Hi

the sample page corresponds to this link:

http://dlib.coninet.it/bookreader.php?&c=1&f=441#page/1/mode/1up

0001 should be the publication code
1929 is the year
178 is the issue
0001 is the page, I presume

I will create a test list of links with ONE year (1929), all the
available issues (from 1 to 178) and some pages (say 1 to 8; the missing
ones will just return an error); then wget will run the task.
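
A minimal wget invocation for that step could look like this (assuming
GNU wget for Windows and the list saved as test.txt; the --wait pause is
my own precaution, not something the site requires):

rem Fetch every URL listed in test.txt, one per line.
rem --wait=1 pauses a second between requests to go easy on the server.
wget --wait=1 -i test.txt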

thank you for spotting the image links :)
--
/-\ /\/\ /\/\ /-\ /\/\ /\/\ /-\ T /-\
-=- -=- -=- -=- -=- -=- -=- -=- - -=-
........... [ al lavoro ] ...........
Ammammata
2024-04-27 23:35:58 UTC
On Sun 28 Apr 2024 01:12:18, *Ammammata* wrote:
> I will create a test list of links
some Excel VBA:

Sub create_links()

' Sample link:
' http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_178_0001

' Each variable needs its own As clause; with "Dim g, y, i, p As Integer"
' only p would be an Integer and the rest would default to Variant.
Dim g As Integer, y As Integer, i As Integer, p As Integer
Dim g1 As String, y1 As String, i1 As String, p1 As String
Dim image As String, link As String

' https://www.automateexcel.com/vba/write-to-text-file/
Dim FileName As String
FileName = "c:\BibDig\1953-54\test.txt"

Open FileName For Output As #1

For g = 1 To 1                 ' publication code (just 0001 for this test)
    g1 = Format(g, "0000")
    For y = 1929 To 1929       ' year (just 1929 for this test)
        y1 = Format(y, "0000")
        For i = 1 To 178       ' issue number
            i1 = Format(i, "000")
            Debug.Print i1     ' just to check it's working... ;-)
            For p = 1 To 8     ' page number
                p1 = Format(p, "0000")
                image = g1 & "_" & y1 & "_" & i1 & "_" & p1
                link = "http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=" & image
                Print #1, link
            Next p
        Next i
    Next y
Next g

Close #1

End Sub



I ran the above code and got a text file like:

http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_001_0001
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_001_0002
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_001_0003
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_001_0004
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_001_0005
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_001_0006
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_001_0007
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_001_0008
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_002_0001
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_002_0002
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_002_0003
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_002_0004
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_002_0005
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_002_0006
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_002_0007
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_002_0008
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_003_0001
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_003_0002
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_003_0003
http://dlib.coninet.it/view_foto_inside.php?tipo=o&id=0001_1929_003_0004


Then wget started downloading. Missing images are two bytes long and
contain just "no", but the rest is fine (I'll rename the downloaded
files later):

***@tipo=o&id=0001_1929_051_0001          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_051_0002          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_051_0003          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_051_0004          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_051_0005          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_051_0006          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_051_0007          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_051_0008          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_052_0001  2,248,887  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_052_0002  2,257,880  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_052_0003  2,336,223  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_052_0004  2,264,053  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_052_0005          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_052_0006          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_052_0007          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_052_0008          2  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_053_0001  2,148,825  28/04/2024 01:27  -a--
***@tipo=o&id=0001_1929_053_0002  2,157,392  28/04/2024 01:28  -a--
***@tipo=o&id=0001_1929_053_0003  2,374,169  28/04/2024 01:28  -a--
***@tipo=o&id=0001_1929_053_0004  2,366,378  28/04/2024 01:28  -a--
***@tipo=o&id=0001_1929_053_0005  2,403,519  28/04/2024 01:28  -a--
***@tipo=o&id=0001_1929_053_0006  2,310,746  28/04/2024 01:28  -a--
***@tipo=o&id=0001_1929_053_0007  1,913,336  28/04/2024 01:28  -a--
***@tipo=o&id=0001_1929_053_0008  2,113,558  28/04/2024 01:28  -a--
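
Since the placeholder files are exactly two bytes, a batch one-liner
could sweep them away before the renaming pass (a sketch, assuming the
downloads sit alone in the current directory; the wildcard pattern and
the 3-byte threshold are my assumptions):

rem Delete the two-byte "no" placeholders left behind by missing pages.
rem %%~zF expands to the size in bytes of the file in %%F.
for %%F in (*id=0001_1929_*) do if %%~zF LSS 3 del "%%F"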


OK, it's 1:30 AM, time to go to bed :)

next step: pin down exactly what the digits in 0001_1929_178_0001 mean
--
/-\ /\/\ /\/\ /-\ /\/\ /\/\ /-\ T /-\
-=- -=- -=- -=- -=- -=- -=- -=- - -=-
........... [ al lavoro ] ...........