Scraping the Web in Python — IMDb
In this post, we scrape an IMDb web page and extract the name, runtime and genre of each movie using the urllib and Beautiful Soup libraries in Python.
For an understanding of HTML tags, refer to the attached reference.
Layout of the page.
Filter movies by release date between January 1950 and December 2012, ordered by “Number of Votes” descending.
The first results page lists 50 movies out of a total of 3,711,876.
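The filter above can also be written directly as a query string. The following is a minimal sketch, assuming the search lives at https://www.imdb.com/search/title/ and accepts release_date and sort parameters; the endpoint, the parameter names and the helper build_search_url are assumptions for illustration and should be checked against the URL shown in the browser after applying the filter.

```python
from urllib.parse import urlencode

# Assumed search endpoint; confirm it against the address bar in the browser.
BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(start_date, end_date, sort="num_votes,desc"):
    """Build a search URL for a release-date range and a sort order."""
    params = {
        "release_date": f"{start_date},{end_date}",  # assumed parameter name
        "sort": sort,                                # assumed parameter name
    }
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


url = build_search_url("1950-01-01", "2012-12-31")
print(url)
```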
Inspect the HTML and locate the data of interest, i.e. the name, runtime and genre.
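Besides the browser's inspector, Beautiful Soup itself can help with this step. The sketch below walks a hand-written stand-in for a single result and prints every tag with its classes and text. The markup, the class names (lister-item, lister-item-header, runtime, genre) and the movie values are placeholders assumed for illustration; the real page may use different names.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one search result. The class names are an
# assumption about the page markup and should be verified in the browser.
SAMPLE_ITEM = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <a href="/title/tt0000000/">Example Movie A</a>
  </h3>
  <span class="runtime">95 min</span>
  <span class="genre">
Drama            </span>
</div>
"""

soup = BeautifulSoup(SAMPLE_ITEM, "html.parser")
item = soup.find("div", class_="lister-item")

# Walk every tag inside the item and show its name, classes and text,
# which makes it easy to spot where each field lives.
for tag in item.find_all(True):
    print(tag.name, tag.get("class"), repr(tag.get_text(strip=True)))
```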
Import the libraries
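A plausible set of imports for this walkthrough is sketched below. urllib and csv ship with Python; Beautiful Soup is a third-party package (installed with pip install beautifulsoup4). Using the csv module for the output file is an assumption, since the post only says the output goes to a .csv file.

```python
# Standard library: open the URL and read the response.
from urllib.request import Request, urlopen

# Standard library: write the extracted rows to a .csv file.
import csv

# Third-party: parse the downloaded HTML (pip install beautifulsoup4).
from bs4 import BeautifulSoup
```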
Request data from the URL and dump the HTML into a variable page_html
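One way to do this with urllib is sketched below. The helper name fetch_html, the browser-like User-Agent header and the 30-second timeout are assumptions for illustration, not something the page is known to require.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=30):
    """Download a page and return its HTML as a string."""
    # Some sites reject urllib's default user agent, so a browser-like
    # header is sent here (an assumption, adjust as needed).
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling page_html = fetch_html(url) with the address of the results page then leaves the HTML dump in page_html.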
Extract the values from the tags and write the output to a .csv file
Parse the HTML dump, iterate through the items and extract the name, runtime and genre tags. Write the output to a .csv file.
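The parsing and writing step might look like the sketch below. It runs on a small hand-written HTML sample so it works offline. The class names, the placeholder movies, the helper names (text_of, parse_movies, write_csv) and the output file name movies.csv are all assumptions for illustration; in practice the downloaded page HTML would be passed to parse_movies instead of the sample.

```python
import csv

from bs4 import BeautifulSoup

# Hand-written sample standing in for the downloaded page; the class
# names are assumptions about the real markup.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header"><a href="/title/tt0000000/">Example Movie A</a></h3>
  <span class="runtime">95 min</span>
  <span class="genre">
Drama            </span>
</div>
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header"><a href="/title/tt0000001/">Example Movie B</a></h3>
  <span class="runtime">120 min</span>
  <span class="genre">
Action, Crime            </span>
</div>
"""


def text_of(parent, tag_name, class_name):
    """Return the stripped text of the first matching tag, or '' if absent."""
    tag = parent.find(tag_name, class_=class_name)
    return tag.get_text(strip=True) if tag else ""


def parse_movies(page_html):
    """Return a list of (name, runtime, genre) tuples, one per movie."""
    soup = BeautifulSoup(page_html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="lister-item"):
        header = item.find("h3", class_="lister-item-header")
        name = header.a.get_text(strip=True) if header and header.a else ""
        rows.append((name, text_of(item, "span", "runtime"), text_of(item, "span", "genre")))
    return rows


def write_csv(rows, path):
    """Write a header line followed by one line per movie."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "runtime", "genre"])
        writer.writerows(rows)


rows = parse_movies(SAMPLE_HTML)
write_csv(rows, "movies.csv")  # hypothetical output file name
```

Missing tags are written as empty strings rather than raising an error, since not every title is guaranteed to list a runtime or genre.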
Check the contents of the .csv file; it should contain the extracted data.
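A quick way to check the file without opening a spreadsheet is to read the first few rows back with the csv module. This is a minimal sketch; the helper name preview_csv and the default of five rows are assumptions for illustration.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5):
    """Return the header and the first `limit` rows of a .csv file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames          # column names from the first line
        rows = list(islice(reader, limit))  # at most `limit` data rows
    return header, rows
```

For example, header, rows = preview_csv("movies.csv"), where movies.csv is a hypothetical name for the output file, should return the column names followed by one dictionary per movie.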