October 2021. This is a personal project - a programming exercise in web scraping, content aggregation, and classification.
From web sites, I collect:
  • semantic data
  • rss feeds
  • embedded content
  • images
  • video
  • audio
  • text
  • raw html
  • keywords
and uploads of:
  • images
  • video
  • audio

Retrieving Content

The user interface for retrieving data (as outlined above) uses drag/drop and copy/paste. It works with three types of digital content sources:
1. Platform content
2. User content
3. Web content

Type 1 - Platform content.
  • Source: these platforms provide an HTML string under a 'Share' icon or the icon
    • Viewing these posts with full fidelity requires an account with that platform (e.g. Twitter, Instagram, AllTrails, etc.).
    • These are the embedded content platforms in use... [platform logos omitted]

      All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.


      Examples of embedded content posts: [embedded post previews omitted]


  • RSS feeds are found under the RSS icon
    • RSS feeds have recently become relevant again thanks to Google's impending 'Follow' option, so they may be emerging from years of neglect. Feeds can carry large amounts of information and often have rich UIs.
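
As an illustration of reading a feed, here is a minimal sketch that pulls the title, link and description out of each item in an RSS 2.0 document using only the Python standard library. The sample feed, its URLs and the function name parse_rss_items are made up for this example; a real feed may also use Atom or namespaced extensions, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>Hello</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict (title, link, description) per <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_RSS):
        print(entry["title"], entry["link"])
```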

Type 2 - User content from desktop/phone
Drag/drop or copy/paste the following:
Type 3 - Web content (images/text/web page url/search)


Geek talk: Getting meaningful semantic information by parsing the structures of web pages


semantic: relating to meaning in language or logic.

Working with webpage semantic data

Semantic data is stored within HTML pages using several different methods. Typically more than one method is used within a page. These methods are:
  • ld+json
  • microdata
  • Open Graph
  • HTML header meta tags
  • HTML5 Semantic tags
Much of this semantic data is defined using a vocabulary maintained at Schema.org. Schema.org is a collaborative community project, founded by Google, Microsoft, Yahoo and Yandex, that defines a hierarchy of semantically named items and structures.

All widely used browsers work with pages that carry semantic data.

Why use semantic data?
It is in the interest of web sites to include semantic data if they want customers to find and use them through search engines such as Google and Bing. It is equally in the interest of search engines to have this data, so that they can produce meaningful results for the user.

In addition, other companies (i.e., not search engines) offer services that use this data. Some examples that directly use the semantic structures defined at Schema.org:
  • an audio internet appliance that speaks recipe ingredients and instructions to you, and steps you through the process
  • tourism and hospitality providers that return restaurant opening hours, a menu, location, access to reservations, customer ratings, price range, etc.
  • retail middle-men whose customers are buying a product and want to compare similar products and prices, and to see user ratings, user reviews, product attributes, photos, etc.
  • book club applications

Working with semantic coding methods

ld+json

  • "ld" stands for "Linked Data". Within the HTML code of a web page this linked data is stored in JSON format, using named definitions from the Schema.org vocabulary. It can be found within a "script" tag:
    <script type="application/ld+json">
    json data
    </script>
  • JSON (JavaScript Object Notation) is a lightweight text format for storing structured data; it is a data format, not a protocol.
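
To make this concrete, here is a minimal sketch of pulling every ld+json block out of a page with the Python standard library. The class and function names (LdJsonExtractor, extract_ld_json) and the sample page are assumptions made for this example, not the code the app actually uses.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Pancakes",
 "recipeIngredient": ["flour", "milk", "eggs"]}
</script>
</head><body></body></html>"""


class LdJsonExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_ldjson = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ldjson = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_ldjson:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ldjson:
            self._in_ldjson = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_ld_json(html):
    """Return a list of decoded ld+json objects found in the HTML text."""
    parser = LdJsonExtractor()
    parser.feed(html)
    return parser.items


if __name__ == "__main__":
    for item in extract_ld_json(SAMPLE_HTML):
        print(item.get("@type"), item.get("name"))
```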

Microdata

Microdata is a bit older than ld+json. It is also defined using the vocabulary found at Schema.org. Microdata items are stored within HTML elements (i.e., <div>, <span>, <meta> etc.) by adding an attribute named "itemprop". That attribute:
  • stores the name of the schema item, and
  • is accompanied by the value of the schema item, which is held in a "content" attribute (on <meta> elements) or otherwise taken from the element's text or its href/src attribute.
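
The itemprop lookup can be sketched roughly as follows. This is a simplified illustration, not a full microdata parser: it ignores itemscope nesting, it leniently accepts a "content" attribute on any element, and the names ItempropExtractor and SAMPLE_MICRODATA are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical product markup, used only for illustration.
SAMPLE_MICRODATA = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Trail Shoes</span>
  <meta itemprop="priceCurrency" content="USD">
  <span itemprop="price" content="79.99">$79.99</span>
  <a itemprop="url" href="https://example.com/shoes">details</a>
</div>
"""


class ItempropExtractor(HTMLParser):
    """Collects (itemprop name, value) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.props = []
        self._pending = None  # itemprop name still waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        # Prefer an explicit attribute value when one is present.
        for key in ("content", "href", "src"):
            if attrs.get(key) is not None:
                self.props.append((name, attrs[key]))
                return
        self._pending = name  # otherwise fall back to the element's text

    def handle_data(self, data):
        if self._pending and data.strip():
            self.props.append((self._pending, data.strip()))
            self._pending = None

    def handle_endtag(self, tag):
        self._pending = None  # do not let a value leak past a closing tag


if __name__ == "__main__":
    parser = ItempropExtractor()
    parser.feed(SAMPLE_MICRODATA)
    for name, value in parser.props:
        print(name, "=", value)
```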

Open Graph

Facebook's Open Graph stores its items within meta tags. When someone posts a web link to Facebook, Facebook reads at a minimum three of those meta tags from the linked page.
Within one week of Facebook's initial release of Open Graph, 50,000 web sites had added those meta tags. It has since become the de facto standard: almost all web sites now carry at least these three meta tags, which shows the economic and social power of semantics in linking items.
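
Reading these tags can be sketched as below. Open Graph tags are <meta> elements whose "property" attribute starts with "og:"; the class name OpenGraphExtractor and the sample markup are assumptions for this example.

```python
from html.parser import HTMLParser

# Hypothetical page head, used only for illustration.
SAMPLE_OG_HTML = """<head>
<meta property="og:title" content="A Walk in the Hills">
<meta property="og:image" content="https://example.com/hills.jpg">
<meta property="og:url" content="https://example.com/walk">
<meta name="author" content="Jane Doe">
</head>"""


class OpenGraphExtractor(HTMLParser):
    """Collects og:* properties into a dict, keeping the first value seen."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and attrs.get("content") is not None:
            self.tags.setdefault(prop, attrs["content"])


if __name__ == "__main__":
    parser = OpenGraphExtractor()
    parser.feed(SAMPLE_OG_HTML)
    print(parser.tags)
```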

Meta tags

The oldest HTML tags for storing and exposing metadata within a web page are 'meta' tags, found in the HTML head. These <meta> elements typically store information such as the character set used on the page, the keywords of the document, the viewport, the author, and many provider-specific extensions (such as Facebook's Open Graph tags).
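
A minimal sketch of reading the plain (non Open Graph) meta tags follows. It records the declared character set and every name/content pair; the class name MetaTagExtractor and the sample head are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page head, used only for illustration.
SAMPLE_META_HTML = """<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="Author" content="Jane Doe">
<meta name="keywords" content="hiking, trails">
</head>"""


class MetaTagExtractor(HTMLParser):
    """Collects the charset and all <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.named = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            self.charset = attrs["charset"]
        name = attrs.get("name")
        if name and attrs.get("content") is not None:
            # meta names are case-insensitive, so store them lowercased
            self.named[name.lower()] = attrs["content"]


if __name__ == "__main__":
    parser = MetaTagExtractor()
    parser.feed(SAMPLE_META_HTML)
    print(parser.charset, parser.named)
```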

HTML Semantic tags


Other sources:

Some websites also provide 'keyword' tags embedded within their semantic data. These keywords can be single words or phrases. The app takes them and turns them into referential tags.
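
One way the keyword-to-tag step could look is sketched below. It assumes keywords arrive either as a comma-separated string or as a list, and that a tag is simply the lowercased keyword with whitespace collapsed. The function name keywords_to_tags and these normalization rules are assumptions for the example, not a description of the app's actual tag logic.

```python
def keywords_to_tags(keywords):
    """Turn a comma-separated string or a list of keywords into unique tags."""
    if isinstance(keywords, str):
        parts = keywords.split(",")
    else:
        parts = list(keywords)

    tags = []
    seen = set()
    for part in parts:
        # lowercase and collapse runs of whitespace inside phrases
        tag = " ".join(str(part).lower().split())
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags


if __name__ == "__main__":
    print(keywords_to_tags("Hiking, Trail  Running, hiking,"))
```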


** A note about semantic data for product price and availability:
If you extract the price and availability of a product from a web page (both ld+json and microdata provide this information in a structured way) and intend to republish that data for your own purposes, note that many businesses do not allow you to fetch it once and leave it static. You must be able to refresh the price and availability dynamically any time you 'offer' it. Businesses such as Target, Amazon and Best Buy require this, and they provide a web service for getting the updated information. An example is Target's "Redsky" service.
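
As a rough sketch of the idea, the code below reads the price and availability from an already-decoded ld+json Product object and records when it was fetched, so the data can be refreshed once it is too old. The 15-minute limit, the sample product and the function names are hypothetical; each retailer sets its own rules, and no retailer web service is called here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical decoded ld+json Product, used only for illustration.
SAMPLE_PRODUCT = {
    "@type": "Product",
    "name": "Trail Shoes",
    "offers": {
        "@type": "Offer",
        "price": "79.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}


def extract_offer(product, fetched_at):
    """Return price, currency and availability plus the fetch timestamp."""
    offers = product.get("offers", {})
    if isinstance(offers, list):  # "offers" may hold one Offer or several
        offers = offers[0] if offers else {}
    return {
        "price": offers.get("price"),
        "currency": offers.get("priceCurrency"),
        "availability": offers.get("availability"),
        "fetched_at": fetched_at,
    }


def is_stale(offer, now, max_age=timedelta(minutes=15)):
    """True when the offer is older than max_age and should be refreshed."""
    return now - offer["fetched_at"] > max_age


if __name__ == "__main__":
    offer = extract_offer(SAMPLE_PRODUCT, datetime.now(timezone.utc))
    print(offer["price"], offer["currency"], offer["availability"])
```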

These three things from a given web site:
  • Links
  • Images
  • Text
along with the embedded semantic information as explained above, are pretty much all of the worthwhile information there is on a web page that is not domain specific.
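
Collecting those three things can be sketched with the standard library as follows. The class name PageAssets, the sample page and the base URL are made up for the example; relative URLs are resolved against the base URL, and text inside script and style elements is skipped.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page, used only for illustration.
SAMPLE_PAGE = """<html><head><style>p{color:red}</style></head><body>
<p>Hello <a href="/about">about us</a></p>
<img src="img/logo.png">
<script>var x = 1;</script>
</body></html>"""


class PageAssets(HTMLParser):
    """Collects absolute link URLs, absolute image URLs and visible text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []
        self.text = []
        self._skip = 0  # depth inside <script>/<style>, whose text is ignored

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


if __name__ == "__main__":
    page = PageAssets("https://example.com/blog/")
    page.feed(SAMPLE_PAGE)
    print(page.links, page.images, page.text)
```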


Important:
If an image, text or link on a web page is retrieved via the above frameworks or via RSS, you can assume the owner of that web page has decided to allow those assets to be freely distributed. You must assume all other assets are copyrighted by the owner of that web page, and their rights should be attributed.



I use a partial data download from the Microsoft Research web site in a 'tag' database, and that requires these two attributions:
Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, An Inference Approach to Basic Level of Categorization, in ACM International Conference on Information and Knowledge Management (CIKM), ACM – Association for Computing Machinery, October 2015.

Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Zhu, Probase: A Probabilistic Taxonomy for Text Understanding, in ACM International Conference on Management of Data (SIGMOD), May 2012.
