Foray into Internet Telecom: VOIP, WebRTC, WebSockets, SIP, SDP, RTP, et al
I've been tinkering with VOIP technologies over the last few weeks in an attempt to find a stand-alone Java or C# client that can speak SIP over WebSockets and negotiate media streams over WebRTC. In my wanderings I found that nothing like that exists today, so I tried my hand at putting one together.
As I got further down the rabbit hole I became acquainted with the host of technologies, protocols and specifications that all need to interlock to some degree to make internet-based telecom happen. By the time I had finished my survey and had the beginnings of a working prototype, I had found an alternate path for my project that was 'good enough'.
It makes sense for me to do a brain dump here so I can come back and refresh my memory if I want to try this type of thing in the future.
References
- SIP & SDP
  - Session Initiation Protocol [wikipedia.org]
  - Session Description Protocol [wikipedia.org]
  - Session Description Protocol RFC [tools.ietf.org]
- RTP, SRTP, RTSP and RTCP
  - What is the relationship between RTP, RTCP and RTSP? [cs.columbia.edu]
  - Internet Protocols for Real-Time Multimedia Communication [cse.wustl.edu]
- WebRTC
  - WebRTC FAQ [webrtc.org]
  - WebRTC MUST implement DTLS-SRTP [webrtchacks.com]
- WebSockets
  - What is the fundamental difference between WebSockets and pure TCP? [stackoverflow.com]
  - Differences between TCP sockets and web sockets, one more time [stackoverflow.com]
- Software & Libraries
  - Jain-SIP (aka JSip)
    - JSip project page [java.net]
    - How do I configure JAIN-SIP client to send traffic to a non-standard port? [stackoverflow.com]
    - Asterisk/JAIN-SIP why do I need to authenticate several times? [stackoverflow.com]
    - SIP Register Request using JAIN SIP [vkslabs.com]
    - Android Jain Sip - Sip Registration? [stackoverflow.com]
    - Jain-Sip Authentication [stackoverflow.com]
  - Kamailio [kamailio.org]
  - FreeSWITCH [freeswitch.org]
  - Asterisk [asterisk.org]
What Got Me Started?
I changed my home phone provider from Skype to Plivo, as Skype has been atrophying since the acquisition by Microsoft. We wanted a 'smarter' home phone system and needed a provider with a robust API to make that happen. Once we ported our number out of Skype and into Plivo, I noticed that Plivo did not support secure connections for SoftPhones. Whoops! They did provide a WebRTC soft-phone which allows for secure connections and calls, though, so I decided to trace how it worked with the goal of building a stand-alone phone app that would give me the same security.
Long story short: this was a couple of months ago, and in the meantime they rolled out TLS protection for the SIP signalling, which at least prevents people from performing telecom fraud against me. They also have things set up so that I can make secure outbound calls, but inbound calls are insecure. This ended up being 'good enough' for our home-phone setup since we weren't getting call privacy with Skype and wouldn't get it with a land-line anyway.
The Beginning
When I started looking into VOIP and WebRTC I didn't realize the different components that are currently needed to get things to talk to each other. I went in thinking that WebRTC used WebSockets and that it had built-in signalling for establishing calls and the like. My research went much better after I realized that WebRTC and WebSockets are about as similar as Java and JavaScript.
At a high level, here are the technologies & protocols I read about:
- WebRTC
- RTP
- SRTP
- RTSP
- RTCP
- DTLS
- SIP/SIPS
- SDP
- WebSocket/Secure Web Sockets
- ICE
- STUN
What does each of these technologies do?
Each technology listed above has a particular role to play in a VOIP ecosystem. Here's a brief description of each one:
SIP: Stands for 'Session Initiation Protocol'. Endpoints use SIP signalling to invite other endpoints to connect. SIP does not carry media like audio or video; it just provides a way for SoftPhones and other endpoints to indicate their intentions to each other. In a typical deployment, endpoints register with a SIP server (a registrar or proxy) so that other endpoints can find them. There is a secure variant known as SIPS, which uses TLS to secure the signalling.
SDP: It's a TLA for 'Session Description Protocol'. SDP is needed because signalling protocols like SIP don't have any idea about media streams. SDP lets endpoints negotiate the types of audio/video/media streams that can be supported for a call between SoftPhones and other endpoints. SDP Payloads can be delivered over SIP.
RTP: The 'Real-time Transport Protocol' is designed to deliver audio and video streams over networks. This is what actually lets you make a call and talk to someone on the other side. For RTP to work, a socket has to be established with whatever device/software is sending or receiving the media stream. There is a 'secure' variant available known as SRTP.
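To make the RTP description a little more concrete, here is a minimal Python sketch of the 12-byte fixed header that sits at the front of every RTP packet. It is not code from my prototype; the function names (`pack_rtp_header`, `parse_rtp_header`) are made up for illustration, and it assumes the simplest case: version 2, no padding, no header extension and no CSRC entries.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def pack_rtp_header(payload_type, sequence, timestamp, ssrc, marker=False):
    """Build the 12-byte fixed RTP header (assumes V=2, no padding/extension/CSRC)."""
    byte0 = 2 << 6                                    # version 2 in the top two bits
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack(
        "!BBHII",                                     # network byte order
        byte0,
        byte1,
        sequence & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )


def parse_rtp_header(packet):
    """Decode the fixed header from the first 12 bytes of an RTP packet."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=byte0 >> 6,
        padding=bool(byte0 & 0x20),
        extension=bool(byte0 & 0x10),
        csrc_count=byte0 & 0x0F,
        marker=bool(byte1 & 0x80),
        payload_type=byte1 & 0x7F,
        sequence=sequence,
        timestamp=timestamp,
        ssrc=ssrc,
    )
```

A sender would prepend `pack_rtp_header(...)` to each chunk of encoded audio and push the result out of a UDP socket, bumping the sequence number by one per packet. Payload type 0 is PCMU, which is the first codec offered in the Plivo sample at the end of this article.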
DTLS: Stands for 'Datagram Transport Layer Security'. It is similar to TLS, except that it works over UDP and other datagram transports. It is one mechanism that can be used to protect RTP-over-UDP streams; in WebRTC it is used to negotiate the keys for SRTP (DTLS-SRTP).
RTCP: The 'RTP Control Protocol'. This protocol does not transport any media stream data. It provides 'out of band' statistics and control information for an RTP session, which is useful for monitoring the Quality of Service of an RTP stream. Traditionally it runs over its own socket, separate from the RTP one, to an endpoint that consumes the RTCP data. There is a secure variant of this known as SRTCP.
RTSP: 'Real Time Streaming Protocol'. This one had me confused for a while since I'd get it mixed up with SRTP (the secure version of RTP). This protocol is used to establish and control media sessions and allows for commands like 'play', 'pause' and 'rewind'. RTSP does not deliver the media stream data itself, though. I only mention it here because it is a similarly named protocol that caused confusion when I was learning about the other underlying technologies used in VOIP.
STUN: 'Session Traversal Utilities for NAT'. This is a lightweight client-server protocol that lets an endpoint behind NAT discover the public address and port the outside world sees for it, which in turn helps real-time voice, video and messaging traverse NAT. Since NAT is very common at consumer endpoints, this is a useful technology.
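As an illustration of how small STUN is on the wire, here is a hedged Python sketch that builds a Binding request and decodes the XOR-MAPPED-ADDRESS attribute from a Binding success response. The function names are my own invention, only IPv4 is handled, and nothing here talks to a real server; you would still need to send the request over UDP to a STUN server of your choosing.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442       # fixed value carried in every STUN message
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id=None):
    """Return a 20-byte STUN Binding request with no attributes."""
    tid = transaction_id if transaction_id is not None else os.urandom(12)
    if len(tid) != 12:
        raise ValueError("STUN transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + tid


def parse_xor_mapped_address(response):
    """Return (ip, port) from a Binding success response, or None if absent."""
    if len(response) < 20:
        raise ValueError("STUN message shorter than its 20-byte header")
    msg_type, length, cookie = struct.unpack("!HHI", response[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding success response")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", response[offset:offset + 4])
        value = response[offset + 4:offset + 4 + attr_len]
        # value layout: reserved byte, family byte (0x01 = IPv4), X-Port, X-Address
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            port = xport ^ (MAGIC_COOKIE >> 16)
            ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return ip, port
        offset += 4 + attr_len + (-attr_len % 4)      # attributes are 4-byte aligned
    return None
```

The address that comes back is the 'outside' view of the socket that sent the request, which is what an endpoint would then advertise to its peer.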
ICE: 'Interactive Connectivity Establishment'. This is developed by the IETF and defines methods which allow for NAT traversal by peer-to-peer and multimedia data, streams and sessions.
WebRTC: Designed to provide Web Browsers with an easy way to establish 'Real Time Communication' with other browsers. As implemented by web browsers, it provides a simple JavaScript API which allows you to easily add remote audio or video calling to your web page or web app. As a set of technologies, it can be used outside of the browser even though that use case is not common yet.
WebSockets: A WebSocket is a lightweight, message-framed layer on top of TCP that makes working with the socket simpler. In the common use case (browsers) this makes it easier for JavaScript client code to work with sockets. WebSocket is not part of the HTTP specification; the only relationship between the two is an 'Upgrade' handshake, sent over HTTP, which tells the web server that the rest of the traffic on the connection will be WebSocket rather than HTTP. The nice thing about WebSocket is that you don't really have to be concerned about which ports are open between you and the target server, since the connection starts out as ordinary HTTP(S) traffic, usually on port 80 or 443, rather than on some screwball port. WebSockets can be secured by TLS.
In summary, there are at least 3 active socket connections needed for this system to work: SIP/SIPS, RTP/SRTP and RTCP/SRTCP. There is a lot more to it than just opening a socket and dumping audio-video data!
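The 'upgrade' step is easy to show in code. The sketch below, in Python, builds the HTTP request a client sends to switch a connection over to WebSocket and computes the `Sec-WebSocket-Accept` value the server has to answer with. The function names and the host are placeholders I made up; the `sip` subprotocol is the one registered for SIP over WebSockets.

```python
import base64
import hashlib

# Fixed GUID that the WebSocket handshake appends to the client's key.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_upgrade_request(host, path, key, subprotocol="sip"):
    """Build the HTTP request that asks the server to switch to WebSocket."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        f"Sec-WebSocket-Protocol: {subprotocol}",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"
```

A client checks that the server's `Sec-WebSocket-Accept` header equals `websocket_accept(key)` before treating the connection as a WebSocket. After that point, nothing on the connection is HTTP any more.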
How do these technologies fit together?
In order to successfully make VOIP or multimedia calls across the internet, you'll probably be dealing with NAT. So, you'll need to have a STUN server out there to help your SIP endpoints traverse their NAT for real-time calls.
You'll need a SIP server which all your endpoints register to so they can negotiate media streams using SDP. In order to actually make calls, your clients will need to know how to speak RTP/RTCP (or their secure forms).
My Attempted Implementation + The Future
I started implementing a Java client that would be able to talk to Plivo's WebSocket-SIP server using JSip (aka Jain-SIP). Using the information listed in the References section at the top of the article, I was able to get Java code that could connect to their WebSocket server, REGISTER and attempt an INVITE.
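For a sense of what a REGISTER actually looks like, here is a small Python sketch that assembles one as plain text. This is not my JSip code and it does not use any SIP library; `build_register` is a hypothetical helper, the user, domain and contact values are placeholders borrowed from the sample log at the end of the article, and it assumes TLS transport with an unauthenticated first attempt.

```python
import uuid


def build_register(user, domain, contact_host, transport="TLS", cseq=1, expires=600):
    """Assemble a bare SIP REGISTER request (first attempt, no credentials)."""
    branch = "z9hG4bK" + uuid.uuid4().hex    # branch IDs must start with this cookie
    tag = uuid.uuid4().hex[:10]
    call_id = uuid.uuid4().hex
    aor = f"sip:{user}@{domain}"             # the address being registered
    lines = [
        f"REGISTER sip:{domain} SIP/2.0",
        f"Via: SIP/2.0/{transport} {contact_host};branch={branch}",
        "Max-Forwards: 70",
        f"From: <{aor}>;tag={tag}",
        f"To: <{aor}>",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} REGISTER",
        f"Contact: <sip:{user}@{contact_host};transport={transport.lower()}>",
        f"Expires: {expires}",
        "Content-Length: 0",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"
```

Calling `build_register("mysoftphone17", "phone.plivo.com", "174.52.5.6:38331")` gives a message in the same shape as the traffic in the sample log. A real server will typically answer the first attempt with a 401 challenge, after which the client resends the request with credentials and a higher CSeq, which is why the authentication links in the References section were so useful.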
The other half of the equation consisted of me downloading and compiling Google's WebRTC code on a Linux VM. I was going to use that to handle the media streams and things like that. I didn't get too far down this path as I was focused largely on learning about the underlying technologies and protocols involved with VOIP in general.
I certainly learned a lot about these technologies. Maybe once they bounce around in my head for a while I can come back and successfully implement my stand-alone client. On the surface these technologies make sense (even if there are a lot of them). The difficult part in implementing a client is that you need to understand how it all fits together. Getting a complete, working model of all that in my head at once is problematic right now given the other projects I'm also a part of.
Sample SIP + SDP From Plivo
This is a SIP + SDP exchange that came through one of the SoftPhones I was testing. It shows that incoming calls from Plivo only offer insecure media: the SDP advertises plain RTP/AVP over UDP, even though the signalling itself arrives over TLS. I'm placing it here to jog my memory if I come back to this project later.
message: channel [07CD3218]: received [1391] new bytes from [TLS://phone.plivo.com:5061]:
INVITE sip:mysoftphone17@174.52.5.6:38331;transport=tls SIP/2.0
Record-Route:
Record-Route:
Via: SIP/2.0/TLS 54.241.2.206:5061;branch=z9hG4bKb388.e9144db53b5fa4ba82b6bd52b4def4f4.0
Via: SIP/2.0/UDP 191.201.61.52:5080;received=191.201.61.52;rport=5080;branch=z9hG4bKpU0yXc4v5eXNr
Max-Forwards: 40
From: "+15002309140" ;tag=Hygr09ggKmpjm
To:
Call-ID: 333b9ea8-3e5e-1233-4bb9-782bcb0446c4
CSeq: 72472445 INVITE
Contact:
User-Agent: Plivo
Allow: INVITE, ACK, BYE, CANCEL, OPTIONS, MESSAGE, INFO, UPDATE, REFER, NOTIFY
Supported: timer, precondition, path, replaces
Allow-Events: talk, hold, conference, refer
Privacy: none
Content-Type: application/sdp
Content-Disposition: session
Content-Length: 260
X-FS-Support: update_display,send_info
P-Asserted-Identity: "+15002309140"
P-hint: outbound
v=0
o=FreeSWITCH 1425589313 1425589314 IN IP4 191.201.61.52
s=FreeSWITCH
c=IN IP4 191.201.61.52
t=0 0
m=audio 27706 RTP/AVP 0 8 9 98 3 101
a=rtpmap:98 SPEEX/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=silenceSupp:off - - - -
a=ptime:20
message: channel [07CD3218] [1131] bytes parsed
message: Changing [server] [INVITE] transaction [07DCF590], from state [INIT] to [PROCEEDING]
message: channel [07CD3218]: message sent to [TLS://phone.plivo.com:5061], size: [439] bytes
SIP/2.0 100 Trying
Via: SIP/2.0/TLS 54.241.2.206:5061;branch=z9hG4bKb388.e9144db53b5fa4ba82b6bd52b4def4f4.0
Via: SIP/2.0/UDP 191.201.61.52:5080;received=191.201.61.52;rport=5080;branch=z9hG4bKpU0yXc4v5eXNr
From: "+15002309140" ;tag=Hygr09ggKmpjm
To:
Call-ID: 333b9ea8-3e5e-1233-4bb9-782bcb0446c4
CSeq: 72472445 INVITE
Content-Length: 0
message: New server dialog [03CE5368] , local tag [], remote tag [Hygr09ggKmpjm]
message: Dialog set from [00000000] to [03CE5368] for op [02F8DBA8]
message: new incoming call from ["+15002309140" ] to []
message: Found payload PCMU/8000 fmtp=
message: Found payload PCMA/8000 fmtp=
message: Found payload G722/8000 fmtp=
message: Found payload SPEEX/8000 fmtp=
message: Found payload GSM/8000 fmtp=
message: Found payload telephone-event/8000 fmtp=0-16
warning: searching for already_a_call_with_remote_address.
message: Doing SDP offer/answer process of type incoming
message: No match for G722/8000
message: No match for GSM/8000
message: channel [07CD3218]: message sent to [TLS://phone.plivo.com:5061], size: [541] bytes
SIP/2.0 488 Not acceptable here
Via: SIP/2.0/TLS 54.241.2.206:5061;branch=z9hG4bKb388.e9144db53b5fa4ba82b6bd52b4def4f4.0
Via: SIP/2.0/UDP 191.201.61.52:5080;received=191.201.61.52;rport=5080;branch=z9hG4bKpU0yXc4v5eXNr
From: "+15002309140" ;tag=Hygr09ggKmpjm
To: ;tag=XC7r0HZ
Call-ID: 333b9ea8-3e5e-1233-4bb9-782bcb0446c4
CSeq: 72472445 INVITE
User-Agent: Linphone/3.7.0 (belle-sip/1.3.0)
Supported: replaces, outbound
Content-Length: 0
message: Changing [server] [INVITE] transaction [07DCF590], from state [PROCEEDING] to [COMPLETED]
message: Dialog [03CE5368]: now updated by transaction [07DCF590].
message: Dialog [03CE5368] terminated for op [02F8DBA8]
message: channel [07CD3218]: received [420] new bytes from [TLS://phone.plivo.com:5061]:
ACK sip:mysoftphone17@174.52.5.6:38331;transport=tls SIP/2.0
Via: SIP/2.0/TLS 54.241.2.206:5061;branch=z9hG4bKb388.e9144db53b5fa4ba82b6bd52b4def4f4.0
Max-Forwards: 40
From: "+15002309140" ;tag=Hygr09ggKmpjm
To: ;tag=XC7r0HZ
Call-ID: 333b9ea8-3e5e-1233-4bb9-782bcb0446c4
CSeq: 72472445 ACK
Content-Length: 0
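For future me, here is a rough Python sketch that pulls the interesting bits out of the SDP body of the INVITE above: the media port, the RTP profile, and the codecs behind each payload number. The parser is deliberately naive and the names (`parse_audio_offer`, `STATIC_PAYLOAD_TYPES`) are my own. Treating any profile containing 'SAVP' as secure is a simplifying assumption rather than a complete check.

```python
# Well-known static payload numbers that appear in the sample offer.
STATIC_PAYLOAD_TYPES = {0: "PCMU/8000", 3: "GSM/8000", 8: "PCMA/8000", 9: "G722/8000"}

SAMPLE_SDP = """\
v=0
o=FreeSWITCH 1425589313 1425589314 IN IP4 191.201.61.52
s=FreeSWITCH
c=IN IP4 191.201.61.52
t=0 0
m=audio 27706 RTP/AVP 0 8 9 98 3 101
a=rtpmap:98 SPEEX/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=silenceSupp:off - - - -
a=ptime:20
"""


def parse_audio_offer(sdp):
    """Extract port, profile and codec names from the first audio m= line."""
    rtpmap = dict(STATIC_PAYLOAD_TYPES)
    media = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m=audio ") and media is None:
            _kind, port, proto, *formats = line[2:].split()
            media = {"port": int(port), "proto": proto,
                     "formats": [int(f) for f in formats]}
        elif line.startswith("a=rtpmap:"):
            # Dynamic payload numbers (96 and up) are named by rtpmap lines.
            number, encoding = line[len("a=rtpmap:"):].split(None, 1)
            rtpmap[int(number)] = encoding
    if media is None:
        raise ValueError("no audio media line found")
    media["codecs"] = [rtpmap.get(pt, f"unknown/{pt}") for pt in media["formats"]]
    media["secure"] = "SAVP" in media["proto"]   # assumption: SAVP/SAVPF mean SRTP
    return media
```

Running `parse_audio_offer(SAMPLE_SDP)` reports the profile as `RTP/AVP` with `secure` set to `False`, and lists the codecs in the same order as the 'Found payload' lines in the log.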